NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0

Building with the spack-stack unified environment on non-production machines #589

Closed DavidHuber-NOAA closed 11 months ago

DavidHuber-NOAA commented 1 year ago

With the upgrade to HDF5 version 1.14.0 ongoing (#563), this would be a good time to move to the spack-stack unified environment, which also incorporates this version of HDF5 and netCDF version 4.9.2. Some additions are needed to the unified environment (in particular, ncio, ncdiag, and bufr), but these can be added manually as they already exist in spack. This is also desired for the upcoming UFS SRW release, which will incorporate the GSI.

DavidHuber-NOAA commented 1 year ago

I have established a testing environment on Hera referencing a test build of the spack-stack unified environment. So far, this is working well. All regression tests run to completion. I am running the regression tests against the branch to be merged in #571. There are some differences between the two, with the hwrf, global_4denvar, global_3denvar, global_3dvar, and rtma tests all failing cost and/or penalty comparison tests, with values differing in the 1-5% range on the first iteration.

One potential source for differences is a newer version of the BUFR library. The spack-stack test install is running version 12.0.0, whereas the hpc-stack-based install is compiled against version 11.7.0. Running the older version with spack-stack would determine whether this is a source of the differences.

DavidHuber-NOAA commented 1 year ago

@arunchawla-NOAA @RussTreadon-NOAA @hu5970 @ulmononian @mark-a-potts I ran regression tests on Hera with spack-stack version 1.4.1 + ncdiag/1.1.1 and bufr/11.7.0. The large differences that I was seeing with bufr/12.0.0 are now on the order of 10^4 smaller, so it seems that the new version of bufr is producing quite different results. Is it known if this difference is expected? And if so, would it be desirable to upgrade to bufr/12.0.0 at this time, or should I continue incorporating spack-stack with bufr/11.7.0?

RussTreadon-NOAA commented 1 year ago

Dave ran ctests on Hera using develop built with bufr/11.7.0 and develop built with bufr/12.0.0. He reported large differences in the minimization.

Output from the loproc global_4denvar for updat (bufr/12.0.0) and contrl (bufr/11.7.0) was examined. Initial differences are limited to prepbufr temperature observations.

The initial temperature (obs-ges) innovation statistics (fort.203) differ for prepbufr report types 120, 132, 180, and 182. These report types are virtual temperatures. Bias and rms are larger for observations closer to the surface. Differences decrease moving up in the atmosphere. Innovation statistics for report types 130, 131, 133, 134, 136, and 136 are identical. These report types are sensible temperatures.

This suggests something odd happens when processing virtual temperature observations with bufr/12.0.0.

Prints added to src/gsi/read_prepbufr.f90 reveal that when gsi.x is built with bufr/12.0.0 the following call to ufbqcd

! Obtain program code (VTCD) associated with "VIRTMP" step
  call ufbqcd(lunin,'VIRTMP',vtcd)

returns vtcd=0.395253-322 for sonde temperature obs. (I used g13.6 formatting and 0.395253-322 is printed to stdout)

Recompiling the code with bufr/11.7.0 and running the same case returns vtcd=8.00000 for sonde temperature obs.

Do we need to change or modify the call to ufbqcd or is a different code change needed when moving to bufr/12.0.0?

RussTreadon-NOAA commented 1 year ago

Writing vtcd without formatting, the bufr/12.0.0 run has 3.952525166729972E-323.

Replace

 ! Obtain program code (VTCD) associated with "VIRTMP" step
  call ufbqcd(lunin,'VIRTMP',vtcd)

with

 ! Obtain program code (VTCD) associated with "VIRTMP" step
  call ufbqcd(lunin,'VIRTMP',ivtcd)
  vtcd = ivtcd

in src/gsi/read_prepbufr.f90. Recompile gsi.x with bufr/12.0.0 and run the global_3dvar ctest. This time vtcd has the expected value of 8.00000000000000.

The global_3dvar ctest failed

Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -R global_3dvar
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
    Start 1: global_3dvar
1/1 Test #1: global_3dvar .....................***Failed  2039.84 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 2039.92 sec

The following tests FAILED:
          1 - global_3dvar (Failed)

This failure, however, was due to the maximum allowable threshold time check. The updat bufr/12.0.0 gsi.x ran slower than the contrl bufr/11.7.0 gsi.x.

global_3dvar_hiproc_contrl/stdout:The total amount of wall time                        = 321.045029
global_3dvar_hiproc_updat/stdout:The total amount of wall time                        = 358.756143
global_3dvar_loproc_contrl/stdout:The total amount of wall time                        = 474.135176
global_3dvar_loproc_updat/stdout:The total amount of wall time                        = 512.837189

This may reflect the additional prints in the bufr/12.0.0 read_prepbufr.f90.

Comparison of the analysis results shows that the bufr/12.0.0 run reproduces the bufr/11.7.0 run. This result is reasonable. Changing the bufr library version should not change analysis results.

Not sure why replacing a real(8) argument with an integer(4) argument was needed in this call to ufbqcd. Similar changes were not made elsewhere. Thus, not confident the above change is the best change.
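A minimal standalone sketch of one consistent explanation, assuming (as discussed later in this thread) that ufbqcd now stores an integer into its third argument: writing the integer value 8 into storage that the caller still treats as real(8) reinterprets that bit pattern as the denormal value seen above.

! Hypothetical illustration, not GSI or bufrlib code: reinterpret the bit
! pattern of the 64-bit integer 8 as an 8-byte real.
program vtcd_bits_demo
  use iso_fortran_env, only: int64, real64
  implicit none
  real(real64) :: vtcd
  vtcd = transfer(8_int64, vtcd)   ! same bits, different type
  print *, vtcd                    ! prints approximately 3.95E-323
end program vtcd_bits_demo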

My working directories on Hera are

global_3dvar ctest results are in /scratch1/NCEPDEV/stmp2/Russ.Treadon/bufr12/tmpreg_global_3dvar

Other ctests have not yet been run with the above change.

RussTreadon-NOAA commented 1 year ago

Hera ctests with bufr/12.0.0

Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -j 9
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............***Failed  974.84 sec
2/9 Test #5: hwrf_nmm_d3 ......................***Failed  992.72 sec
3/9 Test #7: rrfs_3denvar_glbens ..............   Passed  1037.49 sec
4/9 Test #4: hwrf_nmm_d2 ......................   Passed  1039.35 sec
5/9 Test #9: global_enkf ......................***Failed  1202.61 sec
6/9 Test #6: rtma .............................***Failed  2002.66 sec
7/9 Test #3: global_4denvar ...................***Failed  2224.99 sec
8/9 Test #2: global_4dvar .....................   Passed  2230.50 sec
9/9 Test #1: global_3dvar .....................***Failed  2392.88 sec

33% tests passed, 6 tests failed out of 9

Total Test time (real) = 2393.03 sec

The following tests FAILED:
          1 - global_3dvar (Failed)
          3 - global_4denvar (Failed)
          5 - hwrf_nmm_d3 (Failed)
          6 - rtma (Failed)
          8 - netcdf_fv3_regional (Failed)
          9 - global_enkf (Failed)

All failures except rtma were due to timing checks. Apart from rtma, the bufr/12.0.0 analyses are identical to their bufr/11.7.0 counterparts. For several failed tests, the bufr/12.0.0 executable ran slower than the bufr/11.7.0 executable for one or more of the configurations. For example, global_3dvar

tmpreg_global_3dvar/global_3dvar_hiproc_contrl/stdout:The total amount of wall time                        = 326.543222
tmpreg_global_3dvar/global_3dvar_hiproc_updat/stdout:The total amount of wall time                        = 359.565145
tmpreg_global_3dvar/global_3dvar_loproc_contrl/stdout:The total amount of wall time                        = 467.854069
tmpreg_global_3dvar/global_3dvar_loproc_updat/stdout:The total amount of wall time                        = 509.382080

and global_4denvar

tmpreg_global_4denvar/global_4denvar_hiproc_contrl/stdout:The total amount of wall time                        = 373.331819
tmpreg_global_4denvar/global_4denvar_hiproc_updat/stdout:The total amount of wall time                        = 371.220896
tmpreg_global_4denvar/global_4denvar_loproc_contrl/stdout:The total amount of wall time                        = 385.547573
tmpreg_global_4denvar/global_4denvar_loproc_updat/stdout:The total amount of wall time                        = 434.971107

The rtma failure shows different initial innovation statistics for temperature report types 197 and 199. Find another call to ufbqcd in read_prepbufr.f90:

 !see if file contains GLERL program code (GLCD)
 !Obtain code if it exists.  Otherwise set to missing (-999)
   call status(lunin,lindx,idummy1,idummy2)
   call nemtab(lindx,'GLERL',idummy1,cdummy,glret)
   if (glret /= 0) then
     call ufbqcd(lunin,'GLERL',glcd)

As a test, replace glcd with iglcd in the calling list and set glcd from iglcd after the call:

 !see if file contains GLERL program code (GLCD)
 !Obtain code if it exists.  Otherwise set to missing (-999)
   call status(lunin,lindx,idummy1,idummy2)
   call nemtab(lindx,'GLERL',idummy1,cdummy,glret)
   if (glret /= 0) then
     call ufbqcd(lunin,'GLERL',iglcd)
     glcd = iglcd

Recompile and rerun rtma test. This time the rtma test passed.

Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -R rtma
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
    Start 6: rtma
1/1 Test #6: rtma .............................   Passed  1278.39 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1278.40 sec

Stop here and let others follow up / investigate more.

DavidHuber-NOAA commented 1 year ago

Also tagging @jbathegit @edwardhartnett @wx20jjung @dtkleist for awareness.

arunchawla-NOAA commented 1 year ago

@DavidHuber-NOAA and @RussTreadon-NOAA thanks for your tests on this! For now I would focus on building and testing with spack-stack using bufr/11.7.0. I am going to bring this to the attention of @jbathegit @jack-woollen and @edwardhartnett to see if they can throw more light on Russ's results.

jbathegit commented 1 year ago

It looks like you figured the issue out on your own. This was indeed a planned change within v12.0.0, as noted in the release notes...

This was done intentionally as part of the simplification down to one library build. Note that in previous versions of ufbqcd, the 3rd argument returned was just a real, not an explicit real*8. So it was only a real*8 in the _8 and _d builds because of the compile flags for those builds, and not because of some intrinsic need within the library itself. The return value of the 3rd argument is never larger than 255 so it fits easily inside of any real, and in fact it's always a whole number so there's really no reason not to return it as an integer and make everything simpler.

jack-woollen commented 1 year ago

I feel the need to contribute my two cents in support of backward compatibility with downstream codes as an important principle in software management. Weighing the reasons for and against a change like this, it seems to me there's not a good enough compensating reason to make it.

jbathegit commented 1 year ago

I hear what you're saying Jack, and I totally agree that backwards compatibility is an important principle in software management. But in this case, this is no different than where we've already agreed that _8 users will now need to make a change to their application codes to add a call to setim8b, and to make sure that all calls to library functions which return integers are now explicitly declared as integer*4.

In other words, with a major release such as this (where the X of the version X.Y.Z is changing), it's understood that downstream users may need to make some adjustments to their application codes, and this ufbqcd change is just another example of that, and we've already documented this in the accompanying release notes for _8 and _d users.

Note also that, if we hadn't included this ufbqcd change, then as software managers we'd have had to instead develop new internal routines similar to x48 and x84, but for reals instead of integers, since this routine (along with its sister routine ufbqcp) were the only cases of user-callable routines where a real argument was being passed that wasn't already explicitly declared as real*8. And we couldn't have just changed the library to now make that 3rd argument an explicit real*8, because doing so would have similarly impacted any existing _4 users. So bottom line, no matter what we did here, it would have impacted some of our users, and this is just one of the consequences of our earlier decision to eliminate the previous _8 and _d builds and consolidate everything down to a single _4 build.
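As a rough illustration of the adjustments described above for an application built with 8-byte default integers (a sketch only; the unit number and file name are placeholders, and only the setim8b call and the integer*4 function declaration come from the guidance quoted above):

! Sketch for an application compiled with 8-byte default integers (e.g. -i8).
program setim8b_sketch
  implicit none
  integer(4), external :: ireadmg        ! integer-returning bufrlib functions
                                         ! declared explicitly as integer*4
  integer :: lunit, idate                ! default integers are 8 bytes under -i8
  character(len=8) :: subset

  call setim8b(.true.)                   ! tell the single bufr/12 build that this
                                         ! code passes 8-byte integer arguments
  lunit = 11
  open(lunit, file='example.bufr', form='unformatted', status='old')
  call openbf(lunit, 'IN', lunit)
  if (ireadmg(lunit, subset, idate) == 0) print *, 'first message:', subset, idate
  call closbf(lunit)
end program setim8b_sketch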

jbathegit commented 1 year ago

On a separate but related note, I did note the discussion where some of the spack-stack test cases appear to be running slower with the new 12.0.0 build than with the previous 11.7.0 build. That's definitely not a feature that we intentionally introduced into the new build, nor is it something that we saw in our own regression testing for the library, so if it's still an issue of concern then we'd definitely like to better understand what part(s) of the library may be responsible. Is there any way that some sort of profiling tool might better pinpoint where those differences are (e.g. in which library functions or lines of code) for your test cases?

DavidHuber-NOAA commented 1 year ago

I ran regression tests with spack-stack/1.4.1 on Hera against the current develop branch with the following results:

1/9 Test #7: rrfs_3denvar_glbens ..............   Passed  1293.51 sec
2/9 Test #4: hwrf_nmm_d2 ......................***Failed  1592.51 sec
3/9 Test #5: hwrf_nmm_d3 ......................   Passed  1601.61 sec
4/9 Test #9: global_enkf ......................   Passed  1616.37 sec
5/9 Test #8: netcdf_fv3_regional ..............***Failed  1651.17 sec
6/9 Test #6: rtma .............................   Passed  2195.24 sec
7/9 Test #3: global_4denvar ...................***Failed  2551.00 sec
8/9 Test #2: global_4dvar .....................***Failed  2608.58 sec
9/9 Test #1: global_3dvar .....................***Failed  2730.67 sec

Looking into the netcdf_fv3_regional case, differences first appear in initial radiance penalties. This seems like a CRTM issue, though I am having trouble tracking it down.

The CRTM directories are:

- hpc-stack: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/crtm/2.4.0
- spack-stack: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-wpiygpl

The CRTM fix directories are:

- hpc-stack: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/crtm/2.4.0/fix
- spack-stack: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-fix-2.4.0_emc-zvsufvh

The contents of the fix directories are identical. The hpc-stack source code for CRTM is located here: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/crtm-v2.4.0, which is identical to the version cloned by spack-stack (github.com/JCSDA/crtm version 2.4.0; hash 5ddd0d6).

So there doesn't seem to be an obvious difference in CRTM. The library versions do differ between the two stacks. In particular, spack-stack uses hdf5/1.14.0 and netcdf/4.9.2, its nemsio is still at 2.5.2, and it includes w3nco, but I don't think these should have an impact.

hpc-stack

  1) miniconda3/4.12.0    9) netcdf/4.7.4  17) nemsio/2.5.4
  2) hpc/1.2.0           10) bufr/11.7.0   18) wrf_io/1.2.0
  3) intel/2022.1.2      11) bacio/2.4.1   19) ncio/1.1.2
  4) hpc-intel/2022.1.2  12) w3emc/2.9.2   20) crtm/2.4.0
  5) impi/2022.1.2       13) sp/2.3.3      21) ncdiag/1.1.1
  6) hpc-impi/2022.1.2   14) ip/3.3.3      22) gsi_common
  7) cmake/3.20.1        15) sigio/2.3.2   23) prod_util/1.2.2
  8) hdf5/1.10.6         16) sfcio/1.4.1   24) gsi_hera.intel

spack-stack

  1) intel/2022.1.2                   12) zstd/1.5.2            23) nemsio/2.5.2
  2) stack-intel/2021.5.0             13) netcdf-c/4.9.2        24) wrf-io/1.2.0
  3) impi/2022.1.2                    14) netcdf-fortran/4.6.0  25) ncio/1.1.2
  4) stack-intel-oneapi-mpi/2021.5.1  15) bufr/11.7.0           26) crtm-fix/2.4.0_emc
  5) miniconda/3.9.12                 16) bacio/2.4.1           27) git-lfs/2.10.0
  6) stack-python/3.9.12              17) w3emc/2.9.2           28) crtm/2.4.0
  7) cmake/3.23.1                     18) sp/2.3.3              29) gsi-ncdiag/1.1.1
  8) curl/7.29.0                      19) ip/3.3.3              30) gsi_common
  9) pkg-config/0.27.1                20) sigio/2.3.2           31) prod-util/1.2.2
 10) zlib/1.2.13                      21) sfcio/1.4.1           32) gsi_hera.intel
 11) hdf5/1.14.0                      22) w3nco/2.4.1

@RussTreadon-NOAA Would you mind taking a look at the RTs on Hera? My regression test directory is in /scratch1/NCEPDEV/nems/David.Huber/GSI/gsi_spackstack/build/regression and the working directories are in /scratch1/NCEPDEV/stmp/David.Huber/ptmp_spack.

RussTreadon-NOAA commented 1 year ago

A check of the run directories shows the crtm coefficients to be identical between the updat and contrl runs.

A check of fort.207 shows initial radiance penalty differences between the updat and contrl for the following satellite_sensors: amsua_n15, amsua_n18, amsua_n19, atms_npp, atms_n20. These are all microwave sensors. However, other microwave satellite_sensors have identical initial penalties between the updat and contrl (e.g., mhs_n19, amsua_metop-b, amsua_metop-c, mhs_metop-c).

Are the same compiler flags used to build the hpc-stack and spack-stack crtm libraries?

DavidHuber-NOAA commented 1 year ago

@AlexanderRichert-NOAA Could you point me to the build log for spack-stack/1.4.1 CRTM/2.4.0 on Hera? I'm seeing some differences between it and the hpc-stack version.

AlexanderRichert-NOAA commented 1 year ago

Sure-- /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-wpiygpl/.spack/spack-build-out.txt (and there's other build-related info in that directory)

AlexanderRichert-NOAA commented 1 year ago

I won't look at it any further unless you'd like me to, but for what it's worth the difference that stands out to me is the CMAKE_BUILD_TYPE (hpc-stack=Release, spack-stack=RelWithDebInfo), and we've seen some differences in other codes based on that setting.

DavidHuber-NOAA commented 1 year ago

Thanks @AlexanderRichert-NOAA, and thanks @RussTreadon-NOAA for the question. A few obvious differences I see are in the compiler optimization flags between the two CRTM builds.

So optimizing with -O2 instead of -O3 is a probable culprit. I will try building my own CRTM with spack-stack libraries and CMAKE_BUILD_TYPE=Release to see if that resolves the differences.

DavidHuber-NOAA commented 1 year ago

After building the CRTM with spack-stack libraries and CMAKE_BUILD_TYPE=Release on Hera, I rebuilt the GSI with spack-stack libraries and this custom CRTM, then ran regression tests against develop with the following results:

1/9 Test #9: global_enkf ......................   Passed  881.88 sec
2/9 Test #8: netcdf_fv3_regional ..............   Passed  909.52 sec
3/9 Test #7: rrfs_3denvar_glbens ..............***Failed  913.31 sec
4/9 Test #5: hwrf_nmm_d3 ......................   Passed  919.42 sec
5/9 Test #4: hwrf_nmm_d2 ......................   Passed  975.22 sec
6/9 Test #2: global_4dvar .....................***Failed  1748.62 sec
7/9 Test #6: rtma .............................***Failed  1938.87 sec
8/9 Test #3: global_4denvar ...................***Failed  1992.34 sec
9/9 Test #1: global_3dvar .....................***Failed  2111.99 sec

The RRFS and RTMA tests failed scalability checks. The global_4denvar and global_3dvar both failed memthresh and maxmem tests, both by a factor of ~2 over the maximum allowable memory:

The memory for global_4denvar_loproc_updat is 5188292 KBs.  This has exceeded maximum allowable hardware memory limit of 2516582 KBs, resulting in Failure maxmem of the regression test.
The memory for global_3dvar_loproc_updat is 4241260 KBs.  This has exceeded maximum allowable hardware memory limit of 2516582 KBs, resulting in Failure maxmem of the regression test.

I'm not sure if this is fatal or not, but regardless it does seem like a very large increase in memory. I'm also not really sure how to debug this.

Lastly, for the global_4dvar test, the contrl and updat siginc files differed. Running ncdump on both files and comparing values revealed the relative differences are on the order of 10^-6 for multiple variables. I verified that the penalties for both runs were identical, which suggests that these differences are related to the netCDF writes. Knowing that netCDF 4.9.2 allows for lossy compression, I checked the file sizes, but they are very similar (hpc-stack: 39487168 vs spack-stack: 39484152) and I believe this option needs to be explicitly invoked, so that doesn't seem to be the culprit. To verify if this is a netCDF 4.9.2 issue, I will run the regression tests again using hpc-stack with netcdf/4.9.2.

DavidHuber-NOAA commented 1 year ago

Compiling the develop branch with netcdf/4.9.2 did not have an effect on the global_4dvar siginc files. The differences are still there. I will need to dig in and find which other libraries are involved in the creation of these files -- probably ncio and possibly w3emc.

DavidHuber-NOAA commented 1 year ago

Rebuilding the ncio library with spack-stack-built netcdf/4.9.2 and hpc-stack build options (Release, instead of the spack-stack RelWithDebInfo) resulted in identical global_4denvar siginc files between the spack-stack and the develop GSI branches.

I believe the mysteries are solved now. Optimization differences between spack-stack and hpc-stack result in some small numerical differences, particularly for the crtm and ncio libraries. I will go ahead and open a PR to bring these changes in.

AlexanderRichert-NOAA commented 1 year ago

For whatever it's worth, that RelWithDebInfo will go away in the next spack-stack release, as spack recently changed the default to Release.

DavidHuber-NOAA commented 1 year ago

That's good to know. Thanks @AlexanderRichert-NOAA!

DavidHuber-NOAA commented 1 year ago

I found an issue in gsi-ncdiag where setting the HDF5 chunk size to 16GB when opening a netCDF file in append mode causes maxmem failures. This is a problem with HDF5 1.14.0, but not 1.10.6. A new version of gsi-ncdiag will need to be installed on all platforms under spack-stack to resolve this issue. https://github.com/NOAA-EMC/GSI-ncdiag/issues/7.

DavidHuber-NOAA commented 1 year ago

@jbathegit @RussTreadon-NOAA @aerorahul I made Russ's fixes for BUFR v12.0.0 within read_prepbufr.f90 and reran regression tests against develop. Also in this round of regression tests, CRTM and NCIO were built using Release rather than the spack-stack 'RelWithDebInfo' to maintain consistent results. The global_3dvar and global_4denvar tests failed the timing check with the following reported runtimes (apologies for losing the global_3dvar_hiproc_contrl time):

| Test | Time Threshold | Runtime |
| --- | --- | --- |
| global_4denvar_loproc_contrl | 406.458776 | 363.714842 |
| global_4denvar_loproc_updat | 406.458776 | 406.554885 |
| global_4denvar_hiproc_contrl | 321.898368 | 285.175113 |
| global_4denvar_hiproc_updat | 321.898368 | 327.356095 |
| global_3dvar_loproc_contrl | 519.659476 | 467.417641 |
| global_3dvar_loproc_updat | 519.659476 | 517.557172 |
| global_3dvar_hiproc_contrl | 330.294649 | --missing-- |
| global_3dvar_hiproc_updat | 330.294649 | 338.319537 |

The global_4dvar test also failed because of differing results in the siginc files. However, the differences are only in the HDF5 headers; all data and netCDF4 headers are identical (a known difference).

I then compiled BUFR 12.0.0 with Release (mimicking hpc-stack) and reran the same two tests. Now, the runtimes are much improved and only the global_4denvar_hiproc_updat failed its timethresh test by 0.2 seconds:

| Test | Time Threshold | Runtime |
| --- | --- | --- |
| global_4denvar_loproc_contrl | 406.458776 | 369.507979 |
| global_4denvar_loproc_updat | 406.458776 | 393.796189 |
| global_4denvar_hiproc_contrl | 321.898368 | 292.634880 |
| global_4denvar_hiproc_updat | 321.898368 | 322.139538 |
| global_3dvar_loproc_contrl | 519.659476 | 472.417706 |
| global_3dvar_loproc_updat | 519.659476 | 510.610903 |
| global_3dvar_hiproc_contrl | 330.294649 | 300.267863 |
| global_3dvar_hiproc_updat | 330.294649 | 324.827100 |

Lastly, to establish a baseline, I ran just the global_4denvar and global_3dvar tests with BUFR 11.7.0 (built with Release for consistency), which resulted in the following times:

| Test | Time Threshold | Runtime |
| --- | --- | --- |
| global_4denvar_loproc_contrl | 406.458776 | 370.379223 |
| global_4denvar_loproc_updat | 406.458776 | 367.325278 |
| global_4denvar_hiproc_contrl | 321.898368 | 294.912487 |
| global_4denvar_hiproc_updat | 321.898368 | 295.425141 |
| global_3dvar_loproc_contrl | 519.659476 | 467.687210 |
| global_3dvar_loproc_updat | 519.659476 | 470.499224 |
| global_3dvar_hiproc_contrl | 330.294649 | 301.406168 |
| global_3dvar_hiproc_updat | 330.294649 | 303.492537 |

So it does appear that the upgrade to BUFR 12.0.0 slows processing by 20-40 seconds when fully optimized.

Next, I would like to run a profiling tool on the GSI to identify the BUFR hot spots, but I am trying to find a good profiler to use. Compiling with -p is causing a crash when trying to read netCDF files in parallel. Do you have a better suggestion @jbathegit?

jbathegit commented 1 year ago

Thanks @DavidHuber-NOAA for this update!

To answer your question, no unfortunately I don't have any better suggestion on how to narrow this down, because I don't have any personal experience with profiling tools. From a code management standpoint, I would certainly support any improvements that could be made to the library to make it run more efficiently for your and other application codes, but again I'm just not sure how much help I could be in figuring out where exactly those issues may be happening.

jbathegit commented 1 year ago

BTW, and just curious - for the codes which are showing the large increase in run times, do those happen to be compiled using 8-byte integers, such that you now need to use the setim8b switch?

AlexanderRichert-NOAA commented 1 year ago

@DavidHuber-NOAA for what it's worth, I think gprof is probably the easiest one to get started with. With gcc or intel, I think you should be able to add the "-pg" to your compile flags, then see steps 2 and 3 here: https://www.thegeekstuff.com/2012/08/gprof-tutorial/ (basically you run the executable then run it with gprof; this might take some tinkering to get it to work with MPI)

DavidHuber-NOAA commented 1 year ago

@AlexanderRichert-NOAA Thanks for the suggestion. Unfortunately, when trying to run after compiling with -pg, the GSI crashes in a netCDF routine. I don't have the log messages anymore, so I can't say at the moment exactly what they were. I am trying out VTune presently, which at least ran to completion. I will let you know when I can get something meaningful out of the results.

@jbathegit No, I don't believe it is using 8-byte integers in any of the BUFR calls. There are only a handful of modules that use that length and they don't deal with reading/processing BUFR files. Thanks for the heads up on the new feature, though.

DavidHuber-NOAA commented 1 year ago

Per a conversation with @aerorahul, the spack-stack upgrade should proceed with BUFR 11.7.0. An upgrade to BUFR 12.0.0 can be handled at a later time. I will open an issue for that shortly.

jack-woollen commented 1 year ago

@jbathegit You might've thought of this already, but just to be sure, it's possible the extra if tests for im8b may be causing the slowdown. One way to check is to just remove them all in a test code and see what happens.

edwardhartnett commented 1 year ago

@jbathegit I suggest you write a quick test which captures this performance problem...

jbathegit commented 1 year ago

I'm not sure how to do that if we don't even know what the performance problem is, or what part of the library it may be lurking in. It's not easy to capture something when you don't even know what it is you're trying to catch. I was hoping that the above 4denvar and 3dvar examples would be able to provide some clues to where any performance inefficiencies may be.

Or, if you're referring to Jack's suggestion to just remove all of the im8b checks from a test version of the library, then that wouldn't be a "quick test" either b/c that's over 100 different routines that would need to be changed. And even if that did turn out to be a noticeable time sink, I'm not sure how we could improve on it anyway, since David already noted that he believed there weren't any 8-byte integers in any of their codes which interact with the library, and therefore it just ends up being a one-line "if" test whenever it hits any of those library routines.

Bottom line, I'm all for finding and fixing any such problems, but I really don't know how to narrow it down at this point and feel like I'd be searching for a needle in a haystack.

jack-woollen commented 1 year ago

Right. If the if tests are causing the slowdown, then the ones in the lower-level routines which get called more often (internally) would take the most time. They would be the ones to start looking at, not necessarily all 100+ routines containing if tests. I can delve into it when I get a chance. Who knows, it may have nothing to do with if tests. We'll see. It does need to be sorted though.

edwardhartnett commented 1 year ago

Recently I raised the topic of performance tests for BUFR, and this seems like exactly what is needed.

Take a task similar to the one that is now causing the performance problem. Write a test which performs that task, and time it.

Then backport that test to the previous version of bufrlib. If it runs slower on the 12.x release, then you have isolated your performance problem.

The test then remains part of the test code, ready to be used again for next release.

jbathegit commented 1 year ago

> Take a task similar to the one that is now causing the performance problem. Write a test which performs that task, and time it.

That's the issue and exactly the point I was trying to make - we don't really know what the performance problem is, because it's apparently happening somewhere deep down inside the bowels of the global_4denvar and global_3dvar codes. Those are large and presumably very complex codes which can't really be emulated by writing a separate shorter test, especially since we have no idea which of the undoubtedly numerous "tasks" inside of those codes may be the real culprit that we're trying to isolate here.

In other words, how do we write a "test which performs that task", when we don't even know what that task is? Unless by "task" you mean the entire global data assimilation, but I presume you're not suggesting that we try to add those entire codes to our library test base.

edwardhartnett commented 1 year ago

Instrument the bufr code so you know what is being called, and then run the problem application.

Now write a test that does those same calls, but with phony data.

I do this all the time with the netCDF library. In fact, most of the metadata calls can probably be ignored, making this task simpler.

Alternatively, or additionally, write some performance tests for bufrlib performing commonly-done tasks. Perhaps the answer will pop out; if not you will have some useful tests to monitor performance of future releases.

This discussion should probably be continued in the NCEPLIBS-bufr issue list.

jack-woollen commented 1 year ago

@DavidHuber-NOAA I'm looking at the timing tables, trying to understand what role bufrlib plays in the differences. In the reconfigured tables below, the control runs show very little difference and the update runs show quite a bit. Should I assume bufrlib was not a big part, or any part, of the control runs, and was involved in the update runs? It would help to know the details of what processes make up each of the two types of tests. Thanks.


| Test | Threshold | Runtime ctl | Runtime #1 | Runtime #2 | rt1-ct | rt2-ct | rt1-rt2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| global_4denvar_loproc_updat | 406.46 | 367.33 | 406.55 | 393.80 | 39.23 | 26.47 | 12.76 |
| global_4denvar_hiproc_updat | 321.90 | 295.43 | 327.36 | 322.14 | 31.93 | 26.71 | 5.22 |
| global_3dvar_loproc_updat | 519.66 | 470.50 | 517.56 | 510.61 | 47.06 | 40.11 | 6.95 |
| global_3dvar_hiproc_updat | 330.29 | 303.49 | 338.32 | 324.83 | 34.83 | 21.33 | 13.49 |
| total | 1578.31 | 1436.74 | 1589.79 | 1551.37 | 153.05 | 114.63 | 38.41 |

| Test | Threshold | Runtime ctl | Runtime #1 | Runtime #2 | rt1-ct | rt2-ct | rt1-rt2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| global_4denvar_loproc_contrl | 406.46 | 370.38 | 363.71 | 369.51 | -6.66 | -0.87 | -5.79 |
| global_4denvar_hiproc_contrl | 321.90 | 294.91 | 285.18 | 292.63 | -9.74 | -2.28 | -7.46 |
| global_3dvar_loproc_contrl | 519.66 | 467.69 | 467.42 | 472.42 | -0.27 | 4.73 | -5.00 |
| global_3dvar_hiproc_contrl | 330.29 | 301.41 | 300.27 | 300.27 | -1.14 | -1.14 | 0.00 |
| total | 1578.31 | 1434.39 | 1416.58 | 1434.83 | -17.81 | 0.44 | -18.25 |

DavidHuber-NOAA commented 1 year ago

@jack-woollen Yes, the contrl runs were all built with hpc-stack bufr/11.7.0. Only the updat runs were used to test various builds of bufr (including the ctl which was also built with bufr/11.7.0). The best apples-to-apples comparison of bufr/11.7.0 and bufr/12.0.0 runtimes in the table you provided above is probably updat #2 (bufr/12.0.0 with -O3 optimization) and updat ctl (bufr/11.7.0 with -O3 optimization).

jack-woollen commented 1 year ago

@DavidHuber-NOAA I see. Okay, again, which programs are being run for these tests? Is it just the GSI, or any others, i.e., prepobs, etc.?

DavidHuber-NOAA commented 1 year ago

@jack-woollen For the global_4denvar and global_3dvar tests, it is just the GSI. Prepobs is not called as the data is already staged on each system and linked in at test runtime.

jack-woollen commented 1 year ago

@jbathegit @edwardhartnett I just ran a simple test on wc2 reading a large satellite file, which is basically all the gsi does with bufrlib, and got much the same result David did in his tests. With bl11 it took 2m5s, with bl12 2m45s. Timing changes a little from run to run, but it generally takes something like 40s longer with bl12. My test literally just read the data using ireadmg and ireadsb and then stopped (a sketch of such a test is below). I'll proceed to comment out the im8b test blocks for the routines involved in the read process and see what happens. The fix may just be something like separate paths through the lower-level routines for the 4- and 8-byte cases. Depends on how many routines that turns out to be.
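A rough sketch of such a read-only timing test (an illustration under assumptions, not the actual test code; the file name is a placeholder):

! Open a large BUFR file, walk every message and subset with ireadmg/ireadsb,
! and report the elapsed time.
program bufr_read_timing
  implicit none
  integer, parameter :: lunit = 11
  integer :: ireadmg, ireadsb, idate, nmsg, nsub
  character(len=8) :: subset
  real :: t0, t1

  open(lunit, file='satellite.bufr', form='unformatted', status='old')
  call openbf(lunit, 'IN', lunit)

  nmsg = 0
  nsub = 0
  call cpu_time(t0)
  do while (ireadmg(lunit, subset, idate) == 0)   ! advance to the next BUFR message
     nmsg = nmsg + 1
     do while (ireadsb(lunit) == 0)               ! advance to the next subset
        nsub = nsub + 1
     end do
  end do
  call cpu_time(t1)

  call closbf(lunit)
  print *, 'messages =', nmsg, ' subsets =', nsub, ' seconds =', t1 - t0
end program bufr_read_timing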

jack-woollen commented 1 year ago

Moving the tests to hera, using gfortran, v12 was actually a little bit faster than v11. So how do you compile the bufrlib with ifort on hera?

DavidHuber-NOAA commented 1 year ago

@jack-woollen Hmm, interesting. I compiled version 12 of bufrlib two ways -- CMAKE_BUILD_TYPE="Release" (case 2 here) and CMAKE_BUILD_TYPE="RelWithDebInfo" (case 1). Both are slower than version 11.7.0 with CMAKE_BUILD_TYPE="Release" (case ctl).

edwardhartnett commented 1 year ago

Can this problem be demonstrated in a bufrlib test program?

That is, can we get a simple, one-file program, which runs slower under v12 than v11?

DavidHuber-NOAA commented 1 year ago

@edwardhartnett It sounded like Jack had done just that in this comment.

jack-woollen commented 1 year ago

Timing these codes can be fiddly, depending on what platform they are tested on, and when. WCOSS2 has demonstrated widely varying timings via the unix time command. At first the results looked like v12 was slower than v11. That's when I wrote the comment above. But when I ran it over and over and over, the results became more murky. Then on Sunday I tried timing on hera, where v12 was a couple seconds faster than v11, but compiling with gfortran. Then I went back to WC2 (and ifort) and got results like hera, v12 faster by a hair.

So far my simple timing test only checks reading a large and compressed (mtiasi) file. I'm thinking running the GSI observer with a full set of data will give a better comparison for that code. Working on it now. The elephant in the room for bufrlib timing may be the prepobs process, since it does reading, writing, and arithmetic. I'll look at that also.

My question about compiling bufrlib on hera with ifort refers to the fact that when I download bufrlib from git, the cmake default is to use gfortran. I'm wondering what to change so it compiles with ifort. Any help with that is appreciated.

DavidHuber-NOAA commented 1 year ago

@jack-woollen Ah, understood. To compile with Intel on Hera, you only need to load the intel/2022.1.2 module then run cmake. Cmake will detect the Intel compiler and create a makefile based on it.

jack-woollen commented 1 year ago

Thanks to @DavidHuber-NOAA's help with the gsi cmake setup, I have been able to make timing comparisons of running the gsi observer mode with BF-11.7.0 and BF-12.0.0. The observer reads all the BUFR datasets into the program and then stops if the variable miter=0. So it exercises most if not all of the bufr coding within the gsi, and it only takes a few minutes to complete. Trying this and trying that has revealed what looks like roughly half of the timing issue apparent in the gsi runs using bufrlib version 12. Subroutine upb8.f, which is called when unpacking compressed bufr datasets, benefited from some optimization specifically for the 4-byte case. This subroutine was not introduced as part of the mods for the single library build, but rather to accommodate the new WMO-approved feature of allowing numeric elements to fill up to 64 bits, instead of just 32 as it has been until now. The fix for upb8.f can be reviewed on dogwood in directory /lfs/h2/emc/global/noscrub/Jack.Woollen/bufrtime/bufr_v12.0.0/NCEPLIBS-bufr/src.

Following is a list of timing results before and after the mods to upb8.

Before upb8 change

| bufr/11.7.0 (user time, s) | bufr/12.0.0 (user time, s) | Difference |
| --- | --- | --- |
| 268.026405 | 308.623302 | ~40s |
| 267.477490 | 309.731031 | ~42s |
| 267.551686 | 310.306406 | ~43s |
| 266.931843 | 316.532808 | ~50s |
| 268.897811 | 315.577455 | ~47s |

After upb8 change

| bufr/11.7.0 (user time, s) | bufr/12.0.0 (user time, s) | Difference |
| --- | --- | --- |
| 268.612845 | 290.555594 | ~22s |
| 274.917766 | 288.483719 | ~14s |
| 265.291353 | 290.776556 | ~25s |
| 265.125857 | 290.373777 | ~25s |
| 267.488541 | 290.313897 | ~23s |

DavidHuber-NOAA commented 1 year ago

@jack-woollen Fantastic! Thanks for the update. I will run this optimization through the GSI regression tests.

jbathegit commented 1 year ago

@jack-woollen I also have to add my kudos - well done sir! Your change to add an nbits<=32 block to optimize performance for the 4-byte majority of cases makes perfect sense, and it looks like David's preliminary tests with your update are having the intended effect as well. Thanks again!
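The general shape of such a fast path might look like the following sketch (illustrative only; the routine and argument names here are hypothetical and do not reproduce the actual upb8 interface):

! Hypothetical sketch of an "nbits <= 32" fast path: the common 4-byte case is
! handled by the existing 32-bit unpacking routine upb, and 64-bit assembly is
! reserved for values that can actually exceed 32 bits.
subroutine unpack_up_to_64_bits(nval, nbits, ibay, ibit)
  use iso_fortran_env, only: int64
  implicit none
  integer(int64), intent(out)   :: nval     ! unpacked value, up to 64 bits wide
  integer,        intent(in)    :: nbits    ! number of bits to unpack
  integer,        intent(in)    :: ibay(*)  ! packed message data
  integer,        intent(inout) :: ibit     ! bit pointer, advanced by nbits
  integer(int64), parameter :: mask32 = 4294967295_int64
  integer :: n4

  if (nbits <= 32) then
     ! Fast path for the vast majority of values: reuse the 32-bit routine.
     call upb(n4, nbits, ibay, ibit)
     nval = iand(int(n4, int64), mask32)
  else
     ! Wide values: unpack the high (nbits-32) bits, then the low 32 bits,
     ! and assemble a 64-bit result.
     call upb(n4, nbits-32, ibay, ibit)
     nval = ishft(iand(int(n4, int64), mask32), 32)
     call upb(n4, 32, ibay, ibit)
     nval = ior(nval, iand(int(n4, int64), mask32))
  end if
end subroutine unpack_up_to_64_bits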

I'll go ahead and set up a PR to add this to the library baseline for the next release, and I'll let you know when it's ready.