I have established a testing environment on Hera referencing a test build of the spack-stack unified environment. So far, this is working well. All regression tests run to completion. I am running the regression tests against the branch to be merged in #571. There are some differences between the two, with the hwrf, global_4denvar, global_3denvar, global_3dvar, and rtma tests all failing cost and/or penalty comparison tests, with values differing by 1-5% on the first iteration.
One potential source of the differences is a newer version of the BUFR library. The spack-stack test install provides version 12.0.0, whereas the hpc-stack-based install is compiled against version 11.7.0. Running spack-stack with the older version would determine whether this is the source of the differences.
@arunchawla-NOAA @RussTreadon-NOAA @hu5970 @ulmononian @mark-a-potts I ran regression tests on Hera with spack-stack version 1.4.1 + ncdiag/1.1.1 and bufr/11.7.0. The large differences that I was seeing with bufr/12.0.0 are now on the order of 10^4 smaller, so it seems that the new version of bufr is producing quite different results. Is it known if this difference is expected? And if so, would it be desirable to upgrade to bufr/12.0.0 at this time or should I continue incorporating spack-stack with bufr/11.7.0?
Dave ran ctests on Hera using develop built with bufr/11.7.0 and develop built with bufr/12.0.0. He reported large differences in the minimization.
Output from the loproc global_4denvar for updat (bufr/12.0.0) and contrl (bufr/11.7.0) was examined. Initial differences are limited to prepbufr temperature observations.
The initial temperature (obs-ges) innovation statistics (fort.203) differ for prepbufr report types 120, 132, 180, and 182. These report types are virtual temperatures. Bias and rms differences are larger for observations closer to the surface and decrease moving up in the atmosphere. Innovation statistics for report types 130, 131, 133, 134, and 136 are identical. These report types are sensible temperatures.
This suggests something odd happens when processing virtual temperature observations with bufr/12.0.0.
Prints added to src/gsi/read_prepbufr.f90 reveal that when gsi.x is built with bufr/12.0.0, the following call to ufbqcd

! Obtain program code (VTCD) associated with "VIRTMP" step
call ufbqcd(lunin,'VIRTMP',vtcd)

returns vtcd=0.395253-322 for sonde temperature obs. (I used g13.6 formatting and 0.395253-322 is what gets printed to stdout.)
Recompiling the code with bufr/11.7.0 and running the same case returns vtcd=8.00000 for sonde temperature obs.
Do we need to change or modify the call to ufbqcd, or is a different code change needed when moving to bufr/12.0.0?
Writing vtcd without formatting, the bufr/12.0.0 run has 3.952525166729972E-323.
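That magnitude is consistent with the program code being stored as a default (4-byte) integer into memory declared as real(8): the bit pattern for the integer 8 sitting in the low bytes of an IEEE double is the denormal 8 * 2^-1074 = 3.95E-323. A minimal sketch illustrating this, not GSI or bufrlib code, assuming 4-byte default integers and little-endian storage as on Hera:

```fortran
program vtcd_bits
  implicit none
  integer(4) :: halves(2)   ! two 4-byte integers occupying 8 bytes
  integer(4) :: ivtcd
  real(8)    :: vtcd

  ! What ufbqcd appears to do under bufr/12.0.0: store the integer 8
  ! into storage that the caller declared as real(8).
  halves = [ 8_4, 0_4 ]
  vtcd   = transfer(halves, vtcd)   ! reinterpret the 8 bytes as an IEEE double
  print *, vtcd                     ! ~3.952525166729972E-323, the value above

  ! Receiving the code into an integer and then assigning it, as in the
  ! read_prepbufr.f90 change below, recovers the expected value.
  ivtcd = 8
  vtcd  = ivtcd
  print *, vtcd                     ! 8.00000000000000
end program vtcd_bits
```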
Replace

! Obtain program code (VTCD) associated with "VIRTMP" step
call ufbqcd(lunin,'VIRTMP',vtcd)

with

! Obtain program code (VTCD) associated with "VIRTMP" step
call ufbqcd(lunin,'VIRTMP',ivtcd)
vtcd = ivtcd

in src/gsi/read_prepbufr.f90. Recompile gsi.x with bufr/12.0.0 and run the global_3dvar ctest. This time we get the expected value of 8.00000000000000 for vtcd.
The global_3dvar ctest failed:
Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -R global_3dvar
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
Start 1: global_3dvar
1/1 Test #1: global_3dvar .....................***Failed 2039.84 sec
0% tests passed, 1 tests failed out of 1
Total Test time (real) = 2039.92 sec
The following tests FAILED:
1 - global_3dvar (Failed)
This failure, however, was due to the maximum allowable threshold time check. The updat bufr/12.0.0 gsi.x ran slower than the contrl bufr/11.7.0 gsi.x.
global_3dvar_hiproc_contrl/stdout:The total amount of wall time = 321.045029
global_3dvar_hiproc_updat/stdout:The total amount of wall time = 358.756143
global_3dvar_loproc_contrl/stdout:The total amount of wall time = 474.135176
global_3dvar_loproc_updat/stdout:The total amount of wall time = 512.837189
This may reflect the additional prints in the bufr/12.0.0 read_prepbufr.f90.
Comparison of the analysis results shows that the bufr/12.0.0 run reproduces the bufr/11.7.0 run. This result is reasonable: changing the bufr library version should not change analysis results.
Not sure why replacing a real(8) argument with an integer(4) argument was needed in this call to ufbqcd. Similar changes were not made elsewhere. Thus, not confident the above change is the best change.
My working directories on Hera are

bufr/12.0.0 in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12
bufr/11.7.0 in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/develop

global_3dvar ctest results are in /scratch1/NCEPDEV/stmp2/Russ.Treadon/bufr12/tmpreg_global_3dvar

Other ctests have not yet been run with the above change.
Hera ctests with bufr/12.0.0
Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -j 9
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
Start 1: global_3dvar
Start 2: global_4dvar
Start 3: global_4denvar
Start 4: hwrf_nmm_d2
Start 5: hwrf_nmm_d3
Start 6: rtma
Start 7: rrfs_3denvar_glbens
Start 8: netcdf_fv3_regional
Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............***Failed 974.84 sec
2/9 Test #5: hwrf_nmm_d3 ......................***Failed 992.72 sec
3/9 Test #7: rrfs_3denvar_glbens .............. Passed 1037.49 sec
4/9 Test #4: hwrf_nmm_d2 ...................... Passed 1039.35 sec
5/9 Test #9: global_enkf ......................***Failed 1202.61 sec
6/9 Test #6: rtma .............................***Failed 2002.66 sec
7/9 Test #3: global_4denvar ...................***Failed 2224.99 sec
8/9 Test #2: global_4dvar ..................... Passed 2230.50 sec
9/9 Test #1: global_3dvar .....................***Failed 2392.88 sec
33% tests passed, 6 tests failed out of 9
Total Test time (real) = 2393.03 sec
The following tests FAILED:
1 - global_3dvar (Failed)
3 - global_4denvar (Failed)
5 - hwrf_nmm_d3 (Failed)
6 - rtma (Failed)
8 - netcdf_fv3_regional (Failed)
9 - global_enkf (Failed)
All failures except rtma were due to timing checks. Apart from rtma, bufr/12.0.0 analyses are identical with their bufr/11.7.0 counterparts. For several failed tests, the bufr/12.0.0 executable ran slower than the bufr/11.7.0 executable for one or more of the configurations. For example, global_3dvar:
tmpreg_global_3dvar/global_3dvar_hiproc_contrl/stdout:The total amount of wall time = 326.543222
tmpreg_global_3dvar/global_3dvar_hiproc_updat/stdout:The total amount of wall time = 359.565145
tmpreg_global_3dvar/global_3dvar_loproc_contrl/stdout:The total amount of wall time = 467.854069
tmpreg_global_3dvar/global_3dvar_loproc_updat/stdout:The total amount of wall time = 509.382080
and global_4denvar:
tmpreg_global_4denvar/global_4denvar_hiproc_contrl/stdout:The total amount of wall time = 373.331819
tmpreg_global_4denvar/global_4denvar_hiproc_updat/stdout:The total amount of wall time = 371.220896
tmpreg_global_4denvar/global_4denvar_loproc_contrl/stdout:The total amount of wall time = 385.547573
tmpreg_global_4denvar/global_4denvar_loproc_updat/stdout:The total amount of wall time = 434.971107
The rtma failure shows different initial innovation statistics for temperature report types 197 and 199. There is another call to ufbqcd in read_prepbufr.f90:

!see if file contains GLERL program code (GLCD)
!Obtain code if it exists. Otherwise set to missing (-999)
call status(lunin,lindx,idummy1,idummy2)
call nemtab(lindx,'GLERL',idummy1,cdummy,glret)
if (glret /= 0) then
call ufbqcd(lunin,'GLERL',glcd)
As a test, replace glcd with iglcd in the calling list and set glcd to iglcd after the call:

!see if file contains GLERL program code (GLCD)
!Obtain code if it exists. Otherwise set to missing (-999)
call status(lunin,lindx,idummy1,idummy2)
call nemtab(lindx,'GLERL',idummy1,cdummy,glret)
if (glret /= 0) then
call ufbqcd(lunin,'GLERL',iglcd)
glcd = iglcd
Recompile and rerun the rtma test. This time the rtma test passed.
Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -R rtma
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
Start 6: rtma
1/1 Test #6: rtma ............................. Passed 1278.39 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 1278.40 sec
Stop here and let others follow up / investigate more.
Also tagging @jbathegit @edwardhartnett @wx20jjung @dtkleist for awareness.
@DavidHuber-NOAA and @RussTreadon-NOAA thanks for your tests on this! For now I would focus on building and testing with spack-stack that has bufr/11.7.0. I am going to bring this to the attention of @jbathegit, @jack-woollen, and @edwardhartnett to see if they can shed more light on Russ's results.
It looks like you figured the issue out on your own. This was indeed a planned change within v12.0.0, as noted in the release notes...
This was done intentionally as part of the simplification down to one library build. Note that in previous versions of ufbqcd, the 3rd argument returned was just a real, not an explicit real*8. So it was only a real*8 in the _8 and _d builds because of the compile flags for those builds, and not because of some intrinsic need within the library itself. The return value of the 3rd argument is never larger than 255 so it fits easily inside of any real, and in fact it's always a whole number so there's really no reason not to return it as an integer and make everything simpler.
I feel the need to contribute my two cents in supporting backward compatibility with downstream codes as an important principle in software management. Considering reasons not to make such a change as this, seems to me there's not a good enough compensating reason to do it.
I hear what you're saying Jack, and I totally agree that backwards compatibility is an important principle in software management. But in this case, this is no different than where we've already agreed that _8 users will now need to make a change to their application codes to add a call to setim8b, and to make sure that all calls to library functions which return integers are now explicitly declared as integer*4.
In other words, with a major release such as this (where the X of the version X.Y.Z is changing), it's understood that downstream users may need to make some adjustments to their application codes, and this ufbqcd change is just another example of that, and we've already documented this in the accompanying release notes for _8 and _d users.
Note also that, if we hadn't included this ufbqcd change, then as software managers we'd have had to instead develop new internal routines similar to x48 and x84, but for reals instead of integers, since this routine (along with its sister routine ufbqcp) were the only cases of user-callable routines where a real argument was being passed that wasn't already explicitly declared as real*8. And we couldn't have just changed the library to now make that 3rd argument an explicit real*8, because doing so would have similarly impacted any existing _4 users. So bottom line, no matter what we did here, it would have impacted some of our users, and this is just one of the consequences of our earlier decision to eliminate the previous _8 and _d builds and consolidate everything down to a single _4 build.
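For readers following along, here is a rough sketch of the two _8-build adjustments described above. It is illustrative rather than taken from any code in this thread: the file name, unit number, and use of ireadmg are made up, and the single logical argument to setim8b is assumed from the v12 release notes.

```fortran
program im8b_adjustments
  implicit none
  ! Application-side integers in an _8 build remain 8-byte...
  integer(8)   :: lunit, idate
  character(8) :: subset
  ! ...but integer-valued library functions are now declared integer*4.
  integer(4)   :: iret
  integer(4), external :: ireadmg

  ! Tell the library that integer arguments passed to it are 8-byte.
  call setim8b ( .true. )

  lunit = 11
  open (lunit, file='prepbufr', form='unformatted')
  call openbf (lunit, 'IN', lunit)
  iret = ireadmg (lunit, subset, idate)   ! read the first BUFR message
  call closbf (lunit)
  print *, 'first message subset/idate/iret = ', subset, idate, iret
end program im8b_adjustments
```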
On a separate but related note, I did note the discussion where some of the spack-stack test cases appear to be running slower with the new 12.0.0 build than with the previous 11.7.0 build. That's definitely not a feature that we intentionally introduced into the new build, nor is it something that we saw in our own regression testing for the library, so if it's still an issue of concern then we'd definitely like to better understand what part(s) of the library may be responsible. Is there any way that some sort of profiling tool might better pinpoint where those differences are (e.g. in which library functions or lines of code) for your test cases?
I ran regression tests with spack-stack/1.4.1 on Hera against the current develop branch with the following results:
1/9 Test #7: rrfs_3denvar_glbens .............. Passed 1293.51 sec
2/9 Test #4: hwrf_nmm_d2 ......................***Failed 1592.51 sec
3/9 Test #5: hwrf_nmm_d3 ...................... Passed 1601.61 sec
4/9 Test #9: global_enkf ...................... Passed 1616.37 sec
5/9 Test #8: netcdf_fv3_regional ..............***Failed 1651.17 sec
6/9 Test #6: rtma ............................. Passed 2195.24 sec
7/9 Test #3: global_4denvar ...................***Failed 2551.00 sec
8/9 Test #2: global_4dvar .....................***Failed 2608.58 sec
9/9 Test #1: global_3dvar .....................***Failed 2730.67 sec
Looking into the netcdf_fv3_regional case, differences first appear in initial radiance penalties. This seems like a CRTM issue, though I am having trouble tracking it down.
The CRTM directories are
hpc-stack: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/crtm/2.4.0
spack-stack: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-wpiygpl
The CRTM fix directories are
hpc-stack: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/crtm/2.4.0/fix
spack-stack: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-fix-2.4.0_emc-zvsufvh
The contents of the fix directories are identical. The hpc-stack source code for CRTM is located here: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/crtm-v2.4.0, which is identical to the version cloned by spack-stack (github.com/JCSDA/crtm version 2.4.0; hash 5ddd0d6).
So there doesn't seem to be an obvious difference in CRTM. The library versions do differ between the two stacks: in particular, spack-stack uses hdf5 1.14.0 and netcdf 4.9.2, its nemsio is still at 2.5.2, and it includes w3nco, but I don't think these should have an impact.
hpc-stack
1) miniconda3/4.12.0 9) netcdf/4.7.4 17) nemsio/2.5.4
2) hpc/1.2.0 10) bufr/11.7.0 18) wrf_io/1.2.0
3) intel/2022.1.2 11) bacio/2.4.1 19) ncio/1.1.2
4) hpc-intel/2022.1.2 12) w3emc/2.9.2 20) crtm/2.4.0
5) impi/2022.1.2 13) sp/2.3.3 21) ncdiag/1.1.1
6) hpc-impi/2022.1.2 14) ip/3.3.3 22) gsi_common
7) cmake/3.20.1 15) sigio/2.3.2 23) prod_util/1.2.2
8) hdf5/1.10.6 16) sfcio/1.4.1 24) gsi_hera.intel
spack-stack
1) intel/2022.1.2 12) zstd/1.5.2 23) nemsio/2.5.2
2) stack-intel/2021.5.0 13) netcdf-c/4.9.2 24) wrf-io/1.2.0
3) impi/2022.1.2 14) netcdf-fortran/4.6.0 25) ncio/1.1.2
4) stack-intel-oneapi-mpi/2021.5.1 15) bufr/11.7.0 26) crtm-fix/2.4.0_emc
5) miniconda/3.9.12 16) bacio/2.4.1 27) git-lfs/2.10.0
6) stack-python/3.9.12 17) w3emc/2.9.2 28) crtm/2.4.0
7) cmake/3.23.1 18) sp/2.3.3 29) gsi-ncdiag/1.1.1
8) curl/7.29.0 19) ip/3.3.3 30) gsi_common
9) pkg-config/0.27.1 20) sigio/2.3.2 31) prod-util/1.2.2
10) zlib/1.2.13 21) sfcio/1.4.1 32) gsi_hera.intel
11) hdf5/1.14.0 22) w3nco/2.4.1
@RussTreadon-NOAA Would you mind taking a look at the RTs on Hera? My regression test directory is in /scratch1/NCEPDEV/nems/David.Huber/GSI/gsi_spackstack/build/regression and the working directories are in /scratch1/NCEPDEV/stmp/David.Huber/ptmp_spack.
A check of the run directories shows the crtm coefficients to be identical between the updat and contrl runs.
A check of fort.207
shows initial radiance penalty differences between the updat and contrl for the following satellite_sensors: amsua_n15, amsua_n18, amsua_n19, atms_npp, atms_n20. These are all microwave sensors. However, other microwave satellite_sensors have identical initial penalties between the updat and contrl (e.g., mhs_n19, amsua_metop-b, amsua_metop-c, mhs_metop-c).
Are the same compiler flags used to build the hpc-stack and spack-stack crtm libraries?
@AlexanderRichert-NOAA Could you point me to the build log for spack-stack/1.4.1 CRTM/2.4.0 on Hera? I'm seeing some differences between it and the hpc-stack version.
Sure-- /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-wpiygpl/.spack/spack-build-out.txt (and there's other build-related info in that directory)
I won't look at it any further unless you'd like me to, but for what it's worth the difference that stands out to me is the CMAKE_BUILD_TYPE (hpc-stack=Release, spack-stack=RelWithDebInfo), and we've seen some differences in other codes based on that setting.
Thanks @AlexanderRichert-NOAA, and thanks @RussTreadon-NOAA for the question. A few obvious differences I see are that hpc-stack builds with -O3 while spack-stack builds with -O2, and that the two builds also differ in loop unrolling (-unroll flag) and inlining (-inline flag). So optimizing with -O2 instead of -O3 is a probable culprit. I will try building my own CRTM with spack-stack libraries and CMAKE_BUILD_TYPE=Release to see if that resolves the differences.
After building the CRTM with spack-stack libraries and CMAKE_BUILD_TYPE=Release on Hera, I rebuilt the GSI with spack-stack libraries and this custom CRTM, then ran regression tests against develop with the following results:
1/9 Test 9: global_enkf ...................... Passed 881.88 sec
2/9 Test 8: netcdf_fv3_regional .............. Passed 909.52 sec
3/9 Test 7: rrfs_3denvar_glbens ..............***Failed 913.31 sec
4/9 Test 5: hwrf_nmm_d3 ...................... Passed 919.42 sec
5/9 Test 4: hwrf_nmm_d2 ...................... Passed 975.22 sec
6/9 Test 2: global_4dvar .....................***Failed 1748.62 sec
7/9 Test 6: rtma .............................***Failed 1938.87 sec
8/9 Test 3: global_4denvar ...................***Failed 1992.34 sec
9/9 Test 1: global_3dvar .....................***Failed 2111.99 sec
The RRFS and RTMA tests failed scalability checks. The global_4denvar and global_3dvar both failed memthresh and maxmem tests, both by a factor of ~2 over the maximum allowable memory:
The memory for global_4denvar_loproc_updat is 5188292 KBs. This has exceeded maximum allowable hardware memory limit of 2516582 KBs, resulting in Failure maxmem of the regression test.
The memory for global_3dvar_loproc_updat is 4241260 KBs. This has exceeded maximum allowable hardware memory limit of 2516582 KBs, resulting in Failure maxmem of the regression test.
I'm not sure if this is fatal or not, but regardless it does seem like a very large increase in memory. I'm also not really sure how to debug this.
Lastly, for the global_4dvar test, the contrl and updat siginc files differed. Running ncdump on both files and comparing values revealed the relative differences are on the order of 10^-6 for multiple variables. I verified that the penalties for both runs were identical, which suggests that these differences are related to the netCDF writes. Knowing that netCDF 4.9.2 allows for lossy compression, I checked the file sizes, but they are very similar (hpc-stack: 39487168 vs spack-stack: 39484152) and I believe this option needs to be explicitly invoked, so that doesn't seem to be the culprit. To verify if this is a netCDF 4.9.2 issue, I will run the regression tests again using hpc-stack with netcdf/4.9.2.
Compiling the develop branch with netcdf/4.9.2 did not have an effect on the global_4dvar siginc files. The differences are still there. I will need to dig in and find which other libraries are involved in the creation of these files -- probably ncio and possibly w3emc.
Rebuilding the ncio library with spack-stack-built netcdf/4.9.2 and hpc-stack build options (Release instead of the spack-stack RelWithDebInfo) resulted in identical global_4denvar siginc files between the spack-stack and develop GSI branches.
I believe the mysteries are solved now. Optimization differences between spack-stack and hpc-stack result in some small numerical differences, particularly for the crtm and ncio libraries. I will go ahead and open a PR to bring these changes in.
For whatever it's worth, that RelWithDebInfo will go away in the next spack-stack release, as spack recently changed the default to Release.
That's good to know. Thanks @AlexanderRichert-NOAA!
I found an issue in gsi-ncdiag where allocating the HDF5 chunk size when opening a netCDF file in append mode to 16GB causes maxmem failures. This is a problem with HDF5 1.14.0, but not 1.10.6. A new version of gsi-ncdiag will need to be installed on all platforms under spack-stack to resolve this issue. https://github.com/NOAA-EMC/GSI-ncdiag/issues/7.
@jbathegit @RussTreadon-NOAA @aerorahul I made Russ's fixes for BUFR v12.0.0 within read_prepbufr.f90 and reran regression tests against develop. Also in this round of regression tests, CRTM and NCIO were built using Release rather than the spack-stack RelWithDebInfo to maintain consistent results. The global_3dvar and global_4denvar tests failed the timing checks with the following reported runtimes (apologies for losing the global_3dvar_hiproc_contrl time):
Test | Time Threshold | Runtime |
---|---|---|
global_4denvar_loproc_contrl | 406.458776 | 363.714842 |
global_4denvar_loproc_updat | 406.458776 | 406.554885 |
global_4denvar_hiproc_contrl | 321.898368 | 285.175113 |
global_4denvar_hiproc_updat | 321.898368 | 327.356095 |
global_3dvar_loproc_contrl | 519.659476 | 467.417641 |
global_3dvar_loproc_updat | 519.659476 | 517.557172 |
global_3dvar_hiproc_contrl | 330.294649 | --missing-- |
global_3dvar_hiproc_updat | 330.294649 | 338.319537 |
The global_4dvar test also failed for different results in the siginc files. However, the differences are only in the HDF5 headers, and all data and netCDF4 headers are identical (known difference).
I then compiled BUFR 12.0.0 with Release (mimicking hpc-stack) and reran the same two tests. Now the runtimes are much improved and only the global_4denvar_hiproc_updat failed its timethresh test, by 0.2 seconds:
Test | Time Threshold | Runtime |
---|---|---|
global_4denvar_loproc_contrl | 406.458776 | 369.507979 |
global_4denvar_loproc_updat | 406.458776 | 393.796189 |
global_4denvar_hiproc_contrl | 321.898368 | 292.634880 |
global_4denvar_hiproc_updat | 321.898368 | 322.139538 |
global_3dvar_loproc_contrl | 519.659476 | 472.417706 |
global_3dvar_loproc_updat | 519.659476 | 510.610903 |
global_3dvar_hiproc_contrl | 330.294649 | 300.267863 |
global_3dvar_hiproc_updat | 330.294649 | 324.827100 |
Lastly, to establish a baseline, I ran just the global_4denvar and global_3dvar tests with BUFR 11.7.0 (built with Release for consistency), which resulted in the following times:
Test | Time Threshold | Runtime |
---|---|---|
global_4denvar_loproc_contrl | 406.458776 | 370.379223 |
global_4denvar_loproc_updat | 406.458776 | 367.325278 |
global_4denvar_hiproc_contrl | 321.898368 | 294.912487 |
global_4denvar_hiproc_updat | 321.898368 | 295.425141 |
global_3dvar_loproc_contrl | 519.659476 | 467.687210 |
global_3dvar_loproc_updat | 519.659476 | 470.499224 |
global_3dvar_hiproc_contrl | 330.294649 | 301.406168 |
global_3dvar_hiproc_updat | 330.294649 | 303.492537 |
So it does appear that the upgrade to BUFR 12.0.0 slows processing by 20-40 seconds when fully optimized.
Next, I would like to run a profiling tool on the GSI to identify the hot BUFR spots, but I am trying to find a good profiler to use. Compiling with -p is causing a crash when trying to read netCDF files in parallel. Do you have a better suggestion, @jbathegit?
Thanks @DavidHuber-NOAA for this update!
To answer your question, no unfortunately I don't have any better suggestion on how to narrow this down, because I don't have any personal experience with profiling tools. From a code management standpoint, I would certainly support any improvements that could be made to the library to make it run more efficiently for your and other application codes, but again I'm just not sure how much help I could be in figuring out where exactly those issues may be happening.
BTW, and just curious - for the codes which are showing the large increase in run times, do those happen to be compiled using 8-byte integers, and where you therefore now need to use the setim8b switch?
@DavidHuber-NOAA for what it's worth, I think gprof is probably the easiest one to get started with. With gcc or intel, I think you should be able to add the "-pg" to your compile flags, then see steps 2 and 3 here: https://www.thegeekstuff.com/2012/08/gprof-tutorial/ (basically you run the executable then run it with gprof; this might take some tinkering to get it to work with MPI)
@AlexanderRichert-NOAA Thanks for the suggestion. Unfortunately, when trying to run after compiling with -pg, the GSI crashes in a netCDF routine. I don't have the log messages anymore, so I can't say exactly what they were at the moment. I am trying out VTune presently, which at least ran to completion. I will let you know when I can get something meaningful out of the results.
@jbathegit No, I don't believe it is using 8-byte integers in any of the BUFR calls. There are only a handful of modules that use that length and they don't deal with reading/processing BUFR files. Thanks for the heads up on the new feature, though.
Per a conversation with @aerorahul, the spack-stack upgrade should proceed with BUFR 11.7.0. An upgrade to BUFR 12.0.0 can be handled at a later time. I will open an issue for that shortly.
@jbathegit You might've thought of this already, but just to be sure, it's possible the extra if tests for im8b may be causing the slowdown. One way to check is just remove them all in a test code and see what happens.
@jbathegit I suggest you write a quick test which captures this performance problem...
I'm not sure how to do that if we don't even know what the performance problem is, or what part of the library it may be lurking in. It's not easy to capture something when you don't even know what it is you're trying to catch. I was hoping that the above 4denvar and 3dvar examples would be able to provide some clues to where any performance inefficiencies may be.
Or, if you're referring to Jack's suggestion to just remove all of the im8b checks from a test version of the library, then that wouldn't be a "quick test" either b/c that's over 100 different routines that would need to be changed. And even if that did turn out to be a noticeable time sink, I'm not sure how we could improve on it anyway, since David already noted that he believed there weren't any 8-byte integers in any of their codes which interact with the library, and therefore it just ends up being a one-line "if" test whenever it hits any of those library routines.
Bottom line, I'm all for finding and fixing any such problems, but I really don't know how to narrow it down at this point and feel like I'd be searching for a needle in a haystack.
Right. If the if tests are causing the slowdown, then the ones in the lower-level routines which get called more often (internally) would take the most time. They would be the ones to start looking at, not necessarily all 100+ routines containing if tests. I can delve into it when I get a chance. Who knows, it may have nothing to do with if tests. We'll see. It does need to be sorted, though.
Recently I raised the topic of performance tests for BUFR, and this seems like exactly what is needed.
Take a task similar to the one that is now causing the performance problem. Write a test which performs that task, and time it.
Then backport that test to the previous version of budrlib. If it runs slower on the 12.x release, then you have isolated your performance problem.
The test then remains part of the test code, ready to be used again for next release.
> Take a task similar to the one that is now causing the performance problem. Write a test which performs that task, and time it.
That's the issue and exactly the point I was trying to make - we don't really know what the performance problem is, because it's apparently happening somewhere deep down inside the bowels of the global_4denvar and global_3dvar codes. Those are large and presumably very complex codes which can't really be emulated by writing a separate shorter test, especially since we have no idea which of the undoubtedly numerous "tasks" inside of those codes may be the real culprit that we're trying to isolate here.
In other words, how do we write a "test which performs that task", when we don't even know what that task is? Unless by "task" you mean the entire global data assimilation, but I presume you're not suggesting that we try to add those entire codes to our library test base.
Instrument the bufr code so you know what is being called, and the run the problem application.
Now write a test that does those same calls, but with phony data.
I do this all the time with the netCDF library. In fact, most of the metadata calls can probably be ignored, making this task simpler.
Alternatively, or additionally, write some performance tests for bufrlib performing commonly-done tasks. Perhaps the answer will pop out; if not you will have some useful tests to monitor performance of future releases.
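As a concrete starting point, a timing test along these lines might work. It is only a sketch using common bufrlib calls that read_prepbufr.f90 also relies on; the file name, unit number, array sizes, and mnemonic string are illustrative rather than taken from the GSI.

```fortran
program time_prepbufr_scan
  implicit none
  integer, parameter :: lunit = 11, mxmn = 8, mxlv = 255
  real(8)      :: obs(mxmn,mxlv)
  character(8) :: subset
  integer      :: idate, iret, nsub
  integer      :: ireadmg, ireadsb   ! integer-valued bufrlib functions
  real         :: t0, t1

  nsub = 0
  open (lunit, file='prepbufr', form='unformatted')
  call openbf (lunit, 'IN', lunit)

  call cpu_time (t0)
  do while ( ireadmg(lunit, subset, idate) == 0 )   ! loop over messages
     do while ( ireadsb(lunit) == 0 )               ! loop over subsets
        call ufbint (lunit, obs, mxmn, mxlv, iret, &
                     'POB QOB TOB ZOB UOB VOB XOB YOB')
        nsub = nsub + 1
     end do
  end do
  call cpu_time (t1)

  call closbf (lunit)
  print *, 'subsets read =', nsub, '   cpu seconds =', t1 - t0
end program time_prepbufr_scan
```

Timing the same program linked first against bufr/11.7.0 and then against 12.0.0 (both built Release) on one of the regression-test prepbufr files would show whether the read path alone accounts for the 20-40 second difference reported above.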
This discussion should probably be continued in the NCEPLIBS-bufr issue list.
@DavidHuber-NOAA I'm looking at the timing tables, trying to understand what role bufrlib plays in the differences. In the reconfigured tables below the control runs show very little difference and the update runs show quite a bit. Should I assume that bufrlib was not a big part, or any part, of the control runs, and was involved in the update runs? It would help to know the details of what processes make up each of the two types of tests. Thanks.
Test | Threshold | Runtime ctl | Runtime #1 | Runtime #2 | rt1-ct | rt2-ct | rt1-rt2
---|---|---|---|---|---|---|---
global_4denvar_loproc_updat | 406.46 | 367.33 | 406.55 | 393.80 | 39.23 | 26.47 | 12.76
global_4denvar_hiproc_updat | 321.90 | 295.43 | 327.36 | 322.14 | 31.93 | 26.71 | 5.22
global_3dvar_loproc_updat | 519.66 | 470.50 | 517.56 | 510.61 | 47.06 | 40.11 | 6.95
global_3dvar_hiproc_updat | 330.29 | 303.49 | 338.32 | 324.83 | 34.83 | 21.33 | 13.49
total | 1578.31 | 1436.74 | 1589.79 | 1551.37 | 153.05 | 114.63 | 38.41

Test | Threshold | Runtime ctl | Runtime #1 | Runtime #2 | rt1-ct | rt2-ct | rt1-rt2
---|---|---|---|---|---|---|---
global_4denvar_loproc_contrl | 406.46 | 370.38 | 363.71 | 369.51 | -6.66 | -0.87 | -5.79
global_4denvar_hiproc_contrl | 321.90 | 294.91 | 285.18 | 292.63 | -9.74 | -2.28 | -7.46
global_3dvar_loproc_contrl | 519.66 | 467.69 | 467.42 | 472.42 | -0.27 | 4.73 | -5.00
global_3dvar_hiproc_contrl | 330.29 | 301.41 | 300.27 | 300.27 | -1.14 | -1.14 | 0.00
total | 1578.31 | 1434.39 | 1416.58 | 1434.83 | -17.81 | 0.44 | -18.25
With the upgrade to HDF5 version 1.14.0 ongoing (#563), this would be a good time to move to the spack-stack unified environment, which also incorporates this version of HDF5 and netCDF version 4.9.2. Some additions are needed to the unified environment (in particular, ncio, ncdiag, and bufr), but these can be added manually as they already exist in spack. This is also desired for the upcoming UFS SRW release, which will incorporate the GSI.