NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0

GSI building issues with HDF5 1.14.0 #563

Closed junwang-noaa closed 11 months ago

junwang-noaa commented 1 year ago

The library team is trying to update HDF5 from the current 1.10.6 to the new version 1.14.0, which contains the parallel netCDF bug fixes. However, the initial test of GSI built with HDF5 1.14.0 failed (please see the comments from George V. in https://github.com/ufs-community/ufs-weather-model/issues/1621). Could someone from the GSI group take a look at this?

junwang-noaa commented 1 year ago

@Hang-Lei-NOAA @AlexanderRichert-NOAA would you please provide the module files for HDF5 1.14.0 related libraries? Thanks

AlexanderRichert-NOAA commented 1 year ago

On Acorn, to use HDF5 1.12.2:

/lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.3.0/envs/unified-env-compute-hdf5-1.12.2/install/modulefiles/Core

and to use HDF5 1.14.0:

/lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.3.0/envs/unified-env-compute/install/modulefiles/Core

Add those to $MODULEPATH (module use ...) and load the stack-intel and stack-cray-mpich modules as needed.
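As a rough sketch of the step above, assuming an Lmod-style environment where `module use` prepends a directory to `$MODULEPATH`, the helper below (a hypothetical name, `prepend_modulepath`) does the same thing by hand and avoids duplicate entries. The path is the HDF5 1.14.0 one quoted above; the `module load` line is left as a comment because it only works where the `module` command and those modules exist.

```shell
#!/bin/bash
# Hypothetical helper: mimic what `module use DIR` does to MODULEPATH.
prepend_modulepath() {
  local dir=$1
  case ":${MODULEPATH:-}:" in
    *":${dir}:"*) ;;   # already present, nothing to do
    *) MODULEPATH="${dir}${MODULEPATH:+:${MODULEPATH}}" ;;
  esac
  export MODULEPATH
}

# HDF5 1.14.0 environment on Acorn, as given in the comment above.
hdf5_114_core=/lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.3.0/envs/unified-env-compute/install/modulefiles/Core
prepend_modulepath "$hdf5_114_core"

# Then, on a system where the module command is available:
#   module load stack-intel stack-cray-mpich
```

In practice `module use <path>` is the simpler route; the function only makes explicit what happens to the search path.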

Hang-Lei-NOAA commented 1 year ago

On Acorn, we do not have fix files for GSI; someone needs to update the link to the fix files. Please use the following modulefiles to build GSI:

/lfs/h1/emc/eib/noscrub/Hang.Lei/GSI/modulefiles/gsi_wcoss2.lua

/lfs/h1/emc/eib/noscrub/Hang.Lei/GSI/modulefiles/gsi_common.lua


dtkleist commented 1 year ago

@arunchawla-NOAA -- I believe you have someone to assign to this issue, correct?

arunchawla-NOAA commented 1 year ago

Yes. Let me get back on this

DavidHuber-NOAA commented 1 year ago

@natalie-perlin and I have made some progress on this. Starting with the branch RussTreadon-NOAA:intel2022, I updated the hpc-stack location and the hdf5/netcdf versions, then ran regression tests, comparing against @RussTreadon-NOAA's branch as a baseline. All hdf5/1.14.0 tests completed, but some of the hdf5/1.10.6 tests stalled and/or ran into time limits (global_3dvar, global_4dvar, and global_4denvar). Also, multiple tests produced different analysis results. I have not analyzed these in detail, but they are concerning because they differ between the loproc and hiproc tests run with the same hdf5/1.14.0 executable (hwrf_nmm_d2 and d3, netcdf_fv3_regional, rrfs_3denvar_glbens, and rtma).

I ran similar tests on Hera and @natalie-perlin ran them on Gaea. Hera ran to completion (I no longer have the test results, but will rerun them now that Hera is back up from maintenance), while Gaea crashed with hdf5/1.14.0 for the global_3dvar and global_4denvar tests.

Note that to run the tests with different modulefiles, I used a method described by @RussTreadon-NOAA to load the appropriate modulefiles at run time by modifying sub_jet as follows:

 myuser=$LOGNAME
 myhost=$(hostname)

+exp=${jobname}
+if [[ ${exp} == *"updat"* ]]; then
+   modulefiles=/mnt/lfs1/NAGAPE/epic/David.Huber/GSI/gsi_hdf5.14/modulefiles
+elif [[ ${exp} == *"contrl"* ]]; then
+   modulefiles=/mnt/lfs1/NAGAPE/epic/David.Huber/GSI/gsi_22/modulefiles
+fi
+
+
 DATA=${DATA:-$ptmp/tmp}

 mkdir -p $DATA
@@ -126,7 +135,7 @@ echo "" >>$cfile
 echo ". /apps/lmod/lmod/init/sh"                           >> $cfile
 echo "module purge"                                        >> $cfile
-echo "module use $gsisrc/modulefiles"                      >> $cfile
+echo "module use $modulefiles"                             >> $cfile
 echo "module load gsi_jet" >> $cfile
 echo "module list"                                         >> $cfile
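
The selection logic added in the diff can also be written as a small function with an explicit fallback, so that a job name matching neither pattern still gets a modulefiles directory. This is only a sketch: the function name `select_modulefiles`, the example job name, and the placeholder directories are hypothetical; the substrings `updat` and `contrl` are taken from the diff above.

```shell
#!/bin/bash
# Hypothetical helper: pick a modulefiles directory from the job name.
# Arguments: job name, updat directory, contrl directory, fallback directory.
select_modulefiles() {
  local exp=$1 updat_dir=$2 contrl_dir=$3 fallback_dir=$4
  case "$exp" in
    *updat*)  echo "$updat_dir" ;;
    *contrl*) echo "$contrl_dir" ;;
    *)        echo "$fallback_dir" ;;   # e.g. $gsisrc/modulefiles, as before the change
  esac
}

# Example call with placeholder paths (not real locations):
select_modulefiles "global_4dvar_loproc_updat" /path/to/updat /path/to/contrl /path/to/default
```

Without a fallback, `$modulefiles` stays unset for any other job name and the generated `module use` line would have no argument.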

DavidHuber-NOAA commented 1 year ago

On Hera, all tests pass except global_4dvar, global_4denvar, and global_3dvar:

global_4dvar fails due to different siginc files between loproc_updat and loproc_contrl, which I will investigate further.

global_4denvar and global_3dvar fail because the maximum memory threshold is exceeded, which is non-critical.

DavidHuber-NOAA commented 1 year ago

On further investigation, the loproc_contrl and loproc_updat siginc files generated in the global_4dvar step are slightly different sizes (39487168 vs 39483763 bytes) and appear to contain different header information. When compared with nccmp, however, the data, metadata, and encoding are identical, so I believe this is a false positive.
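
A check of this kind can be sketched as follows, assuming `nccmp` is on the PATH and that its `-d` (data) and `-m` (metadata) options are enough for the comparison; the exact options used above are not stated, and the function name and file names here are hypothetical. The aim is to separate a byte-level difference from a difference in content.

```shell
#!/bin/bash
# Hypothetical helper: classify how two netCDF files differ.
# NCCMP may be set to override the comparison tool; it defaults to nccmp.
classify_diff() {
  local a=$1 b=$2 tool=${NCCMP:-nccmp}
  if cmp -s "$a" "$b"; then
    echo "identical bytes"
    return 0
  fi
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "comparison tool not found: $tool" >&2
    return 2
  fi
  # nccmp exits with status 0 when the requested aspects match.
  if "$tool" -d -m "$a" "$b" >/dev/null 2>&1; then
    echo "same content, different bytes"
  else
    echo "content differs"
  fi
}

# Example (file names are placeholders):
#   classify_diff siginc.contrl.nc siginc.updat.nc
```

A result of "same content, different bytes" corresponds to the false positive described above: the regression test flags the files as different while the netCDF contents agree.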

DavidHuber-NOAA commented 1 year ago

I found an issue in gsi-ncdiag: setting the HDF5 chunk size to 16GB when opening a netCDF file in append mode causes maxmem failures. This is a problem with HDF5 1.14.0, but not with 1.10.6. A new version of gsi-ncdiag will need to be installed on all platforms under spack-stack to resolve this issue. See NOAA-EMC/gsi-ncdiag#7.