NOAA-EMC / hpc-stack

Create a software stack for HPC's
GNU Lesser General Public License v2.1
30 stars 36 forks source link

Inconsistent handling of prerequisite modules in lua modulefiles, some of which result in unintended behavior on some platforms #376

Open mkavulich opened 2 years ago

mkavulich commented 2 years ago

Describe the bug

The short summary: using load(modulename) in Lua modulefiles will not necessarily get you the module you expect. This is currently causing failures of the Short Range Weather Application on Cheyenne, and could theoretically result in unexpected behavior on any platform, up to and including build and runtime failures

Many NCEPLIBS have dependencies on each other, this is obvious. Currently, however, there doesn't seem to be any uniformity in how these dependencies are handled; and some of the current methods are less than ideal. Here are the various methods used:

https://github.com/NOAA-EMC/hpc-stack/blob/44c8e077a21099fcc1883cd297783cdfa6c08b23/modulefiles/mpi/compilerName/compilerVersion/mpiName/mpiVersion/atlas/atlas.lua#L15-L18

https://github.com/NOAA-EMC/hpc-stack/blob/44c8e077a21099fcc1883cd297783cdfa6c08b23/modulefiles/mpi/compilerName/compilerVersion/mpiName/mpiVersion/wgrib2/wgrib2.lua#L15-L18

https://github.com/NOAA-EMC/hpc-stack/blob/44c8e077a21099fcc1883cd297783cdfa6c08b23/modulefiles/mpi/compilerName/compilerVersion/mpiName/mpiVersion/madis/madis.lua#L15-L18

To Reproduce On Cheyenne, run these commands and observe the environment changes:

module purge
module load ncarenv/1.3
module load gnu/9.1.0
module load mpt/2.22
module load ncarcompilers/0.5.0
module use /glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.1.0/modulefiles/stack
module load hpc/1.1.0
module load hpc-gnu/9.1.0
module load hpc-mpt/2.22
module load netcdf/4.7.4
module list

Currently Loaded Modules:
  1) ncarenv/1.3           3) hpc/1.1.0   5) hpc-gnu/9.1.0   7) hpc-mpt/2.22   9) netcdf/4.7.4
  2) ncarcompilers/0.5.0   4) gnu/9.1.0   6) mpt/2.22        8) hdf5/1.10.8

module show netcdf
----------------------------------------------------------------------------------------------------------------------------------------
   /glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.1.0/modulefiles/mpi/gnu/9.1.0/mpt/2.22/netcdf/4.7.4.lua:
----------------------------------------------------------------------------------------------------------------------------------------

module load wgrib2/2.0.8

The following have been reloaded with a version change:
  1) netcdf/4.7.4 => netcdf/4.8.1

module list

Currently Loaded Modules:
  1) ncarenv/1.3           3) hpc/1.1.0   5) hpc-gnu/9.1.0   7) hpc-mpt/2.22   9) wgrib2/2.0.8
  2) ncarcompilers/0.5.0   4) gnu/9.1.0   6) mpt/2.22        8) netcdf/4.8.1

module show netcdf
----------------------------------------------------------------------------------------------------------------------------------------
   /glade/u/apps/ch/modulefiles/default/gnu/9.1.0/netcdf/4.8.1.lua:
----------------------------------------------------------------------------------------------------------------------------------------

What is happening here is that wgrib2's load(netcdf) command has no consideration for if the user has already loaded a version of netcdf. So because a netcdf system default module exists on Cheyenne, the module rules for wgrib2 cause the existing netcdf/4.7.4 from hpc-stack to be unloaded, and replaced by the system default netcdf/4.8.1. This causes mismatches in libraries that ultimately results in runtime failures for the Short Range Weather Application; see that issue for more details: https://github.com/ufs-community/ufs-srweather-app/issues/201

Expected behavior One would expect that if you already have a module loaded, that choice will not be overwritten. If a subsequent module is not compatable with your currently loaded modules, it should fail with a descriptive message, not overwrite the current environment. This is especially problematic for systems that already have a built-in module with the same name of one used in hpc-stack: in the case of Cheyenne, this is netCDF, but the problem could manifest on any platform with different versions of the same libraries used in hpc-stack.

System: Cheyenne, but this issue may occur on any platform that has built-in modules with the same names as those found in hpc-stack.

Additional context As far as I can see there are two potential fixes to this.

  1. Simply remove the load command from each .lua module that contains a prereq command, allowing the module to fail if the prerequisite modules are not loaded. This would give the user a descriptive error message if the prerequisites are not met: I would find this an acceptable solution, but I understand that it may be disruptive to some existing workflows.

  2. If the load is required for some workflows, the "overwrite" behavior can be eliminated with a conditional statement:

    if ( not isloaded("netcdf") ) then load("netcdf") end

    This will only load netcdf if there is not another netcdf module already loaded.

Apologies for the long-winded issue, but I wanted to be sure that the entire problem was covered. I am willing to open a PR to fix this issue, but as I mention above, I'm not sure what the best solution is, so I hope for some wider feedback. Let me know if anything else is needed.

climbfuji commented 2 years ago

Mike,

we do this before we build the stack, you may want to do the same when building the app:

module purge module unuse /glade/u/apps/ch/modulefiles/default/compilers export MODULEPATH_ROOT=/glade/p/ral/jntp/GMTB/tools/compiler_mpi_modules module use /glade/p/ral/jntp/GMTB/tools/compiler_mpi_modules/compilers

Basically what this is doing is replacing CISL’s compiler modules with identical ones, except that the module tree doesn’t have the hierarchical modules that CISL provides (it only has compilers, mpi libs and the ncarcompilers modules). You do NOT want to load the default netCDF module from CISL if you use hpc-stack, which has its own netCDF (and the entire stack being built against it for consistency).

Dom

On Jan 17, 2022, at 5:42 PM, Michael Kavulich @.***> wrote:

Describe the bug

The short summary: using load(modulename) in Lua modulefiles will not necessarily get you the module you expect. This is currently causing failures of the Short Range Weather Application on Cheyenne, and could theoretically result in unexpected behavior on any platform, up to and including build and runtime failures

Many NCEPLIBS have dependencies on each other, this is obvious. Currently, however, there doesn't seem to be any uniformity in how these dependencies are handled; and some of the current methods are less than ideal. Here are the various methods used:

ATLAS simply uses prereq, which causes the module load command to fail if the prerequisites are not already loaded. This seems like a natural way to handle required modules: https://github.com/NOAA-EMC/hpc-stack/blob/44c8e077a21099fcc1883cd297783cdfa6c08b23/modulefiles/mpi/compilerName/compilerVersion/mpiName/mpiVersion/atlas/atlas.lua#L15-L18 https://github.com/NOAA-EMC/hpc-stack/blob/44c8e077a21099fcc1883cd297783cdfa6c08b23/modulefiles/mpi/compilerName/compilerVersion/mpiName/mpiVersion/atlas/atlas.lua#L15-L18 wgrib2 (in addition to others) uses a load command prior to the prereq. In addition to this being redundant (both load and prereq will fail if the module does not exist, and if it does exist, prereq will always succeed after a load of the same module), this has a dangerous side-effect of being able to overwrite the existing environment unexpectedly: if, for example, netcdf/4.7.4 is loaded from earlier in the hpc-stack process, but the system already has a built-in netcdf module, the load(netcdf) line will instead load that system default netcdf module, which may be a different version and incompatible with other hpc-stack modules! https://github.com/NOAA-EMC/hpc-stack/blob/44c8e077a21099fcc1883cd297783cdfa6c08b23/modulefiles/mpi/compilerName/compilerVersion/mpiName/mpiVersion/wgrib2/wgrib2.lua#L15-L18 https://github.com/NOAA-EMC/hpc-stack/blob/44c8e077a21099fcc1883cd297783cdfa6c08b23/modulefiles/mpi/compilerName/compilerVersion/mpiName/mpiVersion/wgrib2/wgrib2.lua#L15-L18 MADIS (in addition to others) uses always_load in addition to prereq. This appears to be an even worse option to use, since in addition to the above-mentioned problems, module unload will attempt to load netCDF rather than unloading it; there doesn't seem to be any reason to do this. https://github.com/NOAA-EMC/hpc-stack/blob/44c8e077a21099fcc1883cd297783cdfa6c08b23/modulefiles/mpi/compilerName/compilerVersion/mpiName/mpiVersion/madis/madis.lua#L15-L18 https://github.com/NOAA-EMC/hpc-stack/blob/44c8e077a21099fcc1883cd297783cdfa6c08b23/modulefiles/mpi/compilerName/compilerVersion/mpiName/mpiVersion/madis/madis.lua#L15-L18 To Reproduce On Cheyenne, run these commands and observe the environment changes:

module purge module load ncarenv/1.3 module load gnu/9.1.0 module load mpt/2.22 module load ncarcompilers/0.5.0 module use /glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.1.0/modulefiles/stack module load hpc/1.1.0 module load hpc-gnu/9.1.0 module load hpc-mpt/2.22 module load netcdf/4.7.4 module list

Currently Loaded Modules: 1) ncarenv/1.3 3) hpc/1.1.0 5) hpc-gnu/9.1.0 7) hpc-mpt/2.22 9) netcdf/4.7.4 2) ncarcompilers/0.5.0 4) gnu/9.1.0 6) mpt/2.22 8) hdf5/1.10.8

module show netcdf

/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.1.0/modulefiles/mpi/gnu/9.1.0/mpt/2.22/netcdf/4.7.4.lua:

module load wgrib2/2.0.8

The following have been reloaded with a version change: 1) netcdf/4.7.4 => netcdf/4.8.1

module list

Currently Loaded Modules: 1) ncarenv/1.3 3) hpc/1.1.0 5) hpc-gnu/9.1.0 7) hpc-mpt/2.22 9) wgrib2/2.0.8 2) ncarcompilers/0.5.0 4) gnu/9.1.0 6) mpt/2.22 8) netcdf/4.8.1

module show netcdf

/glade/u/apps/ch/modulefiles/default/gnu/9.1.0/netcdf/4.8.1.lua:

What is happening here is that wgrib2's load(netcdf) command has no consideration for if the user has already loaded a version of netcdf. So because a netcdf system default module exists on Cheyenne, the module rules for wgrib2 cause the existing netcdf/4.7.4 from hpc-stack to be unloaded, and replaced by the system default netcdf/4.8.1. This causes mismatches in libraries that ultimately results in runtime failures for the Short Range Weather Application; see that issue for more details: ufs-community/ufs-srweather-app#201 https://github.com/ufs-community/ufs-srweather-app/issues/201 Expected behavior One would expect that if you already have a module loaded, that choice will not be overwritten. If a subsequent module is not compatable with your currently loaded modules, it should fail with a descriptive message, not overwrite the current environment. This is especially problematic for systems that already have a built-in module with the same name of one used in hpc-stack: in the case of Cheyenne, this is netCDF, but the problem could manifest on any platform with different versions of the same libraries used in hpc-stack.

System: Cheyenne, but this issue may occur on any platform that has built-in modules with the same names as those found in hpc-stack.

Additional context As far as I can see there are two potential fixes to this.

Simply remove the load command from each .lua module that contains a prereq command, allowing the module to fail if the prerequisite modules are not loaded. This would give the user a descriptive error message if the prerequisites are not met: I would find this an acceptable solution, but I understand that it may be disruptive to some existing workflows.

If the load is required for some workflows, the "overwrite" behavior can be eliminated with a conditional statement:

if ( not isloaded("netcdf") ) then load("netcdf") end This will only load netcdf if there is not another netcdf module already loaded.

Apologies for the long-winded issue, but I wanted to be sure that the entire problem was covered. I am willing to open a PR to fix this issue, but as I mention above, I'm not sure what the best solution is, so I hope for some wider feedback. Let me know if anything else is needed.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-EMC/hpc-stack/issues/376, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5C2RJJNHAIIBYW534WUWDUWSZQPANCNFSM5MF323VA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you are subscribed to this thread.

kgerheiser commented 2 years ago

@mkavulich does @climbfuji's suggestion help?

christopherwharrop-noaa commented 2 years ago

@climbfuji @kgerheiser @mkavulich - I am now running into the same issue trying to build RRFS on Cheyenne. What happens is that the hpc-stack module file for NetCDF on Cheyenne is doing:

harrop/ufs-srweather-app.RRFS_dev1-refactor> module show netcdf/4.7.4
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   /glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/modulefiles/mpi/intel/2021.2/mpt/2.22/netcdf/4.7.4.lua:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
help([[]])
conflict("netcdf")
load("hdf5")
prereq("hdf5")
prepend_path("PATH","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4/bin")
prepend_path("LD_LIBRARY_PATH","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4/lib")
prepend_path("DYLD_LIBRARY_PATH","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4/lib")
prepend_path("CPATH","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4/include")
prepend_path("MANPATH","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4/share/man")
setenv("NETCDF","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4")
setenv("NETCDF_ROOT","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4")
setenv("NETCDF_INCLUDES","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4/include")
setenv("NETCDF_LIBRARIES","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4/lib")
setenv("NETCDF_VERSION","4.7.4")
setenv("NetCDF","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4")
setenv("NetCDF_ROOT","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4")
setenv("NetCDF_INCLUDES","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4/include")
setenv("NetCDF_LIBRARIES","/glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/intel-2021.2/mpt-2.22/netcdf/4.7.4/lib")
setenv("NetCDF_VERSION","4.7.4")
whatis("Name: netcdf")
whatis("Version: 4.7.4")
whatis("Category: library")
whatis("Description: NetCDF4 C, CXX and Fortran library")

The load("hdf5") in that module file causes the already loaded HDF5 from hpc-stack to be replaced by the NCAR system default one, which is wrong. A similar thing happens with NetCDF in a different modulefile. Right now, the env/build_cheyenne_intel.env file for the ufs-srweather-app does this:

module purge

module load cmake/3.18.2
module load ncarenv/1.3
module load intel/2021.2
module load mpt/2.22
module load ncarcompilers/0.5.0
module unload netcdf

module use /glade/p/ral/jntp/GMTB/tools/hpc-stack-v1.2.0/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2021.2
module load hpc-mpt/2.22

module load srw_common

export CMAKE_C_COMPILER=mpicc
export CMAKE_CXX_COMPILER=mpicxx
export CMAKE_Fortran_COMPILER=mpif90
export CMAKE_Platform=cheyenne.intel

I'd like to try out the suggestion @climbfuji suggested, but it's not clear to me which parts of the build env file I need to keep.

climbfuji commented 2 years ago

Basically, what you need to do is to insert this block between module purge and module load cmake, keep everything else as is:

module unuse /glade/u/apps/ch/modulefiles/default/compilers
export MODULEPATH_ROOT=/glade/p/ral/jntp/GMTB/tools/compiler_mpi_modules
module use /glade/p/ral/jntp/GMTB/tools/compiler_mpi_modules/compilers

This should work ...

climbfuji commented 2 years ago

If it doesn't, then we need to mirror additional modulefiles from the CISL module tree to the one under GMTB/tools.

christopherwharrop-noaa commented 2 years ago

Thanks @climbfuji. Only problem for me now is that I need the MKL module (for GSI) and those lines prevent it from finding it.

climbfuji commented 2 years ago

Thanks @climbfuji. Only problem for me now is that I need the MKL module (for GSI) and those lines prevent it from finding it.

Try again please. If there is anything else missing, please let me know. You should also check whether the module unload netcdf is still needed (it shouldn't be loaded at this point).

christopherwharrop-noaa commented 2 years ago

@climbfuji - Initial test works! And, no, I do not think the module unload netcdf is needed, but I'm going to double check it. Because I've done a LOT of fooling around trying to debug, and may have corrupted my shell environment, I'm going to redo my test from a fresh login shell and double check it to make sure it all still works. Stand by...

christopherwharrop-noaa commented 2 years ago

Ok. I think I'm good for what I was doing. However, I may have hijacked @mkavulich 's original question a little. I think his point still stands as a general issue and probably still needs to be addressed. I sure do appreciate the help with the Cheyenne workaround, though. Many thanks.

kgerheiser commented 2 years ago

418 should address @mkavulich's original issue. This will let users choose their own version without LMod overriding it.

christopherwharrop-noaa commented 2 years ago

One potential downside to this solution is that those three lines:

module unuse /glade/u/apps/ch/modulefiles/default/compilers
export MODULEPATH_ROOT=/glade/p/ral/jntp/GMTB/tools/compiler_mpi_modules
module use /glade/p/ral/jntp/GMTB/tools/compiler_mpi_modules/compilers

modify the user's environment in a semi-irreversible way. What I mean is, they change the user's environment such that they can't easily go back to the standard NCAR environment. I think we will probably soon see all of the App env files become module files soon, which, if done correctly, should help with that, but we don't have that yet.

mkavulich commented 2 years ago

@kgerheiser Thank you, #418 looks like a good solution to this problem. "depends_on" looks like a better flag than "prereq", although the latter would also work. I do not like the idea of "module unuse" for the reasons @christopherwharrop-noaa mentioned, it is a destructive operation to the user's environment. Regardless, this issue should be handled at its source here in the modulefiles, not by requiring users to fiddle with their environment.

mkavulich commented 2 years ago

Now that #418 has been merged, is there any chance we can get a bug-fix release of hpc-stack to build in place of the current versions? The current hpc-stack on disk for Cheyenne, Hera, etc. still has this old bug so that modules can "hijack" the loaded versions other modules, and it currently makes it so UFS_UTILS can not build by default on Cheyenne (because the default g2 library is too old, and the UPP module over-writes the correctly loaded version with the default version). On Hera, the correct version of g2 just happens to be default, so builds are succeeding but only by chance!

climbfuji commented 2 years ago

Is the solution to simply remove the default symlinks and require the application to explicitly specify the version?