NOAA-EMC / UPP

Build failing on some front nodes of Jet #919

Closed InnocentSouopgui-NOAA closed 3 months ago

InnocentSouopgui-NOAA commented 3 months ago

The UPP build is failing on some front-end nodes of Jet because some required modules fail to load.

InnocentSouopgui-NOAA commented 3 months ago

After the full migration to Rocky8, the spack-stack environment for CentOS7 is no longer available for builds. Everything has to be built using Rocky8 modules.

ulmononian commented 3 months ago

would you be able to post your steps & the error message you are receiving?

InnocentSouopgui-NOAA commented 3 months ago

The error can be reproduced by connecting to fe3, cloning the UPP repository, and building.

For instance, when I do so, I get the error message below.

[USER@fe3 tests]$ ./compile_upp.sh 
Building for machine jet_c, compiler intel
Lmod has detected the following error:  These module(s) or extension(s) exist but cannot be
loaded as requested: "cmake/3.23.1", "jasper/2.0.32"
   Try: "module spider cmake/3.23.1 jasper/2.0.32" to see how to load the module(s).

Executing this command requires loading "cmake/3.23.1" which failed while processing the following
module(s):

    Module fullname  Module Filename
    ---------------  ---------------
    jet_c            /mnt/lfs1/NESDIS/nesdis-rdo2/Innocent.Souopgui/devel/upp-rocky8/modulefiles/jet_c.lua

Executing this command requires loading "jasper/2.0.32" which failed while processing the following
module(s):

    Module fullname  Module Filename
    ---------------  ---------------
    upp_common       /mnt/lfs1/NESDIS/nesdis-rdo2/Innocent.Souopgui/devel/upp-rocky8/modulefiles/upp_common.lua
    jet_c            /mnt/lfs1/NESDIS/nesdis-rdo2/Innocent.Souopgui/devel/upp-rocky8/modulefiles/jet_c.lua

ulmononian commented 3 months ago

@InnocentSouopgui-NOAA i just tried the following on jet fe3:

module use /mnt/lfs1/NESDIS/nesdis-rdo2/Innocent.Souopgui/devel/upp-rocky8/modulefiles
ml jet

and everything loaded properly:

$ ml

Currently Loaded Modules:
  1) intel/2022.1.2                   10) zlib/1.2.13           19) parallel-netcdf/1.12.2  28) sp/2.5.0
  2) stack-intel/2021.5.0             11) libpng/1.6.37         20) parallelio/2.5.10       29) w3emc/2.10.0
  3) impi/2022.1.2                    12) pkg-config/0.27.1     21) bacio/2.4.1             30) nemsio/2.5.4
  4) stack-intel-oneapi-mpi/2021.5.1  13) hdf5/1.14.0           22) crtm-fix/2.4.0.1_emc    31) sigio/2.3.2
  5) nghttp2/1.57.0                   14) snappy/1.1.10         23) git-lfs/2.10.0          32) sfcio/1.4.1
  6) curl/8.4.0                       15) zstd/1.5.2            24) crtm/2.4.0.1            33) wrf-io/1.2.0
  7) cmake/3.23.1                     16) c-blosc/1.21.5        25) g2/3.4.5                34) upp_common
  8) libjpeg/2.1.0                    17) netcdf-c/4.9.2        26) g2tmpl/1.10.2           35) jet
  9) jasper/2.0.32                    18) netcdf-fortran/4.6.1  27) ip/4.3.0

can you share the clone & build steps you did?

InnocentSouopgui-NOAA commented 3 months ago

@ulmononian Two notes:

  1. /mnt/lfs1/NESDIS/nesdis-rdo2/Innocent.Souopgui/devel/upp-rocky8 already has the fix.
  2. The problem is in the build scripts.

When the compile script tests/compile_upp.sh is called, it uses the script tests/detect_machine.sh. That script loads jet_c for front ends fe[1-4] and jet for front ends fe[5-8]. jet_c is the modulefile that fails on Rocky8.
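
The mapping described above can be sketched as a small shell function. This is an illustration only, not the actual content of tests/detect_machine.sh; the function name `select_jet_module` and the idea of passing the hostname as an argument are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the front-end to modulefile mapping described above.
# select_jet_module is an illustrative name, not a function from the UPP repo.
select_jet_module() {
  local host="${1%%.*}"        # drop any domain suffix from the hostname
  case "${host}" in
    fe[1-4]) echo "jet_c" ;;   # fe1 to fe4 get the jet_c modulefile
    fe[5-8]) echo "jet" ;;     # fe5 to fe8 get the jet modulefile
    *)       echo "unknown"; return 1 ;;
  esac
}

select_jet_module fe3   # prints: jet_c
```

Under this mapping, any build started on fe1 to fe4 ends up loading jet_c, which matches the failure reported on fe3.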

To reproduce the problem, you will need to clone the UPP repository and call tests/compile_upp.sh, or load the module jet_c. I now have a freshly cloned repository at /mnt/lfs1/NESDIS/nesdis-rdo2/Innocent.Souopgui/devel/upp-develop:

$ module purge
$ module use /mnt/lfs1/NESDIS/nesdis-rdo2/Innocent.Souopgui/devel/upp-develop/modulefiles
$ module load jet_c

and get the error

Lmod has detected the following error:  These module(s) or extension(s) exist but
cannot be loaded as requested: "cmake/3.23.1", "jasper/2.0.32"
   Try: "module spider cmake/3.23.1 jasper/2.0.32" to see how to load the module(s).

Executing this command requires loading "cmake/3.23.1" which failed while processing the
following module(s):

    Module fullname  Module Filename
    ---------------  ---------------
    jet_c            /mnt/lfs1/NESDIS/nesdis-rdo2/Innocent.Souopgui/devel/upp-develop/modulefiles/jet_c.lua

Executing this command requires loading "jasper/2.0.32" which failed while processing the
following module(s):

    Module fullname  Module Filename
    ---------------  ---------------
    upp_common       /mnt/lfs1/NESDIS/nesdis-rdo2/Innocent.Souopgui/devel/upp-develop/modulefiles/upp_common.lua
    jet_c            /mnt/lfs1/NESDIS/nesdis-rdo2/Innocent.Souopgui/devel/upp-develop/modulefiles/jet_c.lua

while

$ module load jet

produces the expected result

ulmononian commented 3 months ago

@InnocentSouopgui-NOAA ok thank you. so it sounds like the detect_machine.sh script needs to be updated to use the rocky8 stack on fe1,3,4 (it should exclude fe2 since this remains a centos node for now). @FernandoAndrade-NOAA what is your take?

InnocentSouopgui-NOAA commented 3 months ago

That is right. I already updated the detect_machine.sh script in PR #920, which is part of a bigger effort to migrate Global Workflow to Rocky8 on Jet (NOAA-EMC/global-workflow#2377). As for excluding fe2, someone commented that by next Tuesday there will be no CentOS node left on Jet, but I can't find that comment anymore. So the question remains: should fe2 be excluded or not?

WenMeng-NOAA commented 3 months ago

@InnocentSouopgui-NOAA Given the final Jet Rocky8 transition next week, we don't need to exclude fe2. Your UPP PR #920 looks good to me.
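
For illustration, the outcome agreed on here (no special case for fe2, every front end on the Rocky8 stack) could look roughly like the sketch below. It is not the code from PR #920; the function name `select_jet_module_rocky8` is hypothetical, and the sketch assumes that the jet modulefile is the one that loads the Rocky8 stack, as the successful `module load jet` earlier in the thread suggests.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: once every Jet front end runs Rocky8, fe1 through fe8
# can all map to the same modulefile. This is not the actual code from PR #920.
select_jet_module_rocky8() {
  local host="${1%%.*}"        # drop any domain suffix from the hostname
  case "${host}" in
    fe[1-8]) echo "jet" ;;     # no special case for fe2
    *)       echo "unknown"; return 1 ;;
  esac
}

select_jet_module_rocky8 fe2   # prints: jet
```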

ulmononian commented 3 months ago

@InnocentSouopgui-NOAA @WenMeng-NOAA given this, it sounds like this issue is taken care of from the spack-stack perspective?

InnocentSouopgui-NOAA commented 3 months ago

Yes, there is already a new installation of spack-stack for Rocky8. I believe that during the transition, when some partitions and front ends ran CentOS7 and others ran Rocky8, what is now a bug was implemented on purpose: build automatically with the CentOS7 spack-stack on CentOS7 front ends and with the Rocky8 spack-stack on Rocky8 front ends. Now all front-end nodes except fe2 are running Rocky8, and fe2 is scheduled to move to Rocky8 soon.

You can have a look at PR #920, which solves this issue. It is part of a set of PRs to migrate Global Workflow to Rocky8 on Jet.