NOAA-EMC / hpc-stack

Create a software stack for HPC's
GNU Lesser General Public License v2.1

Compiling with hpc-stack a WW3 wave model program produces empty grib2 files #137

Closed RobertoPadilla-NOAA closed 3 years ago

RobertoPadilla-NOAA commented 3 years ago

Describe the bug
Building the program that produces grib2 files from the wave component of the coupled system with hpc-stack succeeds, but the resulting executable produces empty grib2 files. If hpc-stack is not used and the modules are loaded separately, the executable produces valid grib2 files.

To Reproduce
Install and run the test as described at https://github.com/NOAA-EMC/global-workflow/blob/feature/coupled-crow/README.md, except for the first steps, "Checkout the source code and scripts". Use the following instructions instead:

    git clone https://github.com/Jessica-Meixner-NOAA/global-workflow coupled-workflow
    cd coupled-workflow
    git checkout feature/p5ww3post
    git submodule update --init --recursive   # Update submodules

Then follow the instructions:

    cd sorc
    sh checkout.sh coupled      # Check out the coupled code, EMC_post, gsi, ...
    sh build_ncep_post.sh       # This command will build ncep_post
    sh build_ww3prepost.sh      # This command will build ww3 prep and post exes
    sh build_fv3_coupled.sh     # This command will build ufs-s2s-model
    sh build_reg2grb2.sh        # This command will build exes for ocean-ice post

To link fixed files and executable programs for the coupled application:

On Hera:
    sh link_fv3gfs.sh emc hera coupled
On Orion:
    sh link_fv3gfs.sh emc orion coupled

    cd ../workflow
    cp user.yaml.default user.yaml

Then open and edit user.yaml:

EXPROOT: Place for experiment directory, make sure you have write access.
FIX_SCRUB: True if you would like to fix the path to ROTDIR (under COMROOT) and RUNDIR (under DATAROOT); False if you would like CROW to detect available disk space automatically.
*** Please use FIX_SCRUB: True on Hera/Orion until further notice (2020/03)
COMROOT: Place to generate ROTDIR for this experiment.
DATAROOT: Place for temporary storage for each job of this experiment.
cpu_project: cpu project that you are working with.
hpss_project: hpss project that you are working with.

IMPORTANT: the next step is different from the git page.

On Hera:
    ./setup_case.sh -p HERA ../cases/coupled_free_forecast_wave.yaml test2d
On Orion:
    ./setup_case.sh -p ORION ../cases/coupled_free_forecast_wave.yaml test2d

This will create an experiment directory ($EXPERIMENT_DIRECTORY). In the current example, $EXPERIMENT_DIRECTORY=$EXPROOT/test2d.

For Orion: first make sure you have Python loaded:

    module load contrib
    module load rocoto          # Make sure to use 1.3.2
    module load intelpython3

./make_rocoto_xml_for.sh $EXPERIMENT_DIRECTORY

Run the model using the workflow:

    cd $EXPERIMENT_DIRECTORY
    module load rocoto

Run rocotorun several times until all processes are done:

    rocotorun -w workflow.xml -d workflow.db

Check the status of your test:

    rocotostat -w workflow.xml -d workflow.db

You'll find the grib2 files in the directory you created:

    cd $COMROOT/test2d/gfs.20130401/00/wave/gridded
    wgrib2 -V gfswave.t00z.global.0p50.f000.grib2

You'll see that for every wave variable the min, average, and max have the same value.
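One way to spot this symptom without reading the whole listing is to scan the wgrib2 -V output for records whose min and max are equal. The following is a minimal sketch, not part of the original report: the function name flag_constant_fields is hypothetical, and it assumes the statistics line has the ndata=...:min=...:max=... layout that appears in the wgrib2 -V output posted later in this thread.

```shell
# Hypothetical helper: reads `wgrib2 -V` text on stdin and prints the header
# line of every record whose min and max are numerically equal.
flag_constant_fields() {
  awk '
    # A record header starts with "<record number>:<byte offset>:".
    /^[0-9]+:[0-9]+:/ { header = $0; next }
    # The statistics line carries min= and max= as colon-separated fields.
    /min=/ && /max=/ {
      min = ""; max = ""
      n = split($0, parts, ":")
      for (i = 1; i <= n; i++) {
        if (parts[i] ~ /^min=/) min = substr(parts[i], 5)
        if (parts[i] ~ /^max=/) max = substr(parts[i], 5)
      }
      if (min != "" && max != "" && min + 0 == max + 0) print header
    }
  '
}
# Example:
#   wgrib2 -V gfswave.t00z.global.0p50.f000.grib2 | flag_constant_fields
```

A field that is legitimately constant would also be listed, so this is only a quick screen; an empty result is the more informative outcome.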

Expected behavior
Produce valid grib2 files from the wave model using the modules from hpc-stack.

System: Hera and Orion

JessicaMeixner-NOAA commented 3 years ago

@RobertoPadilla-NOAA can we give them a smaller test case where they don't have to run the whole workflow?

Also updates from my fork that you have pointed them to have long since gone back into feature/coupled-crow. I'd prefer that people are not using that anymore.

aerorahul commented 3 years ago

Please provide a single standalone script to reproduce the behavior.

kgerheiser commented 3 years ago

Which hpc-stack are you using? I just ran wgrib2 -V on a random grib2 file I have and it returned what I think are correct answers.

There was a bug in our wgrib2 build, but it has since been fixed in a newer version of hpc-stack.

Here is my output:

53:103480536:vt=2021011206:surface:anl:HPBL Planetary Boundary Layer Height [m]:
    ndata=4718592:undef=0:mean=564.815:min=17.8054:max=4805.09
    grid_template=40:winds(N/S):
    Gaussian grid: (3072 x 1536) units 1e-06 input WE:NS output WE:SN
    number of latitudes between pole-equator=768 #points=4718592
    lat 89.910324 to -89.910324
    lon 0.000000 to 359.882813 by 0.117188

54:109292212:vt=2021011206:surface:anl:LAND Land Cover (0=sea, 1=land) [Proportion]:
    ndata=4718592:undef=0:mean=0.337744:min=0:max=1
    grid_template=40:winds(N/S):
    Gaussian grid: (3072 x 1536) units 1e-06 input WE:NS output WE:SN
    number of latitudes between pole-equator=768 #points=4718592
    lat 89.910324 to -89.910324
    lon 0.000000 to 359.882813 by 0.117188

55:109385913:vt=2021011206:surface:anl:ICEC Ice Cover [Proportion]:
    ndata=4718592:undef=0:mean=0.108133:min=0:max=1
    grid_template=40:winds(N/S):
    Gaussian grid: (3072 x 1536) units 1e-06 input WE:NS output WE:SN
    number of latitudes between pole-equator=768 #points=4718592
    lat 89.910324 to -89.910324
    lon 0.000000 to 359.882813 by 0.117188

JessicaMeixner-NOAA commented 3 years ago

@kgerheiser those are atm grib files, not wave grib files. I've tried with both 1.0.0 and 1.1.0, without success.

kgerheiser commented 3 years ago

I thought it might be an issue with the -V option of the executable, but if it's still broken with v1.1.0 then that's a problem. I thought that might fix it.

RobertoPadilla-NOAA commented 3 years ago

Jessica, I don't have a smaller test case. Do you? If not, I'll have to work on building one.

Roberto

kgerheiser commented 3 years ago

The command used to create the grib file would be a good start

JessicaMeixner-NOAA commented 3 years ago

No, @RobertoPadilla-NOAA, but you made multiple tests without the workflow, so I assumed you had something. I think we want the test to be super small and simple: just the binary output from a model run, the ww3_grib.inp file, and a simple script for building the ww3_grib exe they need to run, one way with the non-hpc-stack modules that work and one way with the hpc-stack modules.

RobertoPadilla-NOAA commented 3 years ago

Ok @kgerheiser @aerorahul, I'll get back to you once I have the small test ready.

RobertoPadilla-NOAA commented 3 years ago

@kgerheiser @aerorahul I was working on the canned test for you on Hera, but now the hpc-stack modules cannot be found. This is probably related to the data loss this morning (do you know if this is true?). On Orion, looking in detail, hpc-stack was not the issue; it was the version of the jasper module. I replaced jasper/2.0.15 with jasper/1.900.1, and ww3_grib works properly using hpc-stack.

kgerheiser commented 3 years ago

If they weren't working before they seem to be working now. I just tried loading the modules on Hera.

climbfuji commented 3 years ago

> If they weren't working before they seem to be working now. I just tried loading the modules on Hera.

I had success loading them on one of the login nodes (hfe11), but compiling on the compute nodes failed. Maybe some compute nodes lost their /scratch1 mounts?

kgerheiser commented 3 years ago

@RobertoPadilla-NOAA you changed Jasper/2.0.15 to Jasper/1.900.1, or 1.900.1 to 2.0.15? That's something that should be investigated.

RobertoPadilla-NOAA commented 3 years ago

@kgerheiser on Orion I changed jasper/2.0.15 to jasper/1.900.1 in order to build ww3_grib properly.

RobertoPadilla-NOAA commented 3 years ago

On Hera I'm working on scratch1 and compiling on the login nodes hfe04 and hfe10, and loading hpc-stack fails.

kgerheiser commented 3 years ago

In what way does it fail?

I'm on hfe04 on scratch1 at /scratch1/NCEPDEV/nems/Kyle.Gerheiser

I run:

    module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
    module load hpc/1.1.0
    module load hpc-intel
    (etc.)

And it works.

RobertoPadilla-NOAA commented 3 years ago

I don't know what is happening on Hera. Several days ago I was using the file /scratch1/NCEPDEV/stmp2/Roberto.Padilla/GitHub/WW3_hpc-satck_test/modulefiles/modulefile.ww3.hera_Original to load hpc-stack. It contains:

    module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/modulefiles/stack
    module load hpc/1.0.0

It was working. Now that path doesn't work; the one you sent has an extra "hpc-stack" in the path. Also notice the hpc module version: 1.0.0 was loading. That file (modulefile.ww3.hera_Original) was loading all modules; the issue was that it produced a ww3_grib executable that wrote empty grib2 files.

Ok, now I changed the path and loaded the modules on the command line:

    [Roberto.Padilla@hfe10 Run_test]$ module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
    [Roberto.Padilla@hfe10 Run_test]$ module load hpc/1.1.0
    [Roberto.Padilla@hfe10 Run_test]$ module load hpc-intel/18.0.5.274
    [Roberto.Padilla@hfe10 Run_test]$ module load hpc-impi/2018.0.4
    [Roberto.Padilla@hfe10 Run_test]$ module load jasper/2.0.15
    Lmod has detected the following error: The following module(s) are unknown: "jasper/2.0.15"

    Please check the spelling or version number. Also try "module spider ..."
    It is also possible your cache file is out-of-date; it may help to try:
      $ module --ignore-cache load "jasper/2.0.15"

    Also make sure that all modulefiles written in TCL start with the string #%Module

Thanks, Roberto

climbfuji commented 3 years ago

Please note that there was a filesystem problem last night, resulting in about 45TB of corrupted (that is, lost) data.

RobertoPadilla-NOAA commented 3 years ago

    [Roberto.Padilla@hfe10 Run_test]$ module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
    [Roberto.Padilla@hfe10 Run_test]$ module load bacio/2.4.0
    Lmod has detected the following error: The following module(s) are unknown: "bacio/2.4.0"

    Please check the spelling or version number. Also try "module spider ..."
    It is also possible your cache file is out-of-date; it may help to try:
      $ module --ignore-cache load "bacio/2.4.0"
    Lmod has detected the following error: The following module(s) are unknown: "g2/3.4.0"
    Lmod has detected the following error: The following module(s) are unknown: "ip/3.3.0"
    Lmod has detected the following error: The following module(s) are unknown: "nemsio/2.5.1"

aerorahul commented 3 years ago

@RobertoPadilla-NOAA bacio needs the intel compiler module loaded.

module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
module avail
module load hpc/1.1.0
module load hpc-intel
module load bacio/2.4.1
module list

Currently Loaded Modules:
  1) hpc/1.1.0   2) intel/18.0.5.274   3) hpc-intel/18.0.5.274   4) bacio/2.4.1

RobertoPadilla-NOAA commented 3 years ago

@climbfuji, yes, that was my question in my first comment today: was the filesystem failure affecting the loading of hpc-stack?
None of the modules load using module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/modulefiles/stack

aerorahul commented 3 years ago

> @climbfuji, yes, that was my question in my first comment today: was the filesystem failure affecting the loading of hpc-stack? None of the modules load using module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/modulefiles/stack

@RobertoPadilla-NOAA You are not using the modules correctly. You are pointing at the stack with module use, but you still need to load the modules with module load in a hierarchical manner.

The correct use of hpc-stack and the software stack underneath it is outlined here
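The hierarchical order described above (stack meta-module, then compiler, then MPI, then libraries) can be sketched as a small wrapper. This is only an illustration: load_in_order is a hypothetical name, and it assumes that module load returns a non-zero status when a module is unknown, which is how Lmod's shell function generally behaves.

```shell
# Hypothetical helper: load modules strictly in the order given and stop at
# the first one that fails, so a missing compiler or MPI layer is reported
# before any library that depends on it.
load_in_order() {
  for mod in "$@"; do
    if ! module load "$mod"; then
      echo "failed to load '$mod'; check 'module avail' for available versions" >&2
      return 1
    fi
  done
}
# Example (stack path and versions are the ones quoted in this thread):
#   module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
#   load_in_order hpc/1.1.0 hpc-intel/18.0.5.274 hpc-impi/2018.0.4 bacio/2.4.1
```

Stopping at the first failure keeps the error message next to the layer that actually broke, instead of a cascade of "unknown module" errors for every library below it.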

kgerheiser commented 3 years ago

The new version of hpc-stack has also updated some libraries (for example, bacio is now version 2.4.1), which is why it's not finding the versions you specified. The updated libraries should have no effect on your code or change your results (they are mainly build system changes).

JessicaMeixner-NOAA commented 3 years ago

@RobertoPadilla-NOAA do I need to help make the test case or a file using the new module versions of hpc-stack?

kgerheiser commented 3 years ago

I'm not sure about what you were originally using at /scratch2/NCEPDEV/nwprod/hpc-stack/libs/modulefiles/stack as there's nothing there anymore. I didn't touch it, and it's on scratch2 which supposedly isn't affected by the data loss.

That seems to be an old version of hpc-stack (which would also contribute to your wgrib2 problem).

I would update to the most recent version of hpc-stack if you can, if only to get the updated wgrib2.

RobertoPadilla-NOAA commented 3 years ago

@JessicaMeixner-NOAA if you can help make the file with the new module versions of hpc-stack, that would be great. Thanks.

kgerheiser commented 3 years ago

Try this:

#%Module######################################################################
## module for ww3 before base uses hpc-stack
module use /contrib/sutils/modulefiles
module load sutils

module load cmake/3.16.1

module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack

module load hpc/1.1.0
module load hpc-intel/18.0.5.274
module load hpc-impi/2018.0.4

module load jasper/2.0.22
module load zlib/1.2.11
module load png/1.6.35

module load hdf5/1.10.6
module load netcdf/4.7.4
module load esmf/8_1_0_beta_snapshot_27

module load bacio/2.4.1
module load crtm/2.3.0
module load g2/3.4.1
module load g2tmpl/1.9.1
module load ip/3.3.3
module load upp/10.0.0
module load nemsio/2.5.2
module load sp/2.3.3
module load w3emc/2.7.3
module load w3nco/2.4.1

RobertoPadilla-NOAA commented 3 years ago

@kgerheiser only g2 failed to load:

    Lmod has detected the following error: The following module(s) are unknown: "g2/3.4.0"

aerorahul commented 3 years ago

@RobertoPadilla-NOAA The g2 version is 3.4.1 in @kgerheiser message ⬆️

kgerheiser commented 3 years ago

I stealthily edited that after I put in 3.4.0 :)

RobertoPadilla-NOAA commented 3 years ago

All modules have been loaded and all programs have been compiled, but the link step has failed. I will finish the canned tests and get back to you.

RobertoPadilla-NOAA commented 3 years ago

All, a test that reproduces the issue (it uses hpc-stack) and a test where the issue has been solved (no hpc-stack) are available on Hera at /scratch1/NCEPDEV/stmp2/Roberto.Padilla/GitHub/WW3_hpc-stack_test. Inside the subdirectory Run_test/ you will find two scripts to run the cases. In the main directory (WW3_hpc-stack_test) you'll find a README file that explains the content of the subdirectories and where to find the input and the final product from ww3_grib (the executable that should produce the grib2 files).

kgerheiser commented 3 years ago

I was able to run it.

Where can I find the actual wgrib2 commands used to create the file?

Ideally, I would run the same wgrib2 command with the relevant input to produce the file.

Also, you think it's related to Jasper? And it works with Jasper 1.9?

aerorahul commented 3 years ago

> I was able to run it.
>
> Where can I find the actual wgrib2 commands used to create the file?
>
> Also, you think it's related to Jasper? And it works with Jasper 1.9?

I don't think they are running wgrib2? I think they are building an executable ww3_grib and running that on a gribfile to produce some more grib files.

I am also looking for what the command actually is and what is being used to build ww3_grib. I can see they are setting FC to gfortran. Why? I don't know.

kgerheiser commented 3 years ago

Oh, I was under the impression it was a wgrib2 issue.

JessicaMeixner-NOAA commented 3 years ago

Yes, we run ww3_grib, which works with the non-hpc-stack modules and creates empty grib files with the hpc-stack modules. So this is likely some module-setting, compile, or linking issue.

@aerorahul the FC to gfortran is for the programs that pre-process the *.ftn files into the fortran files. It's not what the actual code is using.

JessicaMeixner-NOAA commented 3 years ago

@RobertoPadilla-NOAA which version of WW3 is this? I can't tell, either because of a permissions issue or because it's not actually a git repo (which I think might be the case, as I don't see parts of the repository here).

aerorahul commented 3 years ago

can you please show the output of make VERBOSE=ON when building ww3_grib? It sure looks like that should be somewhere in w3_make.

RobertoPadilla-NOAA commented 3 years ago

@JessicaMeixner-NOAA the WW3 version is the one that we are using in the coupled system. You don't see parts of the git repository because you suggested (and it was a good suggestion) making a simple test rather than running the coupled system. I just copied WW3/model from sorc/fv3_coupled.fd/WW3/model, eliminating all other directories.

@aerorahul @kgerheiser The executable ww3_grib produces grib2 files from binary files. Please take into account that the compilation of the WW3 program (ww3_grib) is exactly the same for the non-hpc-stack test as for the hpc-stack test; whether or not hpc-stack is used is the only difference.

@kgerheiser It could also be related to jasper, but on Hera we don't have jasper/1.900.1 as we do on Orion.

JessicaMeixner-NOAA commented 3 years ago

@RobertoPadilla-NOAA simplifying so that the entire global-workflow is not needed does not mean that we need to simplify so much that we don't even include the entire WW3 code, so that we cannot do things like "git log" and see which git hash you are using. "The version that is in the coupled system" is not specific enough of a version.
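If a test case has to ship without the .git directory, the commit can still be recorded before the copy is made. The sketch below is one way to do that; record_source_version and the output file name are hypothetical, and the only git call assumed is git -C <dir> rev-parse HEAD.

```shell
# Hypothetical helper: record which commit a source tree was copied from.
# Falls back to "unknown" when the directory is not a git checkout
# (or git itself is unavailable).
record_source_version() {
  src_dir="$1"
  out_file="$2"
  hash=$(git -C "$src_dir" rev-parse HEAD 2>/dev/null) || hash="unknown"
  printf 'source_dir=%s\ncommit=%s\n' "$src_dir" "$hash" > "$out_file"
}
# Example: run against the full checkout before copying WW3/model out of it.
#   record_source_version sorc/fv3_coupled.fd/WW3 WW3_VERSION.txt
```

Keeping that file next to the stripped-down sources lets anyone rebuild the test against the full repository at the same hash.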

arunchawla-NOAA commented 3 years ago

Did we ever figure the issue out for this?

aerorahul commented 3 years ago

I am having difficulty inserting debugging elements in the build for this program. I asked earlier how to turn on make VERBOSE=ON, but have not heard back.

JessicaMeixner-NOAA commented 3 years ago

@RobertoPadilla-NOAA told me he was taking care of providing more verbose output (we don't have a VERBOSE=ON option in WW3).

However, I'm really curious how there is a case on orion that does work with hpc-stack. That gives me hope this is a simple issue.

JessicaMeixner-NOAA commented 3 years ago

Based on a PR from Roberto to the global-workflow feature/coupled-crow branch (https://github.com/NOAA-EMC/global-workflow/pull/248), which uses hpc-stack 1.0.0 on Orion, the issue seems to come from the Jasper version: jasper/2.0.15 does not work and jasper/1.900.1 does.

@arunchawla-NOAA mentioned that some features were deprecated and that warnings in 1.900.1 might now be errors in 2.0.15; that should be looked at.

I'm currently copying Roberto's test case on Hera and adding the verbose options, and will post that shortly. Since 1.900.1 versus 2.0.15 seems to possibly be the issue, and both versions are on Orion but not on Hera, I'm also going to set up a test there to compare the jasper versions. On Hera, for 2.0.15 versus the older version, one issue is that we define for the build:

    JASPER_LIB=$JASPER_ROOT/lib/libjasper.a

but for 2.0.15 it should be:

    JASPER_LIB=$JASPER_ROOT/lib64/libjasper.a

which might also be a contributing issue. @RobertoPadilla-NOAA please update the issue with any additional information you might have.

RobertoPadilla-NOAA commented 3 years ago

@JessicaMeixner-NOAA I have no more information than what was already provided.

arunchawla-NOAA commented 3 years ago

If the path is the problem, then where is the JASPER_LIB path being defined? In the main module file? I suggest we take care of it there, and then we can close this ticket. Why did we not get a "library not found" failure in the build if the path was wrong?

JessicaMeixner-NOAA commented 3 years ago

Actually, I saw that in Roberto's scripts he has the JASPER_LIB path defined correctly; I just didn't in my case. WW3 previously expected JASPER_LIB and other variables to be defined by the modules. Since they aren't now, we define them in the build script in global-workflow.
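Since the library sits under lib/ for one Jasper version and lib64/ for the other, a build script could resolve the path instead of hard-coding it, and fail early if neither exists. A minimal sketch, assuming only that JASPER_ROOT points at the install prefix; resolve_jasper_lib is a hypothetical name and is not taken from the global-workflow build scripts.

```shell
# Hypothetical helper: print the path to libjasper.a under a Jasper install
# prefix, checking lib64/ before lib/. Returns non-zero if neither exists.
resolve_jasper_lib() {
  root="$1"
  for dir in lib64 lib; do
    if [ -f "$root/$dir/libjasper.a" ]; then
      printf '%s\n' "$root/$dir/libjasper.a"
      return 0
    fi
  done
  echo "libjasper.a not found under $root/lib64 or $root/lib" >&2
  return 1
}
# Example use in a build script:
#   JASPER_LIB=$(resolve_jasper_lib "$JASPER_ROOT") || exit 1
#   export JASPER_LIB
```

Failing at this point gives an explicit error before the build starts, rather than relying on the linker to report a missing library.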

JessicaMeixner-NOAA commented 3 years ago

@RobertoPadilla-NOAA can you please give me the hash of the WW3 version you used to make the test in /scratch1/NCEPDEV/stmp2/Roberto.Padilla/GitHub/WW3_hpc-stack_test? I'm trying to use your test with the full WW3, but I keep getting errors because of versions.

RobertoPadilla-NOAA commented 3 years ago

@JessicaMeixner-NOAA, the WW3 that I included in the test is a copy of the version that is in the coupled system.

    [Roberto.Padilla@hfe08 WW3]$ git log
    commit 9c22b13506e797940ebab538fe4a3940dd9e3fc0 (HEAD)
    Author: Ali.Abdolali 37336972+aliabdolali@users.noreply.github.com
    Date:   Mon Oct 19 10:30:32 2020 -0400

RobertoPadilla-NOAA commented 3 years ago

@arunchawla-NOAA @JessicaMeixner-NOAA Regarding Arun's question, "Why did we not get a failure for library not found in the build if the path was wrong?", and adding to Jessica's comment: for jasper/2.0.15 the library path was correct (JASPER_LIB=$JASPER_ROOT/lib64/libjasper.a), which is why we didn't get a "library not found" error. For jasper/1.900.1 the path is different (JASPER_LIB=$JASPER_ROOT/lib/libjasper.a), and this is the version that works on Orion.