NOAA-EMC / hpc-stack

Create a software stack for HPC's
GNU Lesser General Public License v2.1

Compiling with hpc-stack a WW3 wave model program produces empty grib2 files #137

Closed RobertoPadilla-NOAA closed 3 years ago

RobertoPadilla-NOAA commented 3 years ago

Describe the bug

Compiling the program that produces grib2 files from the wave component of the coupled system with hpc-stack builds an executable, but that executable produces empty grib2 files. If hpc-stack is not used and the modules are loaded separately, the executable produces valid grib2 files.

To Reproduce

Install and run the test as described at https://github.com/NOAA-EMC/global-workflow/blob/feature/coupled-crow/README.md, except for the first steps, "Checkout the source code and scripts". For those, use the following instructions:

git clone https://github.com/Jessica-Meixner-NOAA/global-workflow coupled-workflow
cd coupled-workflow
git checkout feature/p5ww3post
git submodule update --init --recursive   # Update submodules

Follow the instructions:

cd sorc
sh checkout.sh coupled      # Check out the coupled code, EMC_post, gsi, ...
sh build_ncep_post.sh       # This command will build ncep_post
sh build_ww3prepost.sh      # This command will build ww3 prep and post exes
sh build_fv3_coupled.sh     # This command will build ufs-s2s-model
sh build_reg2grb2.sh        # This command will build exes for ocean-ice post

To link fixed files and executable programs for the coupled application:

On Hera:
sh link_fv3gfs.sh emc hera coupled

On Orion:
sh link_fv3gfs.sh emc orion coupled

cd ../workflow
cp user.yaml.default user.yaml

Then, open and edit user.yaml:

EXPROOT: Place for experiment directory, make sure you have write access.
FIX_SCRUB: True if you would like to fix the path to ROTDIR (under COMROOT) and RUNDIR (under DATAROOT); False if you would like CROW to detect available disk space automatically. *** Please use FIX_SCRUB: True on Hera/Orion until further notice (2020/03)
COMROOT: Place to generate ROTDIR for this experiment.
DATAROOT: Place for temporary storage for each job of this experiment.
cpu_project: cpu project that you are working with.
hpss_project: hpss project that you are working with.

IMPORTANT: the next step is different from the git page.

On Hera:
./setup_case.sh -p HERA ../cases/coupled_free_forecast_wave.yaml test2d

On Orion:
./setup_case.sh -p ORION ../cases/coupled_free_forecast_wave.yaml test2d

This will create an experiment directory ($EXPERIMENT_DIRECTORY). In the current example, $EXPERIMENT_DIRECTORY=$EXPROOT/test2d.

For Orion: first make sure you have python loaded:

module load contrib
module load rocoto          # Make sure to use 1.3.2
module load intelpython3

./make_rocoto_xml_for.sh $EXPERIMENT_DIRECTORY

Run the model using the workflow:

cd $EXPERIMENT_DIRECTORY
module load rocoto

Run rocotorun several times until all processes are done:
rocotorun -w workflow.xml -d workflow.db

Check the status of your test:
rocotostat -w workflow.xml -d workflow.db

You'll find the grib2 files in the directory you created:

cd $COMROOT/test2d/gfs.20130401/00/wave/gridded
wgrib2 -V gfswave.t00z.global.0p50.f000.grib2

You'll see that all wave variables (min, average, max) have the same value.
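
One way to spot this symptom without reading every record by eye is to compare the min and max that wgrib2 -V prints. The sketch below is not part of the workflow: the function name field_is_constant is hypothetical, and it assumes the symptom shows up as identical min and max within a record, printed in the `min=...:max=...` form.

```shell
# field_is_constant: read `wgrib2 -V` output on stdin and succeed (exit 0)
# if at least one record reports min equal to max, i.e. a constant field.
# Hypothetical helper; assumes "min=<v>:max=<v>" appears on a single line.
field_is_constant() {
    awk '
        match($0, /min=[^:]*:max=[^: ]*/) {
            stats = substr($0, RSTART, RLENGTH)   # e.g. "min=0:max=15.29"
            split(stats, kv, ":")
            sub(/^min=/, "", kv[1])
            sub(/^max=/, "", kv[2])
            if (kv[1] + 0 == kv[2] + 0) found = 1 # numeric comparison
        }
        END { exit(found ? 0 : 1) }
    '
}
```

It could be used as `wgrib2 -V gfswave.t00z.global.0p50.f000.grib2 | field_is_constant && echo "constant field found"`. A field that is legitimately constant would also be reported, so this is only a first filter.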

Expected behavior

Produce valid grib2 files from the wave model using the modules from hpc-stack.

System: Hera and Orion


JessicaMeixner-NOAA commented 3 years ago

So using hpc/1.0.0 on Orion I can confirm that the "empty gribs" are coming from using jasper/2.0.15 as that gives me empty grib files and jasper/1.900.1 gives you non-empty grib files.

In both versions of the module I get this warning:

/apps/jasper-1.900.1/lib/libjasper.a(jas_stream.o): In function `jas_stream_tmpfile':
/tmp/jasper-1.900.1/src/libjasper/base/jas_stream.c:368: warning: the use of `tmpnam' is dangerous, better use `mkstemp'

and we are still getting this warning in the new version of jasper:

/apps/contrib/NCEP/libs/intel-2018.4/jasper/2.0.15/lib64/libjasper.a(jas_stream.c.o): In function `jas_stream_tmpfile':
jas_stream.c:(.text+0x994): warning: the use of `tmpnam' is dangerous, better use `mkstemp'

The test is a slightly modified version from Roberto on Orion. You can find the tarball at: /work/noaa/marine/jmeixner/WW3GRIBISSUE/WW3_hpc-stack_test.tar.gz

After you un-tar it, go to the Run_test directory and then you can either run Jasper 1.9 version with: build_run_WW3_hpc-stack.jasper1.9.sh or Jasper 2.0 with build_run_WW3_hpc-stack.jasper2.sh. This builds and runs the model, just like the test Roberto gave; there are more prints from the WW3 build. There is not just a simple "VERBOSE=Y" option and we still might not be getting the information we want.

I'm happy to get more information to be printed out or if anyone has suggestions for things to debug. I do at least think we can narrow this down to being a jasper issue, but other than that I'm at a loss.

kgerheiser commented 3 years ago

The warning about tmpnam is gone in a newer version of Jasper, but that doesn't relate to your problem.

What is called to create the grib file? Is it some library and then that calls Jasper? I think it would be best to focus on the actual call to Jasper and the call where an empty grib file comes out.

JessicaMeixner-NOAA commented 3 years ago

There is a WW3 program, ww3_grib, that calls Jasper and other libraries to create the grib file. I'm not familiar enough with grib or jasper to know which are the actual calls to Jasper... but I can start putting print statements in ww3_grib to see where the "empty grib" message is coming from. I tried to look on jasper's website for what changed between 1.900.1 and 2+, but only found information on the changes between the different 2+ versions.

JessicaMeixner-NOAA commented 3 years ago

@kgerheiser the ww3_grib program is here: WW3/model/ftn/ww3_grib.ftn. It gets preprocessed, and the processed version of the file, which exists after you compile, would be at WW3/model/tmp/ww3_grib.F90. Just remember that if you want to change code you need to change the .ftn file, not the .F90 file. I'll let everyone know when I have some print statements and have narrowed down the call to jasper or whatever is giving us the empty grib files.

arunchawla-NOAA commented 3 years ago

The grib conversion is done in a routine called w3exgb in ww3_grib. A large part of that code is spent creating ID numbers for individual fields and converting them from the internal representation of arrays in WW3 (a single column of sea points) to a 2D field. The real GRIB output is done from line 1445 onwards.

Do not worry about the !/NCEP lines; the only ones to focus on are the !/NCEP2 lines, which are invoked for grib 2 output.

Programs that are called are gribcreate, addfield, gribend and write

arunchawla-NOAA commented 3 years ago

that should be wryte. MAC is trying to correct my English. Which library these functions are in we will have to hunt down. One of them is the problem child

kgerheiser commented 3 years ago

Ok, so those come from NCEPLIBS-g2, which would make sense because it uses Jasper.

Seems like the actual problem is NCEPLIBS-g2.

JessicaMeixner-NOAA commented 3 years ago

So I narrowed down where the empty grib message is coming from:

processed code:

            WRITE(*,*) 'JDM CALL ADDFIELD'
            CALL ADDFIELD (CGRIB,LCGRIB,KPDSNUM,KPDS,200,        &
                           COORDLIST, NUMCOORD, IDRSNUM, IDRS,   &
                           200,X1, NDATA, IBMP, BITMAP, IO)
            WRITE(*,*) 'JDM BEFORE 2nd CALL to GOTO820', IO

Output:

 JDM CALL ADDFIELD
warning: empty layer generated
 JDM BEFORE 2nd CALL to GOTO820           0

I have a slightly simpler test case, which only writes out significant wave height (HS) and includes the extra write statements in the code. It is bundled on orion here: /work/noaa/marine/jmeixner/WW3GRIBISSUE/WW3_hpc-stack_test.2.tar.gz @kgerheiser let me know if there is something I need to do.

JessicaMeixner-NOAA commented 3 years ago

I printed out the values of what is going to the ADDFIELD call, for when we are using jasper 1.9 or 2. The CGRIB is binary, but comparing the other fields they are identical. I can point someone to that output if it would be useful information.
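
Since ADDFIELD returns IO = 0 even in the failing run shown above, the return code alone does not flag the problem; the only visible signal so far is the "warning: empty layer generated" line in the output. A rough sketch of a check for that line in a captured log follows. The function name has_empty_layer_warning is hypothetical, and the log is assumed to be whatever file the ww3_grib output (stdout and stderr) was redirected to.

```shell
# has_empty_layer_warning: succeed (exit 0) if the given log file contains
# the "empty layer generated" warning seen in the failing run.
# Hypothetical helper; $1 is a file holding the captured ww3_grib output.
has_empty_layer_warning() {
    grep -q 'warning: empty layer generated' "$1"
}
```

For example, `has_empty_layer_warning ww3_grib.log && echo "jasper produced an empty layer"`, where ww3_grib.log is an assumed file name.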

kgerheiser commented 3 years ago

I think I have a fix.

Can you replace your Jasper load with:

module use /work/noaa/nems/gkyle/hpc-stack/install/modulefiles
module load jasper/master

It's built from Jasper's master branch, which contains a fix for a bug related to Intel compilers. The fix is in the same file that throws the warning: empty layer generated.

JessicaMeixner-NOAA commented 3 years ago

It appears to have worked!

$ pwd
/work/noaa/marine/jmeixner/WW3GRIBISSUE/WW3_hpc-stack_test/test_jasper2
$ wgrib2 -V gribfile
1:0:vt=2013040100:surface:anl:HTSGW Significant Height of Combined Wind Waves and Swell [m]:
    ndata=231120:undef=88698:mean=2.79682:min=0:max=15.29
    grid_template=0:winds(N/S):
    lat-lon grid:(720 x 321) units 1e-06 input WE:NS output WE:SN res 48
    lat 80.000000 to -80.000000 by 0.500000
    lon 0.000000 to 359.500000 by 0.500000 #points=231120

Thank you @kgerheiser !!!

JessicaMeixner-NOAA commented 3 years ago

I ran another test adding more variables back and plotted results and can confirm everything worked w/the new jasper from the master branch.

RobertoPadilla-NOAA commented 3 years ago

Was the solution tested in Hera and Orion? @JessicaMeixner-NOAA was the solution committed to the feature/coupled-crow branch? If so, was that branch tested or should I run a test on both machines?

JessicaMeixner-NOAA commented 3 years ago

@RobertoPadilla-NOAA it was tested on orion where @kgerheiser made the test module available. It will not be tested on hera or pushed to feature/coupled-crow until the new Jasper module is made available on all the machines through hpc-stack. At this point we're waiting for that to happen and no other action by you is required at this time.

arunchawla-NOAA commented 3 years ago

@kgerheiser Do we have an updated version of the Jasper libraries with hpc-stack that addresses this ?

kgerheiser commented 3 years ago

There is no released version with that fix yet. I took the fix from their main branch from a commit a few days ago.

We could install the master branch until a release version is out.

Or you could use the non hpc-stack version of Jasper.

arunchawla-NOAA commented 3 years ago

@kgerheiser a suggestion was made to use a different compression so we can avoid jasper altogether.

arunchawla-NOAA commented 3 years ago

Jessica will reach out to you to see if we can use a non jasper compression algorithm in the library call

arunchawla-NOAA commented 3 years ago

@kgerheiser did we make any progress in removing jasper from the g2 library ? Did that work ?

kgerheiser commented 3 years ago

No, I haven't done that yet. I will see if I can do that today.

WalterKolczynski-NOAA commented 3 years ago

If Jasper can't be removed quickly, I'm hoping we can get a point release of hpc-stack with the Jasper fix so we can move forward in the meantime.

kgerheiser commented 3 years ago

Looks like there's a new release of Jasper, 2.0.25, with the fix. I think we should update to that immediately, and we'll continue to look at phasing out Jasper.
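
To check whether a given 2.x Jasper version is new enough to include the fix, a version comparison against 2.0.25 is enough. The sketch below uses a hypothetical function name, version_ge, and assumes `sort -V` (version sort, as in GNU coreutils) is available. It only compares version strings: jasper/1.900.1, which also produced valid files in this thread, would not pass it.

```shell
# version_ge: succeed (exit 0) if version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
    # The smaller of the two versions sorts first; $1 >= $2 exactly when
    # that smallest entry is $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_ge 2.0.15 2.0.25 || echo "jasper is older than 2.0.25"`.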

kgerheiser commented 3 years ago

@JessicaMeixner-NOAA or @WalterKolczynski-NOAA would you try out my nightly build of hpc-stack (develop)? I just want to make sure that the fix works before we install it everywhere.

Hera: /scratch1/NCEPDEV/stmp2/Kyle.Gerheiser/hpc-stack/nightly-develop/install/modulefiles/stack

Orion: /work/noaa/stmp/gkyle/stmp/gkyle/hpc-stack/nightly-develop/install/modulefiles/stack

I have it built on Hera and Orion and it contains Jasper 2.0.25.

WalterKolczynski-NOAA commented 3 years ago

@kgerheiser What about WCOSS Dell?

kgerheiser commented 3 years ago

I don't have a test build on there at the moment. I can do one if you like. I have a cron job set to build and test hpc-stack, but cron doesn't work on WCOSS Dell.

WalterKolczynski-NOAA commented 3 years ago

On Hera:

kgerheiser commented 3 years ago

The ESMF thing doesn't matter.

That's a good catch. We recently fixed that in the code so it wasn't hardcoded, but wgrib2 was missed. I have fixed it in the existing build.

WalterKolczynski-NOAA commented 3 years ago

> I don't have a test build on there at the moment. I can do one if you like. I have a cron job set to build and test hpc-stack, but cron doesn't work on WCOSS Dell.

I've never had a problem with cron on WCOSS Dell. Are you using the mycrontab file?

kgerheiser commented 3 years ago

No, how do I do that?

WalterKolczynski-NOAA commented 3 years ago

> No, how do I do that?

In your home directory, there should be a cron directory with a file named mycrontab inside. Works just like editing a normal crontab, except it will automatically be turned on/off when production switches (and you don't have to play 'which login node did I put the cron job on?').

JessicaMeixner-NOAA commented 3 years ago

@WalterKolczynski-NOAA do you have the testing done? I could run my quick test set-up for this case on orion if that would help. I'm just switching out the jasper or did you want me to use the whole hpc-stack from the nightly build?

kgerheiser commented 3 years ago

Just use the whole hpc-stack. Everything should work.

WalterKolczynski-NOAA commented 3 years ago

On Hera, a bunch more wrong envvar libs:

WalterKolczynski-NOAA commented 3 years ago

> @WalterKolczynski-NOAA do you have the testing done? I could run my quick test set-up for this case on orion if that would help. I'm just switching out the jasper or did you want me to use the whole hpc-stack from the nightly build?

I'm trying to get everything built and setup now

kgerheiser commented 3 years ago

Yep, just realized that would happen. Sorry about that. I fixed them.

WalterKolczynski-NOAA commented 3 years ago

It looks like Orion has the same lib variable issues.

JessicaMeixner-NOAA commented 3 years ago

@kgerheiser in my test on orion, the following two variables, which I use when building the model:

G2_LIB4=/work/noaa/stmp/gkyle/stmp/gkyle/hpc-stack/nightly-develop/install/intel-2018.4/g2/3.4.1/lib64/libg2_4.a
W3NCO_LIB4=/work/noaa/stmp/gkyle/stmp/gkyle/hpc-stack/nightly-develop/install/intel-2018.4/w3nco/2.4.1/lib64/libw3nco_4.a

point to files that don't actually exist.
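
A quick way to find which library variables point at missing files, after loading the modules, is to test each path directly. This is a minimal sketch with a hypothetical function name, missing_libs; it takes variable names (not values) as arguments.

```shell
# missing_libs: for each variable NAME given, print "NAME=value (missing)"
# if its value is not the path of an existing regular file.
# Returns non-zero when at least one library file is missing.
missing_libs() {
    rc=0
    for name in "$@"; do
        # Indirect expansion: fetch the value of the variable called $name.
        eval "lib_path=\${$name:-}"
        if [ ! -f "$lib_path" ]; then
            printf '%s=%s (missing)\n' "$name" "$lib_path"
            rc=1
        fi
    done
    return "$rc"
}
```

For the two variables above, that would be `missing_libs G2_LIB4 W3NCO_LIB4`.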

The modules I used:

module load contrib noaatools
module load cmake/3.17.3
module use /work/noaa/stmp/gkyle/stmp/gkyle/hpc-stack/nightly-develop/install/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/2018.4
module load hpc-impi/2018.4
module load jasper/2.0.25
module load zlib/1.2.11
module load png/1.6.35
module load hdf5/1.10.6
module load netcdf/4.7.4
module load esmf/8_1_0_beta_snapshot_27
module load bacio/2.4.1
module load crtm/2.3.0
module load g2/3.4.1
module load g2tmpl/1.9.1
module load ip/3.3.3
module load nceppost/dceca26
module load sp/2.3.3
module load w3emc/2.7.3
module load w3nco/2.4.1

kgerheiser commented 3 years ago

I believe I have fixed all the modules in both of the builds. I also put in a PR #163 to fix it.

WalterKolczynski-NOAA commented 3 years ago

@JessicaMeixner-NOAA looks like the pio version has to be updated to 2.5.2 as well

kgerheiser commented 3 years ago

PIO 2.5.1 will also be there, but 2.5.2 is now the version we're moving to. Feel free to remain on 2.5.1 for now.

WalterKolczynski-NOAA commented 3 years ago

It isn't available in the nightly build, which makes it difficult to test without changing.

JessicaMeixner-NOAA commented 3 years ago

> It isn't available in the nightly build, which makes it difficult to test without changing.

I don't need pio to test, but I do need the libraries to exist/link to, to be able to test ww3_grib.

WalterKolczynski-NOAA commented 3 years ago

I needed it to build the model.

WalterKolczynski-NOAA commented 3 years ago

I've successfully built on both Hera and Orion using the nightly build.

JessicaMeixner-NOAA commented 3 years ago

@WalterKolczynski-NOAA what do you use for G2_LIB4 and W3NCO_LIB4 ?

WalterKolczynski-NOAA commented 3 years ago

I didn't make any changes to the model except the jasper and pio versions. The build log says:

G2_LIB4=/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/g2/3.4.1/lib/libg2_4.a
W3NCO_LIB4=/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/w3nco/2.4.1/lib/libw3nco_4.a
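
These paths sit under /apps/contrib/NCEP/libs/hpc-stack, while the Orion nightly build mentioned earlier in the thread is installed under /work/noaa/stmp/gkyle/stmp/gkyle/hpc-stack/nightly-develop/install. Comparing a library path against the prefix of the stack you meant to load is one way to see which installation a build picked up. The following is only a sketch, with a hypothetical function name, under_prefix.

```shell
# under_prefix: succeed (exit 0) if path $1 lies under directory prefix $2.
# Hypothetical helper; a pure string comparison with no filesystem access.
under_prefix() {
    case "$1" in
        "${2%/}"/*) return 0 ;;   # prefix (minus any trailing slash) + "/..."
        *)          return 1 ;;
    esac
}
```

For example, `under_prefix "$G2_LIB4" /apps/contrib/NCEP/libs/hpc-stack && echo "G2_LIB4 comes from the installed stack"`.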

JessicaMeixner-NOAA commented 3 years ago

I ran a test on orion and it worked for my test case. @kgerheiser sorry it took a while.