NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

PGBANL file not created on Hera using develop (as of Nov 17 2020) #201

Closed CatherineThomas-NOAA closed 3 years ago

CatherineThomas-NOAA commented 3 years ago

When running a low resolution (C384/C192/L127) parallel with the v16 configuration on Hera (develop as of Nov 17, 2020), it was found that the pgbanl files are all of zero size, but the pgb files for other forecast times are fine:

/scratch2/NCEPDEV/stmp1/Catherine.Thomas/ROTDIRS/dropout_metopc_rerun/gdas.20200603/12/atmos
-rw-r--r-- 1 Catherine.Thomas stmp 0 Nov 28 23:26 gdas.t12z.pgrb2.0p25.anl
-rw-r--r-- 1 Catherine.Thomas stmp 0 Nov 28 23:26 gdas.t12z.pgrb2.0p25.anl.idx
-rw-r--r-- 1 Catherine.Thomas stmp 418492278 Nov 29 05:44 gdas.t12z.pgrb2.0p25.f000
-rw-r--r-- 1 Catherine.Thomas stmp 31796 Nov 29 05:44 gdas.t12z.pgrb2.0p25.f000.idx
-rw-r--r-- 1 Catherine.Thomas stmp 447243373 Nov 29 05:45 gdas.t12z.pgrb2.0p25.f001
-rw-r--r-- 1 Catherine.Thomas stmp 40448 Nov 29 05:45 gdas.t12z.pgrb2.0p25.f001.idx

In gdaspost000.log, these lines appear:

0.366 + srun '--export=ALL' /scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/develop.20201117/exec/gfs_ncep_post
0.369 + 0< itag 1> outpost_gfs_2020060312_postcntrl_gfs_anl.xml
/scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/develop.20201117/exec/gfs_ncep_post: error while loading shared libraries: libnetcdf.so.15: cannot open shared object file: No such file or directory
*************************************************************
** FATAL ERROR: Job post.159550 failed RETURN CODE 2
** ABNORMAL EXIT at Sat Nov 28 23:26:13 EST 2020 on h8c53
*************************************************************

The job does not fail, however. It passes successfully and cycles on. A sample log file is here: /scratch2/NCEPDEV/stmp1/Catherine.Thomas/ROTDIRS/dropout_metopc_rerun/logs/2020060312/gdaspost000.log
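One way to keep a job like this from passing silently would be to test the expected output files before the script exits. The following is only a minimal sketch: `require_nonempty` is a hypothetical helper name, not something that exists in global-workflow, and where such a check would belong in the post scripts is left open.

```shell
# Hypothetical helper (not part of global-workflow): return non-zero if any
# of the listed files is missing or has zero size.
require_nonempty() {
    rn_rc=0
    for rn_file in "$@"; do
        # "-s" is true only when the file exists and is larger than zero bytes
        if [ ! -s "$rn_file" ]; then
            echo "FATAL: $rn_file is missing or empty" >&2
            rn_rc=1
        fi
    done
    return "$rn_rc"
}
```

A call such as `require_nonempty gdas.t12z.pgrb2.0p25.anl gdas.t12z.pgrb2.0p25.anl.idx || exit 1` would then turn the zero-size files shown above into a job failure.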

RussTreadon-NOAA commented 3 years ago

The post000 job executes the UPP executable ncep_post. Other post jobs do not execute ncep_post. The model runs the inline post for forecast model output.

The workflow loads module_base.hera when it executes post000. This module loads

module use /scratch1/NCEPDEV/nems/emc.nemspara/soft/modulefiles/
module load netcdf_parallel/4.7.4.release
module load hdf5_parallel/1.10.6.release

Module v8.0.0-hera is loaded when the UPP executable is built. This module loads

module use -a /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles
module load hdf5_parallel/1.10.6
module load netcdf_parallel/4.7.4

Differences in the modules used to build and run the UPP executable may be the reason for the error message

0.358 + srun '--export=ALL' /scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/develop.20201117/exec/gfs_ncep_post
0.362 + 0< itag 1> outpost_gfs_2020060406_postcntrl_gfs_anl.xml
/scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/develop.20201117/exec/gfs_ncep_post: error while loading shared libraries: libnetcdf.so.15: cannot open shared object file: No such file or directory

RussTreadon-NOAA commented 3 years ago

Indeed, /scratch1/NCEPDEV/nems/emc.nemspara/soft/netcdf_parallel_release/lib/ does not contain libnetcdf.so.15. It only contains

$ ls -l /scratch1/NCEPDEV/nems/emc.nemspara/soft/netcdf_parallel_release/lib/libnetcdf.so*
lrwxrwxrwx 1 emc.nemspara nems 19 Jul 1 20:15 libnetcdf.so -> libnetcdf.so.18.0.0
lrwxrwxrwx 1 emc.nemspara nems 19 Jul 1 20:15 libnetcdf.so.18 -> libnetcdf.so.18.0.0
-rwxr-xr-x 1 emc.nemspara nems 1431320 Jul 1 20:15 libnetcdf.so.18.0.0

In contrast, /scratch2/NCEPDEV/nwprod/NCEPLIBS/src/netcdf_parallel2/lib/ contains

$ ls -l /scratch2/NCEPDEV/nwprod/NCEPLIBS/src/netcdf_parallel2/lib/libnetcdf.so*
lrwxrwxrwx 1 Hang.Lei nwprod 19 May 8 2020 libnetcdf.so -> libnetcdf.so.15.2.1
lrwxrwxrwx 1 Hang.Lei nwprod 19 May 8 2020 libnetcdf.so.15 -> libnetcdf.so.15.2.1
-rwxr-xr-x 1 Hang.Lei nwprod 1422400 May 8 2020 libnetcdf.so.15.2.1
lrwxrwxrwx 1 Hang.Lei nwprod 19 May 8 2020 libnetcdf.so.18 -> libnetcdf.so.18.0.0
-rwxr-xr-x 1 Hang.Lei nwprod 1431320 May 8 2020 libnetcdf.so.18.0.0
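A mismatch like this can be seen without submitting a job by running `ldd` on the executable with the runtime modules loaded. The sketch below filters `ldd` output for unresolved libraries; `missing_libs` is a hypothetical helper name, and the example path assumes `$HOMEgfs` points at the workflow checkout.

```shell
# Read "ldd" output on stdin and print the name of every shared library the
# dynamic loader could not resolve (lines of the form "libX => not found").
missing_libs() {
    awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }'
}
```

For example, `ldd "$HOMEgfs/exec/gfs_ncep_post" | missing_libs` would be expected to print `libnetcdf.so.15` in the environment described in this issue, and nothing once the build-time netcdf module is loaded.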

CatherineThomas-NOAA commented 3 years ago

Thanks, Russ. I am rerunning a gdaspost000 job with:

module use -a /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles
module load hdf5_parallel/1.10.6
module load netcdf_parallel/4.7.4

added to my module_base.hera file to see if the post analysis file is successfully created. I'm not necessarily advocating for this change to be added in this manner since I do not know what the impact on other jobs will be. I want to confirm that this is the only issue with the creation of these files first.

RussTreadon-NOAA commented 3 years ago

Did the gdaspost000 job successfully create a pgbanl file when running the job with the modules used to build the UPP executable? I agree that this is not a solution. It's simply a test. It's my understanding that we are moving to hpc-stack to build and run workflow executables. This will ensure consistency between compilation and execution.

CatherineThomas-NOAA commented 3 years ago

There is a problem with the module loading when switching out netcdf. The version of esmf that is loaded is not compatible with the added version of netcdf:

Lmod has detected the following error: Cannot load module "esmf/8.0.1_ParallelNetCDF.release". At least one of these module(s) must be loaded: netcdf_parallel/4.7.4.release

How close are we to using hpc-stack? Is it feasible to wait until then or do we need a short term fix?

RussTreadon-NOAA commented 3 years ago

ESMF is not used to build the UPP executable. You could try unloading esmf prior to loading the UPP netcdf module. Yes, this is very kludgy ... but we can't run vrfy as we normally do without the pgbanl file.

CatherineThomas-NOAA commented 3 years ago

I restored the original version of module_base.hera and instead made the module changes directly in post.sh:

module unload esmf/8.0.1_ParallelNetCDF.release
module unload netcdf_parallel/4.7.4.release
module unload hdf5_parallel/1.10.6.release
module load hdf5_parallel/1.10.6
module load netcdf_parallel/4.7.4

I am rerunning the gdaspost000 job. I expect it will not run until tomorrow since Hera has been exceedingly slow for the past two weeks (the last test sat in the queue for 14 hours).

RussTreadon-NOAA commented 3 years ago

I copied your EXPDIR, HOMEgfs, and ROTDIRS to my directories. I added

module use -a /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles

to $HOMEgfs/jobs/rocoto/post.sh prior to loading hdf5_parallel/1.10.6 and netcdf_parallel/4.7.4. The queue was changed to debug and the post wall time reduced to 30 minutes in the xml file. The 2020060212 gdaspost000 job was submitted and ran to completion. Non-zero length pgrbanl files were created in /scratch2/NCEPDEV/stmp1/Russ.Treadon/ROTDIRS/dropout_wavetest/gdas.20200602/12/atmos/.

Again, this isn't a long term solution. It's just a test.

I don't think we need 96 PEs to run the UPP for C384 atmanl files. Of course, these initial tests may simply be with the workflow "as is". Adjusting job configurations for C384/C192 may be the next step after everything is running OK.

CatherineThomas-NOAA commented 3 years ago

Thanks, Russ. I've added that extra line to my post.sh.

I am rerunning the other post jobs overnight just to make sure that there aren't any unintended consequences for the non-analysis files. I have saved the original files for comparison.

CatherineThomas-NOAA commented 3 years ago

The other post jobs for both the gdas and gfs cycles reproduce the original pgrb files with the other modules loaded.

I will include this code change in our common build for the GFSv16.1 tests on Hera.

RussTreadon-NOAA commented 3 years ago

I understand the necessity of this hack but maintaining a separate workflow for DA GFSv16.1 parallels on Hera is not desirable.

CatherineThomas-NOAA commented 3 years ago

Here are some options that I can think of:

  1. Maintain local code changes while we wait for hpc-stack. Do you know how far out these changes are from being used in the global-workflow? I have no sense of it. If it's on the order of months, it's likely too long to wait.
  2. Commit the code changes under a Hera-only IF block. We will need to remember to remove this later.
  3. Have libnetcdf.so.15 added to /scratch1/NCEPDEV/nems/emc.nemspara/soft/netcdf_parallel_release/lib. I have no idea how difficult this would be or if this is desired or not.
  4. Update the UPP to be compiled and run with the /scratch1/NCEPDEV/nems/emc.nemspara/soft/modulefiles/ modules. With the hpc-stack change coming, this is probably not a good option.

I honestly do not like any of these options, but if hpc-stack is not close to ready, then option 2 is probably the way to go. Thoughts? Any other options I missed?
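Option 2 could look roughly like the sketch below. It assumes the workflow configuration sets a `machine` variable to `HERA` on that platform, and it wraps the swap in a hypothetical function name, `swap_upp_netcdf_modules`; the module names and path are the ones quoted earlier in this thread.

```shell
# Sketch of option 2: a Hera-only guard around the module swap.
# Assumption: "machine" is set by the workflow configuration.
swap_upp_netcdf_modules() {
    if [ "${machine:-}" != "HERA" ]; then
        return 0   # not Hera: leave the module environment untouched
    fi
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not available; cannot swap modules" >&2
        return 1
    fi
    # Drop the runtime stack, then load the modules the UPP was built with
    module unload esmf/8.0.1_ParallelNetCDF.release
    module unload netcdf_parallel/4.7.4.release
    module unload hdf5_parallel/1.10.6.release
    module use -a /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles
    module load hdf5_parallel/1.10.6
    module load netcdf_parallel/4.7.4
}
```

Keeping the guard and the swap in one place would also make the hack easy to find and delete once hpc-stack is in use.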

RussTreadon-NOAA commented 3 years ago

global-workflow issue #164 is tracking the workflow transition to hpc-stack. A target completion date is not indicated. It's not clear that hpc-stack will solve the problem documented in this issue. It should, but this needs to be demonstrated.

NOAA-EMC/GSI issue #79 is tracking DA updates to use hpc-stack on Hera. We can ask Mike for an update.

If GFS v16.1 testing needs to begin in the very near future, option 2 is the best approach. This approach commits the hack to the repo which, in turn, allows all developers to reference a common version controlled workflow.

Is the PGBANL problem the only problem with regards to running v16 based parallels on Hera?

CatherineThomas-NOAA commented 3 years ago

> Is the PGBANL problem the only problem with regards to running v16 based parallels on Hera?

There was another issue with GLDAS that has likely been resolved, but it's still being tested. That fix is less hack-y and will require a commit to the workflow.

The biggest issue right now with Hera is throughput. We are only getting a cycle a day for the dropout runs. I'm starting to explore the viability of using Orion instead for the v16.1 runs. It would be good if we can start these experiments within the next month but we have some flexibility.

@KateFriedman-NOAA What do you think about adding this hack to post.sh (in develop) to accommodate the pgrbanl issue listed above:

module unload esmf/8.0.1_ParallelNetCDF.release
module unload netcdf_parallel/4.7.4.release
module unload hdf5_parallel/1.10.6.release
module use -a /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles
module load hdf5_parallel/1.10.6
module load netcdf_parallel/4.7.4

If it sounds good to you, I will submit a PR for this change.

KateFriedman-NOAA commented 3 years ago

@CatherineThomas-NOAA Since post.sh is used on all machines let's move those module commands into env/HERA.env? There is a "post" section where they can go. That should limit the hack to just Hera. Please run a few post jobs with them in the env file to confirm that works. Did the non-post000 post jobs run ok with those modules switched?

Assuming it works with the HERA.env and the other post jobs also work then please submit a hotfix PR to develop. I'll get a ping when it's submitted. I'm otherwise unavailable today though. Thanks! :)

Also, there are efforts to move to the stack in most components and someone is about to do the workflow side for issue #164. I should be back Monday and can check on the progress. The GLDAS hotfix needs to get into develop as well.

CatherineThomas-NOAA commented 3 years ago

@KateFriedman-NOAA I mentioned this earlier in the thread but should have included in the code snippet itself that it would be in a Hera IF block so it wouldn't get triggered on other machines. But HERA.env is likely cleaner.

I tried adding those lines to HERA.env but it didn't work:

0.024 + module unload esmf/8.0.1_ParallelNetCDF.release
/scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/develop.20201117_test/jobs/JGLOBAL_ATMOS_NCEPPOST[30]: module: not found [No such file or directory]

I added this line above the module lines in HERA.env, which allowed the module commands to be used:

source /apps/lmod/lmod/init/ksh

I'm not sure if this is the right way of going about it, but it had the intended effect. I tried the analysis post job, another gdas forecast post, and a gfs forecast post. All 3 jobs produced post files that matched with the originals, except for the gdas.tHHz.master.grb2ifFFF files. I believe that these are the index files for the master grib files and they have a header in them with a timestamp, so this may not be an issue, but I would like some confirmation on that. The master grib files and other lower resolution files are bit identical.

Let me know if my modification is good practice or not. If so, I'll submit a hotfix PR.
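A comparison like the one described above can be scripted with `cmp`. This is a minimal sketch, not the procedure actually used here: `list_differing_files` is a hypothetical helper, and the directory names and skip pattern in the usage line are placeholders.

```shell
# Hypothetical helper: compare every regular file in directory $1 against the
# file of the same name in directory $2 and print the names that differ or
# are missing. Names matching the optional glob in $3 are skipped.
list_differing_files() {
    ldf_orig=$1
    ldf_new=$2
    ldf_skip=${3:-}
    for ldf_path in "$ldf_orig"/*; do
        [ -f "$ldf_path" ] || continue
        ldf_name=${ldf_path##*/}
        if [ -n "$ldf_skip" ]; then
            case $ldf_name in
                $ldf_skip) continue ;;   # e.g. index files with timestamps
            esac
        fi
        # cmp -s prints nothing and returns non-zero on any byte difference
        if ! cmp -s "$ldf_path" "$ldf_new/$ldf_name" 2>/dev/null; then
            echo "$ldf_name"
        fi
    done
}
```

For instance, `list_differing_files "$orig_dir" "$new_dir" '*.master.grb2if*'` would list only the files that are not bit identical, leaving out the index files whose headers may carry a timestamp.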

RussTreadon-NOAA commented 3 years ago

The following has been done on Hera

The GSI run failed because the Hera hpc-stack does not define CRTM_FIX. Mike will follow up with the library team to have CRTM_FIX added to hpc-stack.

Reran the stand-alone rungsi script with CRTM_FIX manually defined. global_gsi.x ran to completion and reproduced the output from a run of the same script using NOAA-EMC/GSI master and non-hpc-stack modules.

KateFriedman-NOAA commented 3 years ago

> I added this line above the module lines in HERA.env, which allowed the module commands to be used:
> source /apps/lmod/lmod/init/ksh

That jibes with something similar I had to do with a prep job hack this year, so thanks for testing and confirming that.

So at runtime the following is sourced to define modules:

source "$HOMEgfs/modulefiles/module-setup.sh.inc"

That determines which shell is being used and then runs that same source as you listed but with the appropriate shell (if not ksh):

source /apps/lmod/lmod/init/$__ms_shell

...where "$__ms_shell" is either ksh, bash, or sh.

So this may also work in env/HERA.env and preserve the shell flexibility:

source "$HOMEgfs/modulefiles/module-setup.sh.inc"
module unload esmf/8.0.1_ParallelNetCDF.release
module unload netcdf_parallel/4.7.4.release
module unload hdf5_parallel/1.10.6.release
module use -a /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles
module load hdf5_parallel/1.10.6
module load netcdf_parallel/4.7.4

Would you mind testing that for me? If that works then you can submit the hotfix PR. Thanks!

CatherineThomas-NOAA commented 3 years ago

@KateFriedman-NOAA I added those lines to env/HERA.env and it did not work:

/scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/develop.20201117_test/jobs/JGLOBAL_ATMOS_NCEPPOST[55]: setpdy.sh: not found [No such file or directory]

I did some digging and found that "module-setup.sh.inc" contains a "module purge" in it, wiping out the module_base.hera modules. When I comment out this module purge, everything works, but this is obviously undesirable.

I tried to add: source /apps/lmod/lmod/init/$__ms_shell

directly to the JGLOBAL script, but the $__ms_shell variable is unavailable within JGLOBAL_ATMOS_NCEPPOST. I found that module-setup.sh.inc uses "unset __ms_shell" at the end. I commented that out, but it still did not work. After I also added an export line to post.sh, it worked. However, this again requires modifying module-setup.sh.inc in an undesirable way.

Any suggestions on how to move forward?
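One direction that avoids both the "module purge" and the unset $__ms_shell is to initialize Lmod only when the `module` command is not already defined. This is an untested sketch: `ensure_module_command` is a hypothetical helper, and the init script path passed to it is an assumption taken from the line quoted above.

```shell
# Hypothetical helper: make the "module" command available without running
# "module purge". $1 is the Lmod init script for the current shell.
ensure_module_command() {
    if command -v module >/dev/null 2>&1; then
        return 0   # already defined: keep the loaded modules as they are
    fi
    if [ ! -r "$1" ]; then
        echo "Lmod init script not readable: $1" >&2
        return 1
    fi
    . "$1"
    # Succeed only if sourcing the script actually defined "module"
    command -v module >/dev/null 2>&1
}
```

In a ksh script on Hera this would be called as `ensure_module_command /apps/lmod/lmod/init/ksh` ahead of the module unload/load lines; the shell name is hard-coded in that call, so the flexibility of module-setup.sh.inc is not preserved.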

KateFriedman-NOAA commented 3 years ago

@CatherineThomas-NOAA Gotcha, thanks for testing that! I am setting up a hotfix branch today to merge a few things together for commit so I'll test this on Hera some more to find a decent solution.

KateFriedman-NOAA commented 3 years ago

Created branch "hotfixes" for this work and issues #202 and #208.