NOAA-EMC / gfs-utils

Utility programs for global-workflow

mkgfsawps.x hangs on Hera with spack-stack-installed w3emc/2.10.0 #33

Closed DavidHuber-NOAA closed 7 months ago

DavidHuber-NOAA commented 7 months ago

The mkgfsawps.x application hangs on Hera in the w3emc/2.10.0 subroutine getgb in any of the gfsawips_f* global-workflow jobs. This is the default version of w3emc installed with spack-stack/1.5.1. The hang occurs at the calculation at w3fi63.f:3116.

To replicate, clone, check out, and build the global workflow:

git clone git@github.com:DavidHuber-NOAA/global-workflow -b feature/spack-stack
cd global-workflow/sorc
./checkout.sh -g
./build_all.sh

Start a new low-res experiment:

cd ..
module use modulefiles
module load module_gwsetup.hera
cd workflow
./setup_expt.py gfs cycled --pslot ss_151 --resdet 96 --resens 48 --nens 2 --comrot <comrot_dir> --expdir <expdir> --icsdir /scratch1/NCEPDEV/nems/David.Huber/noscrub/global_ICs/96/2022110818 --idate 2022110818 --edate 2022110900

Edit the resulting <expdir>/ss_151/config.base and set DO_AWIPS="YES". To lower the cost of the experiment, also set DOHYBVAR="NO", l4densvar=".false.", and FHMAX_GFS_*=06.

Then run ./setup_xml.py <expdir>/ss_151.
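
The config.base edits above can also be scripted. This is only a rough sketch: it assumes each setting sits on its own line in the form `export KEY=VALUE`, that GNU sed is available, and that nothing after the `=` on those lines needs to be kept. The function names set_cfg and apply_awips_test_settings are made up for illustration and are not part of global-workflow.

```shell
# Sketch: overwrite the value of an `export KEY=...` line in a config file.
# Assumption: one setting per line, written as `export KEY=VALUE`.
set_cfg() {
  local file=$1 key=$2 value=$3
  # Keep the `export KEY` prefix and replace everything after the `=`.
  sed -i -E "s|^([[:space:]]*export[[:space:]]+${key})=.*|\1=${value}|" "$file"
}

# Apply the low-cost AWIPS test settings listed above.
apply_awips_test_settings() {
  local cfg=$1
  set_cfg "$cfg" DO_AWIPS '"YES"'
  set_cfg "$cfg" DOHYBVAR '"NO"'
  set_cfg "$cfg" l4densvar '".false."'
  # The key is a regex here so that every FHMAX_GFS_* variant is updated.
  set_cfg "$cfg" 'FHMAX_GFS_[A-Za-z0-9]+' '06'
}
```

It would be called as apply_awips_test_settings <expdir>/ss_151/config.base before running setup_xml.py. Check the result by hand afterwards, since lines that do not match the assumed layout are silently left unchanged.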

Finally, run the experiment through the 2022110900 cycle awips jobs via rocotorun. The log file for the awips job will be written to $ROTDIR/logs/2022110900/gfsawips_f006-f006.log and the output of mkgfsawps.x will be written to $DATAROOT/awips.<jobid1>/awips_g1/OUTFILE.<jobid2> (DATAROOT and ROTDIR are defined in config.base).

FYI @aerorahul

aerorahul commented 7 months ago

@DavidHuber-NOAA If you can save the run directory and have a case for someone to run just the hanging executable, it might help debug this issue sooner.

aerorahul commented 7 months ago

@GwenChen-NOAA The global-workflow is updating the utilities to the latest available versions via spack-stack. These will eventually (definitely before implementation) be updated on WCOSS2. @DavidHuber-NOAA is finding that this utility hangs with these updates. Any help is appreciated. This is currently a blocker for the spack-stack migration (unless we disable the product generation aspects of the system for development purposes).

DavidHuber-NOAA commented 7 months ago

@aerorahul Alright, there are some environment variables that need to be set, so I will write a run script.

aerorahul commented 7 months ago

Tagging @edwardhartnett for insights on w3emc.

DavidHuber-NOAA commented 7 months ago

I've created a sample case in /scratch1/NCEPDEV/global/David.Huber/noscrub/awips. To run it, execute the script ./run_mkgfsawps. Currently, it points to my mkgfsawps.x in /scratch1/NCEPDEV/nems/David.Huber/GW/gw_spack-stack_151/sorc/gfs_utils.fd/install/mkgfsawps.x, but this can be changed by setting the awps_exec variable within run_mkgfsawps.

To build gfs_utils:

git clone git@github.com:DavidHuber-NOAA/gfs-utils -b feature/spack-stack
cd gfs-utils/ush
./build.sh

If you want to build your own w3emc library, you will need to build it with 8-byte integer support with the -DBUILD_8=ON cmake flag. Once built, modify the modulefile gfs-utils/modulefiles/gfsutils_common.lua to point to it by commenting out the lines

load(pathJoin("w3emc", w3emc_ver))
load(pathJoin("nemsio", nemsio_ver))

and adding the following environment variable declarations:

--NEMSIO (needed because the spack-stack nemsio module loads the spack-stack w3emc module)
prepend_path("PATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/bin", ":")
prepend_path("LD_LIBRARY_PATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/lib64", ":")
prepend_path("DYLD_LIBRARY_PATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/lib64", ":")
prepend_path("CPATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/include", ":")
prepend_path("CMAKE_PREFIX_PATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/.", ":")
setenv("nemsio_ROOT", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq")
setenv("NEMSIO_INC", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/include")
setenv("NEMSIO_LIB", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/lib64/libnemsio.a")
setenv("MKGFSNEMSIOCTL", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/mkgfsnemsioctl")
setenv("NEMSIO_CHGDATE", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/bin/nemsio_chgdate")
setenv("NEMSIO_GET", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/bin/nemsio_get")
setenv("NEMSIO_READ", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon/install/intel/2021.5.0/nemsio-2.5.4-xufwjaq/bin/nemsio_read")

--W3EMC
prepend_path("LD_LIBRARY_PATH", "/path/to/w3emc/build/install/lib64", ":")
prepend_path("DYLD_LIBRARY_PATH", "/path/to/w3emc/build/install/lib64", ":")
prepend_path("CMAKE_PREFIX_PATH", "/path/to/w3emc/build/install/.", ":")
setenv("W3EMC_LIB4", "/path/to/w3emc_4.a")
setenv("W3EMC_INC4", "/path/to/w3emc/build/install/include_4")
setenv("W3EMC_LIB8", "/path/to/w3emc_8.a")
setenv("W3EMC_INC8", "/path/to/w3emc/build/install/include_8")
setenv("W3EMC_LIBd", "/path/to/w3emc_d.a")
setenv("W3EMC_INCd", "/path/to/w3emc/build/install/include_d")
setenv("w3emc_ROOT", "/path/to/w3emc/build/install")

AlexanderRichert-NOAA commented 7 months ago

I built w3emc with BUILD_WITH_BUFR=ON; switching off BUILD_WITH_EXTRA_DEPS is trickier because packages like grib_util apparently need it. The build is under /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon-w3emc (everything outside of w3emc and its dependents is the same as 'gsi-addon').

DavidHuber-NOAA commented 7 months ago

@AlexanderRichert-NOAA Thanks for the build. I don't see a Core directory under gsi-addon-w3emc's modulefiles directory, so I ended up just manually writing the paths for w3emc and nemsio into the build/run scripts. It's been running on a node for 5 minutes now and I don't see any progress, so I think it is still hanging. I will let you know if it gets any further.

DavidHuber-NOAA commented 7 months ago

The test timed out after 20 minutes with the only log output occurring in the first second, so I think it is still hanging.
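
For anyone repeating the standalone test, a wall-clock limit saves waiting on a hung run. The following is a minimal sketch that assumes GNU coreutils `timeout` is on the node; the function name run_with_limit is hypothetical, and a timeout on its own only suggests a hang, it does not prove one.

```shell
# Sketch: run a command under a wall-clock limit and flag a suspected hang.
# GNU coreutils `timeout` exits with status 124 when the limit is reached.
run_with_limit() {
  local limit=$1
  shift
  timeout "$limit" "$@"
  local rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "suspected hang: command did not exit within ${limit}" >&2
  fi
  # Pass the original exit status back to the caller.
  return "$rc"
}
```

For the sample case this could be invoked as run_with_limit 20m ./run_mkgfsawps, which matches the 20-minute limit mentioned above.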

AlexanderRichert-NOAA commented 7 months ago

When I run it through GDB, I get a segfault at w3fi63.f:3155. When I enable debugging for w3emc and mkgfsawps.x, it runs through, but gives a bunch of output conversion errors (units -1 and 6), and the output file created is empty. Also, **** (PDS) IN RECORD DOES NOT MATCH (PDS) IN CONTROL CARD and BULLETINS MISSING = -500080763 :)

AlexanderRichert-NOAA commented 7 months ago

Try using the _4 version of w3emc. I think it has to do with our switching ip to _4, and possibly also the fact that makgds() used to live in ip but is now under w3emc.
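
To see which w3emc variant a build currently asks for before switching it, something like the sketch below may help. It assumes the build files reference CMake targets named w3emc::w3emc_4, w3emc::w3emc_8, or w3emc::w3emc_d, and that GNU grep is available; the function name list_w3emc_variants is made up for illustration.

```shell
# Sketch: list the w3emc precision variants referenced in a source tree.
# Assumption: variants appear as CMake targets w3emc::w3emc_4, _8, or _d.
list_w3emc_variants() {
  local src_dir=$1
  # -r recurse, -h drop file names, -o print only the matched target name.
  grep -rhoE --include='CMakeLists.txt' --include='*.cmake' \
    'w3emc::w3emc_(4|8|d)' "$src_dir" | sort -u
}
```

Running list_w3emc_variants gfs-utils from the directory that holds the clone prints each referenced variant once. It only inspects build files, so it says nothing about what an already built executable was linked against.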

DavidHuber-NOAA commented 7 months ago

The standalone worked! Thanks @AlexanderRichert-NOAA, I will try this out in the jobs.

DavidHuber-NOAA commented 7 months ago

All jobs were successful. Thanks again! I will close this now and continue capturing updates to the spack-stack upgrade in #23.