JessicaMeixner-NOAA closed 3 years ago
@jessica, the path to metis/parmetis on orion points to the ones compiled using hpc-stack:
/work/noaa/marine/ali.abdolali/Source/hpc-stack/parmetis-4.0.3/lib
I checked matrix_ncep and it refers to the above-mentioned path. Do you get a failure using it?
They're built with hpc-intel/2019.5, but ufs-weather-model uses hpc-intel/2018.4 (see: https://github.com/ufs-community/ufs-weather-model/blob/develop/modulefiles/ufs_orion.intel#L16-L18), so I am currently switching to that intel version, unless there's a reason we should deviate from that?
@JessicaMeixner-NOAA I just removed the ones built with intel/2019 and recompiled them with the same version of the hpc-stack modules:
module use /apps/contrib/NCEP/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/2018.4
module load hpc-impi/2018.4
The path did not change: /work/noaa/marine/ali.abdolali/Source/hpc-stack/parmetis-4.0.3/lib
Thanks @aliabdolali the PDLIB tests now seem to be passing.
Current issues are:
FYI @ricampos
I can get past the segfaults I was having by adding:
ulimit -s unlimited
Now I have run into https://github.com/NOAA-EMC/WW3/issues/442
Thanks, Jessica. I will leave myself a note to remember to add this line.
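For reference, here is a minimal sketch of how the stack-limit line could be made a little more defensive in a job script. The function name `raise_stack_limit` is hypothetical and not part of WW3 or the regtest scripts; the fallback to the hard limit is an assumption about what one might want when `unlimited` is not permitted on a node.

```shell
#!/bin/bash
# Hypothetical helper (not part of WW3): raise the stack limit before
# launching the model. If "unlimited" is not permitted, fall back to the
# hard limit, which is the largest value an unprivileged user may set.
raise_stack_limit() {
    if ! ulimit -s unlimited 2>/dev/null; then
        ulimit -s "$(ulimit -H -s)"
    fi
}

raise_stack_limit
echo "stack limit is now: $(ulimit -s)"
```

The limit only applies to the current shell and its children, so the call has to come before the model is launched in the same script.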
Okay, at this point I have a branch that runs everything on orion except for the netcdf output with the partitions; those tests still fail.
@aliabdolali @ricampos should I go ahead and make a PR with the updates as of now or wait until we have a fix for the netcdf issues on orion?
@JessicaMeixner-NOAA Thanks, please go ahead and make the PR. If needed, please make an issue associated with this problem.
Hi Jessica, I found the problem on Orion. When ww3_ounf is compiled with netcdf/4.7.4, the program crashes while writing the partitions with the message "NetCDF: Name contains illegal characters", as you saw. It partially writes the file (without partitions) and then stops, but the problematic netcdf file is created. When I recompiled the model with netcdf/4.7.2, ww3_ounf worked nicely. All good. See results at: /work/noaa/marine/ricardo.campos/models/WW3/regtests/ww3_ufs1.3/output
I compared the characters and text of the partition variables with those of the non-partition variables, and I tried to edit w3ounfmetamd, but I didn't manage to make it work with netcdf/4.7.4, only with netcdf/4.7.2.
From now on I will always use module load netcdf/4.7.2 in my jobscripts.
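When chasing a "Name contains illegal characters" error, it can help to screen candidate names before they reach the library. The sketch below is only a rough approximation: `check_nc_name` is a hypothetical helper, the rules it applies (non-empty, starts with a letter, digit or underscore, no `/`, no control characters, no trailing whitespace) are a simplified subset of the netCDF naming rules and an assumption on my part, and the sample names are made up. It also rejects names that start with a non-ASCII character, which the library may accept.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 if a name passes a simplified check,
# 1 otherwise. This is NOT the library's own validation, just a screen.
check_nc_name() {
    local name="$1"
    [[ -n "$name" ]] || return 1                   # empty name
    [[ "$name" =~ ^[A-Za-z0-9_] ]] || return 1     # bad first character
    [[ "$name" != */* ]] || return 1               # '/' is not allowed
    [[ "$name" =~ [[:cntrl:]] ]] && return 1       # control characters
    [[ "$name" =~ [[:space:]]$ ]] && return 1      # trailing whitespace
    return 0
}

# Made-up sample names, for illustration only.
for n in "good_name" "bad/name" "trailing "; do
    if check_nc_name "$n"; then
        echo "ok:      '$n'"
    else
        echo "suspect: '$n'"
    fi
done
```

The names that were written before the crash can be listed from the partial file with `ncdump -h` and compared by eye against the ones that failed.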
There was an issue when running the regtests on hera, I thought I had solved that problem, but I guess not. So no pull request yet for this branch.
@ricampos while it's great that netcdf/4.7.2 solves the problem, that's not an hpc-stack module, which is what we want to use. Let's make a new issue for just the netcdf problem on orion, using the hpc-stack modules instead. If needed, we can create a simple test case to post in an issue on hpc-stack itself.
Understood. But what if this is a netcdf/4.7.4 issue instead of a WW3 issue?
It works with netcdf/4.7.4 on hera. I'll make a new issue -- let's continue this conversation there.
ok
We should be able to run matrix_ncep on orion. The first issue was to change mpirun to srun, which is the command we should be using on hera too for slurm. Still debugging issues on orion that include:
-- I think the parmetis library needs to be rebuilt now that we're using hpc-stack modules on orion (@aliabdolali can you help with this?)
-- the oasis tests fail, see question in issue #440
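As an illustration of the mpirun to srun change, here is a small sketch of how a script could pick the launcher by scheduler. The function `launch_cmd` and its arguments are hypothetical and are not taken from matrix_ncep or the regtest scripts; the only assumption about the tools themselves is that `srun -n` and `mpirun -np` set the number of tasks.

```shell
#!/bin/bash
# Hypothetical helper: print the parallel launch prefix for a scheduler.
#   $1 = scheduler name (e.g. "slurm"), $2 = number of MPI tasks
launch_cmd() {
    local scheduler="$1" ntasks="$2"
    case "$scheduler" in
        slurm) echo "srun -n ${ntasks}" ;;     # slurm systems such as orion and hera
        *)     echo "mpirun -np ${ntasks}" ;;  # generic fallback
    esac
}

# Show the prefix that would be placed in front of an executable.
echo "$(launch_cmd slurm 24) ./my_program"
```

Keeping the choice in one place means the same script could be pointed at a different machine by changing only the scheduler argument.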
This work is being done on: https://github.com/JessicaMeixner-NOAA/WW3/tree/orion
When completed, the hope is to be able to use the hpc-stack modules on orion and run the WW3 regression tests on orion as well as hera.