NOAA-EMC / WW3

WAVEWATCH III

enable matrix_ncep on orion #441

Closed: JessicaMeixner-NOAA closed this issue 3 years ago

JessicaMeixner-NOAA commented 3 years ago

We should be able to run matrix_ncep on orion. The first step was to change mpirun to srun, which is also the command we should be using for Slurm on hera. I'm still debugging issues on orion, including:

- I think the parmetis library needs to be rebuilt now that we're using hpc-stack modules on orion (@aliabdolali can you help with this?)
- the oasis tests fail; see the question in issue #440

This work is being done on: https://github.com/JessicaMeixner-NOAA/WW3/tree/orion

When completed, the hope is to be able to use the hpc-stack modules on orion and run the WW3 regression tests on orion as well as hera.

aliabdolali commented 3 years ago

@jessica, the metis/parmetis path on orion points to the builds compiled with hpc-stack: /work/noaa/marine/ali.abdolali/Source/hpc-stack/parmetis-4.0.3/lib. I checked matrix_ncep and it refers to that path. Do you get a failure when using it?

JessicaMeixner-NOAA commented 3 years ago

They're built with hpc-intel/2019.5, but ufs-weather-model uses hpc-intel/2018.4 (see: https://github.com/ufs-community/ufs-weather-model/blob/develop/modulefiles/ufs_orion.intel#L16-L18), so I'm switching to that Intel version unless there's a reason we should deviate from it.

aliabdolali commented 3 years ago

@JessicaMeixner-NOAA I just removed the builds made with intel/2019 and recompiled them with the same hpc-stack versions:

module use /apps/contrib/NCEP/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/2018.4
module load hpc-impi/2018.4

The path has not changed: /work/noaa/marine/ali.abdolali/Source/hpc-stack/parmetis-4.0.3/lib

JessicaMeixner-NOAA commented 3 years ago

Thanks @aliabdolali the PDLIB tests now seem to be passing.

Current issues are:

JessicaMeixner-NOAA commented 3 years ago

FYI @ricampos

JessicaMeixner-NOAA commented 3 years ago

I can get past the segfaults I was having by adding:

ulimit -s unlimited

Now I have run into https://github.com/NOAA-EMC/WW3/issues/442

ricampos commented 3 years ago

Thanks, Jessica. I will leave a note for me to remember to add this line.

JessicaMeixner-NOAA commented 3 years ago

Okay, at this point I have a branch that runs everything on orion except the netcdf output with partitions; those tests still fail.

JessicaMeixner-NOAA commented 3 years ago

@aliabdolali @ricampos should I go ahead and make a PR with the updates as of now or wait until we have a fix for the netcdf issues on orion?

aliabdolali commented 3 years ago

@JessicaMeixner-NOAA Thanks, please go ahead and make the PR. If needed, please make an issue associated with this problem.

ricampos commented 3 years ago

Hi Jessica, I found the problem on Orion. When ww3_ounf is compiled with netcdf/4.7.4, the program crashes while writing the partitions with the message "NetCDF: Name contains illegal characters", as you saw. It partially writes the file (without partitions) and then stops, so the problematic netcdf file is still created. When I recompiled the model with netcdf/4.7.2, ww3_ounf worked nicely. All good. See the results at: /work/noaa/marine/ricardo.campos/models/WW3/regtests/ww3_ufs1.3/output

I compared the characters and text of the partition variables with those of the non-partition variables. I also tried editing w3ounfmetamd, but I couldn't make it work with netcdf/4.7.4, only with netcdf/4.7.2.

ricampos commented 3 years ago

From now on I will always use module load netcdf/4.7.2 in my jobscripts.

JessicaMeixner-NOAA commented 3 years ago

There was an issue when running the regtests on hera. I thought I had solved that problem, but I guess not, so there's no pull request for this branch yet.

@ricampos while it's great that netcdf/4.7.2 solves the problem, that's not an hpc-stack module, which is what we want to use. Let's make a new issue just for the netcdf problem on orion with the hpc-stack modules. If needed, we can create a simple test case to post in an issue on hpc-stack itself.

ricampos commented 3 years ago

Understood. But what if this is a netcdf/4.7.4 issue instead of a WW3 issue?

JessicaMeixner-NOAA commented 3 years ago

> Understood. But what if this is a netcdf/4.7.4 issue instead of a WW3 issue?

It works with netcdf/4.7.4 on hera. I'll make a new issue; let's continue this conversation there.

ricampos commented 3 years ago

ok