JMMP-Group / SEVERN-SWOT

Severn estuary 500m ocean model
MIT License

Model hangs #19

Closed jpolton closed 2 years ago

jpolton commented 3 years ago

Model seems to run but hangs without terminating properly

jpolton commented 3 years ago

Chris passed on insights from Adam: replace the modules in the make_xios.sh, make_nemo.sh and submit.slurm scripts, changing from

module -s restore /work/n01/shared/acc/n01_modules/ucx_env

to

module load cpe/21.03
module load cray-hdf5-parallel
module load cray-netcdf-hdf5parallel
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

In the slurm script this goes above the OMP_NUM_THREADS=1 line.
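The same swap has to be made in three scripts, so it may be easier to apply it mechanically. Below is a minimal sketch, not something from the repository: the helper name `swap_modules` and the example script fragment are hypothetical, and only the old and new module lines are taken from the comment above.

```python
# Old and new module lines, copied from the comment above.
OLD_LINE = "module -s restore /work/n01/shared/acc/n01_modules/ucx_env"
NEW_LINES = [
    "module load cpe/21.03",
    "module load cray-hdf5-parallel",
    "module load cray-netcdf-hdf5parallel",
    "export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH",
]


def swap_modules(script_text: str) -> str:
    """Return script_text with the old module-restore line replaced
    by the new module block (hypothetical helper, keeps indentation)."""
    out = []
    for line in script_text.splitlines():
        if line.strip() == OLD_LINE:
            indent = line[: len(line) - len(line.lstrip())]
            out.extend(indent + new for new in NEW_LINES)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical fragment of a submit script, for illustration only.
example = (
    "#!/bin/bash\n"
    "module -s restore /work/n01/shared/acc/n01_modules/ucx_env\n"
    "export OMP_NUM_THREADS=1\n"
)
print(swap_modules(example))
```

Because the old line is replaced in place, the new block ends up above OMP_NUM_THREADS=1 whenever the old line was already there.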

jpolton commented 3 years ago

Rebuilding nemo.exe and xios_server.exe as above on branch feature/new_module made no difference to the hanging

jpolton commented 3 years ago

Try investigating the allocation of nodes and all that. E.g. from Chris:

/work/n01/n01/cwi/mkslurm_hetjob -S 8 -s 16 -m 2 -C 831 -g 16 -N 128 -t 00:10:00 -a n01-CLASS -j SE-NEMO > runscript_831.slurm

The number of cores and gaps are the main things to vary here. 831 and 16 are a sweet spot for eORCA025 but other options may be better for a different NEMO configuration. Scroll to the bottom of https://docs.archer2.ac.uk/research-software/nemo/nemo.html for info.
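To try several core and gap combinations, the command line above can be generated rather than retyped. This is a rough sketch only: it assumes that "831 and 16" refer to the -C and -g values, keeps every other flag exactly as in Chris's example, and uses a hypothetical helper name and hypothetical sweep values. It only prints the commands; it does not run them.

```python
import shlex


def mkslurm_command(num_cores: int, gap: int, out_file: str) -> str:
    """Build the mkslurm_hetjob command line from the comment above,
    varying only -C and -g (assumed to be the cores and gap values)."""
    args = [
        "/work/n01/n01/cwi/mkslurm_hetjob",
        "-S", "8", "-s", "16", "-m", "2",
        "-C", str(num_cores), "-g", str(gap),
        "-N", "128", "-t", "00:10:00",
        "-a", "n01-CLASS", "-j", "SE-NEMO",
    ]
    return shlex.join(args) + " > " + shlex.quote(out_file)


# The first pair is from the thread; the others are hypothetical
# values to illustrate a sweep, not recommendations.
for cores, gap in [(831, 16), (400, 16), (400, 8)]:
    print(mkslurm_command(cores, gap, f"runscript_{cores}_g{gap}.slurm"))
```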

jpolton commented 3 years ago

@mpayopayo @micdom IT IS RUNNING!! On a new branch https://github.com/JMMP-Group/SEVERN-SWOT/tree/feature/new_modules I tried a couple of things:

  1. Try out the alternative modules as Chris suggested. This did not work (on its own)
  2. Rebuild the slurm script that does all the fancy MPI allocation stuff.

Together these got NEMO running and outputting again (though "1." above may not be necessary).

It didn't quite complete, possibly because of an issue with the domain, but this is progress.

jpolton commented 3 years ago

As a minimum-effort test of whether only the new slurm script was needed: copy https://github.com/JMMP-Group/SEVERN-SWOT/blob/feature/new_modules/RUN_DIRECTORIES/EXP_unforced/submit.slurm, swap the jelt entry in line 28 and n01-ACCORD in line 6 for your own, and swap the modules back (line 16 instead of lines 17-20).

Unfortunately I have already rebuilt my NEMO and XIOS executables using the new modules, so I haven't tested whether they were important or not.

micdom commented 3 years ago

@jpolton @mpayopayo I'm a bit lost. I'm trying to run the unforced run without the boundary file (which I didn't manage to build). I got an output, not sure what I got though... but I didn't change the submit.slurm script...

jpolton commented 3 years ago

> @jpolton @mpayopayo I'm a bit lost. I'm trying to run the unforced run without the boundary file (which I didn't manage to build). I got an output, not sure what I got though... but I didn't change the submit.slurm script...

@micdom Can you do a chmod a+rx -R /work/n01/n01/micdom

mpayopayo commented 3 years ago

@jpolton @micdom OK I'll try now just with the new submit slurm

micdom commented 3 years ago

@jpolton @mpayopayo done chmod a+rx -R /work/n01/n01/micdom

mpayopayo commented 3 years ago

@jpolton maybe silly, but I'm not at ease yet with git, do I have to do the test in a new branch?

jpolton commented 3 years ago

> @jpolton @mpayopayo I'm a bit lost. I'm trying to run the unforced run without the boundary file (which I didn't manage to build). I got an output, not sure what I got though... but I didn't change the submit.slurm script...

Looks like @micdom is the winner so far. Even got RESTART files written!! The run log is ocean.output. The XIOS output (defined in field_def_nemo-oce.xml) is SEVERN_unforced_1d_t.nc. Well done!

jpolton commented 3 years ago

> @jpolton maybe silly, but I'm not at ease yet with git, do I have to do the test in a new branch?

You could copy off the web page and paste it into your file.

jpolton commented 3 years ago

@micdom [Image: Screen_Capture_-_9_Jul__5_37_pm]

Elevations ~1e-12 m after 288 steps without forcing. Good job.

Riding high on this success, I'm calling it quits for the week before something goes wrong!

mpayopayo commented 3 years ago

@jpolton I'm getting a segmentation fault, maybe because I compiled and ran with different modules? I'm running with the bathy that misses the SW bit. @micdom is running with the "full" bathy, and I did not have problems with your bathy. Could it all come from the bathy and the domain?

I'll try next week generating the bathy again.

micdom commented 3 years ago

@jpolton @mpayopayo I'm using a different bathy with the SW bit!

dout.variables['elevation'][0:99,:] = 0
dout.variables['elevation'][0:200,650::] = 0

For the rest I've followed the instructions, made a last pull this afternoon, and just changed ln_bdy=.false. and nn_itend=288 in the namelist_cfg.

have a nice weekend!

micdom commented 3 years ago

@jpolton @mpayopayo I have not updated the wiki for the unforced run, but maybe I should.

The unforced run can be done without creating the boundary file first. It is sufficient to set ln_bdy=.false. in the namelist_cfg.

The section of the wiki Run Unforced can go before Make tidal boundary conditions.
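For anyone repeating the unforced run, the two namelist_cfg changes mentioned in this thread (ln_bdy=.false. and nn_itend=288) could be scripted. The following is a minimal sketch under stated assumptions: the helper `set_namelist_value` is hypothetical, it expects one `key = value` pair per line with optional `!` comments, and the sample namelist text and its original values are made up for illustration.

```python
import re


def set_namelist_value(text: str, key: str, value: str) -> str:
    """Replace the value of `key` in Fortran-namelist-style text,
    keeping indentation and any trailing '!' comment (hypothetical helper)."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*=[ \t]*)([^!\n]*?)([ \t]*(?:!.*)?)$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + value + m.group(3), text
    )
    if count == 0:
        raise KeyError(f"{key} not found in namelist text")
    return new_text


# Made-up sample; the real namelist_cfg has many more entries.
sample = (
    "&namrun\n"
    "   nn_itend    = 7200   ! last time step\n"
    "/\n"
    "&nambdy\n"
    "   ln_bdy      = .true.   ! use open boundaries\n"
    "/\n"
)

edited = set_namelist_value(sample, "ln_bdy", ".false.")
edited = set_namelist_value(edited, "nn_itend", "288")
print(edited)
```

Raising an error when the key is missing is a deliberate choice here, so that a silently unchanged namelist does not get mistaken for an unforced setup.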

mpayopayo commented 3 years ago

@jpolton, @micdom I'm redoing the bathy and the unforced run. I'm happy to modify the wiki afterwards.

mpayopayo commented 3 years ago

@jpolton it hangs or gives a segmentation fault with the cropped bathymetry but not with the full bathymetry, so I think that is where the problem is.