ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Failure to create lnd_mesh.nc file from global 0.01x0.01 deg SCRIP file #2479

Open olyson opened 5 months ago

olyson commented 5 months ago

Brief summary of bug

Failure to create lnd_mesh.nc file from global 0.01x0.01 deg SCRIP file

General bug information

CTSM version you are using: NA

Does this bug cause significantly incorrect results in the model's science? No

Details of bug

As part of the steps required to create a global 0.01x0.01 surface dataset, one step that fails on Derecho is:

/glade/u/apps/derecho/23.09/spack/opt/spack/esmf/8.6.0/cray-mpich/8.1.27/oneapi/2023.2.1/7haa/bin/ESMF_Scrip2Unstruct /glade/work/oleson/release-cesm2.2.0/components/clm/tools/mkmapgrids/SCRIPgrid_36000x18000pt_Global_nomask_c240418.nc lnd_mesh.nc 0

This fails immediately with "killed"

Tried:

qcmd -- /glade/u/apps/derecho/23.09/spack/opt/spack/esmf/8.6.0/cray-mpich/8.1.27/oneapi/2023.2.1/7haa/bin/ESMF_Scrip2Unstruct /glade/work/oleson/release-cesm2.2.0/components/clm/tools/mkmapgrids/SCRIPgrid_36000x18000pt_Global_nomask_c240418.nc lnd_mesh.nc 0

For reference, the qcmd specifics are: qsub -l select=1:ncpus=32:mem=55GB -A P93300641 -q develop@desched1 -l walltime=01:00:00

This fails with:

Segmentation fault (core dumped)

Tried:

qsub -l select=1:ncpus=1:mem=235GB -q develop -A P93300641 -l walltime=01:00:00 -- /glade/u/apps/derecho/23.09/spack/opt/spack/esmf/8.6.0/cray-mpich/8.1.27/oneapi/2023.2.1/7haa/bin/ESMF_Scrip2Unstruct /glade/work/oleson/release-cesm2.2.0/components/clm/tools/mkmapgrids/SCRIPgrid_36000x18000pt_Global_nomask_c240418.nc lnd_mesh.nc 0

This fails with:

/glade/u/apps/derecho/23.09/spack/opt/spack/esmf/8.6.0/cray-mpich/8.1.27/oneapi/2023.2.1/7haa/bin/ESMF_Scrip2Unstruct: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory

I've tried some other combinations of approaches on both Derecho and Casper and none have worked.

I'll note that the creation of a 0.05x0.05 deg global lnd_mesh file does work, and the rest of the process to create a 0.05x0.05 surface dataset does work.
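To put the two resolutions side by side, here is a rough back-of-the-envelope estimate of how much memory the input arrays alone would need. This is only a sketch: the variable names follow the usual SCRIP layout, and the 4 corners per cell, 8-byte coordinates, and 4-byte mask are assumptions rather than values checked against the actual file.

```python
# Rough memory estimate for the main arrays in a SCRIP grid file.
# Assumptions (not taken from the file itself): 4 corners per cell,
# 8-byte doubles for coordinates, 4-byte integers for the mask.

def scrip_array_bytes(nlon, nlat, corners=4):
    """Return the approximate size in bytes of each main SCRIP array."""
    cells = nlon * nlat
    return {
        "grid_center_lat": cells * 8,
        "grid_center_lon": cells * 8,
        "grid_corner_lat": cells * corners * 8,
        "grid_corner_lon": cells * corners * 8,
        "grid_imask": cells * 4,
    }

def total_gb(nlon, nlat, corners=4):
    """Sum of all array sizes, in decimal gigabytes."""
    return sum(scrip_array_bytes(nlon, nlat, corners).values()) / 1e9

for label, nlon, nlat in [("0.05 deg", 7200, 3600), ("0.01 deg", 36000, 18000)]:
    cells = nlon * nlat
    print(f"{label}: {cells:,} cells, ~{total_gb(nlon, nlat):.1f} GB of input arrays")
```

Under these assumptions the 0.05 deg grid needs about 2.2 GB and the 0.01 deg grid about 54.4 GB just to hold the input, before any mesh connectivity is built. That is consistent with, though it does not prove, the process being killed for running out of memory.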

ekluzek commented 5 months ago

I have one thing to try @olyson. It looks like ESMF_Scrip2Unstruct IS a parallel program. So if you set up a batch job for it and give it multiple processors (and use a version of ESMF that uses a full MPI library), this likely WILL work.

The version you have above did link in a full MPI library. So you should just be able to modify the batch commands to use multiple processors, and to add "mpibind" in front of the call to ESMF_Scrip2Unstruct.

If it doesn't, we can point the ESMF people to it. But 0.05 degrees also seems acceptable for most uses...

olyson commented 5 months ago

Thanks for the suggestion @ekluzek . I made a batch script which works for 0.1x0.1 and 0.05x0.05, but 0.01 fails. I've tried various combinations of nodes, cores, memory. No errors in either the PET logs or in stderr/stdout most of the time. Occasionally I get an error like this in stdout:

dec1296.hsn.de.hpc.ucar.edu: rank -1 died from signal 9
dec1281.hsn.de.hpc.ucar.edu: rank 1 died from signal 15

On a side note, I use mkscripgrid.ncl to create the SCRIP file. I had to add the following to increase the file size:

Opt@LargeFile = True
Opt@NetCDFType = "netcdf4"

which creates a netcdf4 file. I tried to use nccopy to convert it to cdf5 but got this error:

NetCDF: Not a valid data type or _FillValue type mismatch

I eventually found that nccopy did not like the fact that "string" was prepended to the global attributes in the file, e.g.,

string :Createdby = "ESMF_regridding.ncl"

I deleted the global attributes and then nccopy worked to create a cdf5 file.
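For anyone repeating this workaround, the two steps could be scripted roughly as below. This is a sketch, not the exact commands I ran: it assumes NCO's ncatted is used to strip the global attributes, that nccopy accepts "nc5" as the kind name for CDF-5, and the file names are placeholders.

```python
import subprocess

def cdf5_conversion_commands(src, stripped, dst):
    """Build the two commands: strip global attributes, then convert to CDF-5."""
    return [
        # NCO: delete all global attributes, writing the result to a new file.
        ["ncatted", "-O", "-a", ",global,d,,", src, stripped],
        # netCDF utilities: copy to CDF-5 ("nc5" kind name is an assumption here).
        ["nccopy", "-k", "nc5", stripped, dst],
    ]

def convert(src, stripped, dst, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    cmds = cdf5_conversion_commands(src, stripped, dst)
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds

# Placeholder file names; the dry run only prints what would be executed.
convert("scrip_nc4.nc", "scrip_noattrs.nc", "scrip_cdf5.nc")
```

Calling convert(..., dry_run=False) would run the tools for real, which requires ncatted and nccopy to be on the PATH.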

For future reference, my script is here:

/glade/work/oleson/release-cesm2.1.3/components/clm/tools/mkmapgrids/submit_Scrip2Unstruct_derecho.csh

ekluzek commented 5 months ago

OK, I'm going to point the ESMF people to this.