Closed crjones-amath closed 5 years ago
Here's a similar error from my ne120 attempt (with intel) for reference:
A few more details:
More useful e3sm.log output:
1848: pio_support::pio_die:: myrank= -1 : ERROR: nf_mod.F90: 1508 : 1848: NetCDF: One or more variable sizes violate format constraints
Are any large arrays added to the restart file when running ne120+SP1?
Non-SP ne120 runs ok with PIO_TYPENAME=pnetcdf and PIO_NETCDF_FORMAT=64bit_offset.
The 64bit_data NetCDF format should be able to handle arrays of this size (as for ne240 on this page).
Give it a try as follows:
./xmlchange ATM_PIO_NETCDF_FORMAT="64bit_data"
I am not sure whether a change to env_run.xml will take effect in your code base. If not, cam_pio_utils.F90 can be modified as a temporary workaround.
Thanks @wlin7, I'll give it a shot! We are definitely writing large arrays -- specifically, CRM level 2D fields of dimension (crm_nx, crm_nz) in each GCM column, with typical values on the order of (32, 58).
@wlin7 I changed ATM_PIO_NETCDF_FORMAT to 64bit_data in env_run.xml using ./xmlchange ATM_PIO_NETCDF_FORMAT="64bit_data"
but unfortunately it still fails in the same way. Any other suggestions for what to test?
Thank you for testing, @crjones-amath. I also suspect that the 64bit_data setting is not taking effect, so I have not tried that path for a long time. Instead, I have always used my modified version of cam_pio_utils.F90 (spawned off the early exploration of the ne240 run). I put the file on NERSC: /global/cscratch1/sd/wlin/share/cam_pio_utils.F90
You may put the file under SourceMods/src.cam, and re-do ./case.build. Hope it works for you. Eventually, we need to figure out why setting it in env_run.xml does not work as intended.
A side note: increasing the array size of a global variable by another factor of crm_nx*crm_ny definitely kills it. When I dealt with this problem for ne240, the increase was only fourfold. I am eager to know whether this change in cam_pio_utils.F90 alone works for you.
based on this page:
you can see the 3 different file formats we can write with pnetcdf. I believe the 2nd is the default (CDF-2), for which the file can be > 2GB, but each individual record must be <= 2GB.
Can you verify that the array that restart wants to write is > 2GB?
From the comments above, it sounds like there is a bug when trying to use CDF-5 format. pinging @rljacob and @jayeshkrishna to see if they know about this bug.
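One way to check whether a written file actually ended up in CDF-5 format is to look at its first few bytes. The sketch below relies on the classic netCDF convention that files start with the ASCII characters "CDF" followed by a version byte (1, 2, or 5), and that netCDF-4 files carry the HDF5 signature. The function names `classify_magic` and `netcdf_format` are made up for this illustration and are not part of PIO or E3SM.

```python
# Minimal sketch: identify the on-disk netCDF variant from the file header.
# Function names are hypothetical; only the magic-number layout is standard.

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

CDF_VERSIONS = {
    1: "CDF-1 (classic)",
    2: "CDF-2 (64bit_offset)",
    5: "CDF-5 (64bit_data)",
}


def classify_magic(header: bytes) -> str:
    """Map the leading bytes of a file to a format label."""
    if len(header) >= 4 and header[:3] == b"CDF":
        version = header[3]
        return CDF_VERSIONS.get(version, f"unknown CDF version {version}")
    if header[:8] == HDF5_SIGNATURE:
        return "HDF5 (netCDF-4)"
    return "not a recognized netCDF file"


def netcdf_format(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return classify_magic(f.read(8))
```

Calling `netcdf_format(...)` on a restart file written after the `xmlchange` step would show whether the 64bit_data setting reached the writer or whether the file is still CDF-2.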
@mt5555 Thanks for pointing me to that reference page. I'm almost certain that the CRM-level arrays needing to be written are larger than 2GB: array size is at least (crm_nx * crm_nz * number_of_gll_nodes). In my case, that's (32 * 58 * 777602) = 1,443,229,312 elements. At double precision, that should be, what, about 11.5 GB?
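The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, the dimensions are the ones quoted in this comment, and the 2 GB threshold is taken as 2 GiB, which is an assumption about how the limit is counted.

```python
# Back-of-the-envelope size of one CRM-level field in the restart file.
# Dimensions come from the comment above; names are illustrative only.
crm_nx = 32
crm_nz = 58
n_gll_nodes = 777_602
bytes_per_value = 8  # double precision

n_elements = crm_nx * crm_nz * n_gll_nodes
n_bytes = n_elements * bytes_per_value

# Assumption: the "2GB" limit discussed in the thread is treated as 2 GiB.
limit_bytes = 2 * 1024**3

print(f"elements: {n_elements:,}")
print(f"size: {n_bytes / 1e9:.1f} GB")
print(f"exceeds 2 GiB: {n_bytes > limit_bytes}")
```

With these numbers a single field is several times larger than the limit, so the variable-size error is expected for any format with a 2 GB cap.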
I will test @wlin7's mods to cam_pio_utils.F90 and report back.
Update: using the modifications to cam_pio_utils.F90 provided by @wlin7 solves the restart-writing problem (!).
I haven't verified that the model can read from these restart files yet, but will test that shortly. For what it's worth, at ne120 using 32 CRM columns, the cam restart file sizes are:
129G --- ne120_SP1_72L_32x1_1km_nc720.cam.r.2000-01-01-01200.nc
40G --- ne120_SP1_72L_32x1_1km_nc720.cam.rh0.2000-01-01-01200.nc
3G --- ne120_SP1_72L_32x1_1km_nc720.cam.rh1.2000-01-01-01200.nc
1.1G --- ne120_SP1_72L_32x1_1km_nc720.cam.rh2.2000-01-01-01200.nc
47M --- ne120_SP1_72L_32x1_1km_nc720.cam.rh3.2000-01-01-01200.nc
Great you got it to work, @crjones-amath . We have every reason to expect the restart reading should also be fine. FYI, the cam.r file size for ne240 is nearly 430G.
ne120 SP1 simulations crash when writing restart files.
Example output from e3sm.log from a recent run on titan: