E3SM-Project / ACME-ECP

E3SM MMF for DoE ECP project

ne120 crashes when writing restart files #72

Closed crjones-amath closed 5 years ago

crjones-amath commented 5 years ago

ne120 SP1 simulations crash when writing restart files.

Example output from e3sm.log from a recent run on titan:

forrtl: error (76): Abort trap signal
Image              PC                Routine            Line     Source
e3sm.exe           00000000029B4C7E  Unknown            Unknown  Unknown
libpthread-2.11.3  00002AAAB1756850  Unknown            Unknown  Unknown
libc-2.11.3.so     00002AAAB1C0F875  gsignal            Unknown  Unknown
libc-2.11.3.so     00002AAAB1C10E51  abort              Unknown  Unknown
libmpich_intel.so  00002AAAB07802C8  Unknown            Unknown  Unknown
libmpich_intel.so  00002AAAB068EF66  MPI_Abort          Unknown  Unknown
libmpich_intel.so  00002AAAB06D6BA5  mpi_abort          Unknown  Unknown
e3sm.exe           00000000025D1AE4  pio_support_mp_pi  124      pio_support.F90
e3sm.exe           00000000025CFAD5  pio_utils_mp_chec  59       pio_utils.F90
e3sm.exe           00000000025B7D6E  nf_mod_mp_pio_end  1508     nf_mod.F90
e3sm.exe           0000000000548DA8  cam_restart_mp_ca  230      cam_restart.F90
e3sm.exe           00000000004FE6E9  cam_comp_mp_cam_r  396      cam_comp.F90
e3sm.exe           00000000004E5D02  atm_comp_mct_mp_a  509      atm_comp_mct.F90
e3sm.exe           0000000000431ECA  component_modmp    728      component_mod.F90
e3sm.exe           0000000000418D5F  cime_comp_modmp    3386     cime_comp_mod.F90
e3sm.exe           0000000000431BEF  MAIN__             103      cime_driver.F90

whannah1 commented 5 years ago

Here's a similar error from my ne120 attempt (with intel) for reference: [image]

crjones-amath commented 5 years ago

A few more details:

More useful e3sm.log output:

1848: pio_support::pio_die:: myrank= -1 : ERROR: nf_mod.F90: 1508 : 1848: NetCDF: One or more variable sizes violate format constraints

wlin7 commented 5 years ago

Are any large arrays added to the restart file when running ne120+SP1?

Non-SP ne120 runs ok with PIO_TYPENAME=pnetcdf and PIO_NETCDF_FORMAT=64bit_offset.

The 64bit_data NetCDF format should be able to handle arrays of this size (as was done for ne240 on this page).

Give it a try as follows:

./xmlchange ATM_PIO_NETCDF_FORMAT="64bit_data"

I am not sure whether a change to env_run.xml will take effect in your code base. If not, cam_pio_utils.F90 can be modified as a temporary workaround.

crjones-amath commented 5 years ago

Thanks @wlin7, I'll give it a shot! We are definitely writing large arrays -- specifically, CRM level 2D fields of dimension (crm_nx, crm_nz) in each GCM column, with typical values on the order of (32, 58).

crjones-amath commented 5 years ago

@wlin7 I changed ATM_PIO_NETCDF_FORMAT to 64bit_data in env_run.xml using ./xmlchange ATM_PIO_NETCDF_FORMAT="64bit_data"

but unfortunately it still fails in the same way. Any other suggestions for what to test?

wlin7 commented 5 years ago

Thank you for testing, @crjones-amath. I also suspect that the 64bit_data setting is not taking effect, so I have not tried that path for a long time. Instead, I have always used my modified version of cam_pio_utils.F90 (spun off from the early exploration of the ne240 run). I put the file on NERSC: /global/cscratch1/sd/wlin/share/cam_pio_utils.F90

You can put the file under SourceMods/src.cam and rerun ./case.build. I hope it works for you. Eventually, we need to figure out why setting it in env_run.xml does not work as intended.

wlin7 commented 5 years ago

A side note: increasing the array size of a global variable by another factor of crm_nx*crm_ny definitely kills it. When I dealt with this problem for ne240, the increase was only a factor of 4. I am eager to know whether this change in cam_pio_utils.F90 alone works for you.

mt5555 commented 5 years ago

based on this page:

https://acme-climate.atlassian.net/wiki/spaces/EIDMG/pages/769130507/Picking+a+netcdf+type+for+all+input+files

you can see the 3 different file formats we can write with pnetcdf. I believe the 2nd is the default (CDF-2), for which the file can be > 2GB, but each individual record must be <= 2GB.

Can you verify that the array that restart wants to write is > 2GB?

From the comments above, it sounds like there is a bug when trying to use CDF-5 format. pinging @rljacob and @jayeshkrishna to see if they know about this bug.
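Since the question here is whether the CDF-5 setting actually takes effect, one way to check what a written file really is, independent of env_run.xml, is to look at its first four bytes. The sketch below is only an illustration, not part of E3SM or PIO: the function names are made up for this example, and it relies only on the standard netCDF magic numbers (`CDF` followed by a version byte of 1, 2, or 5, or the HDF5 signature for netCDF-4 files).

```python
import sys

# Leading bytes of a netCDF file identify its on-disk format.
# The names in parentheses are the PIO_NETCDF_FORMAT values used in this thread.
MAGIC_TO_FORMAT = {
    b"CDF\x01": "CDF-1 (classic)",
    b"CDF\x02": "CDF-2 (64bit_offset)",
    b"CDF\x05": "CDF-5 (64bit_data)",
    b"\x89HDF": "netCDF-4 (HDF5)",
}


def format_from_magic(magic):
    """Map the first four bytes of a file to a format label (hypothetical helper)."""
    return MAGIC_TO_FORMAT.get(bytes(magic[:4]), "unknown")


def netcdf_format(path):
    """Read the first four bytes of `path` and report the netCDF format."""
    with open(path, "rb") as f:
        return format_from_magic(f.read(4))


if __name__ == "__main__":
    # Usage: python check_format.py file1.nc file2.nc ...
    for path in sys.argv[1:]:
        print(path, "->", netcdf_format(path))
```

Running this on a restart file that was written successfully would show whether it came out as CDF-2 or CDF-5, which helps separate "the setting is ignored" from "the setting is applied but the write still fails".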

crjones-amath commented 5 years ago

@mt5555 Thanks for pointing me to that reference page. I'm almost certain that the CRM-level arrays needing to be written are larger than 2GB: array size is at least (crm_nx * crm_nz * number_of_gll_nodes). In my case, that's (32 * 58 * 777602) = 1,443,229,312 elements. At double precision, that should be, what, about 11.5 GB?

I will test @wlin7's mods to cam_pio_utils.F90 and report back.
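For reference, the size estimate above can be reproduced with a few lines of arithmetic. This is just a sketch: the dimensions are the ones quoted in this comment, 8 bytes per value assumes double precision, and the 2 GB threshold is the per-record figure mentioned earlier in the thread rather than a value taken from the format specification.

```python
# Dimensions quoted in the thread for ne120 with a 32-column CRM.
CRM_NX = 32
CRM_NZ = 58
N_GLL_NODES = 777602

BYTES_PER_DOUBLE = 8                 # assumes the field is written in double precision
QUOTED_LIMIT_BYTES = 2 * 1024**3     # the "2GB" figure quoted in the thread (assumption)


def field_elements(nx, nz, ncol):
    """Number of values in one CRM-level field over all GCM columns."""
    return nx * nz * ncol


def field_bytes(nx, nz, ncol, bytes_per_value=BYTES_PER_DOUBLE):
    """Size in bytes of one CRM-level field."""
    return field_elements(nx, nz, ncol) * bytes_per_value


n_elem = field_elements(CRM_NX, CRM_NZ, N_GLL_NODES)   # 1,443,229,312
n_bytes = field_bytes(CRM_NX, CRM_NZ, N_GLL_NODES)     # 11,545,834,496

print(f"elements: {n_elem:,}")
print(f"bytes:    {n_bytes:,} (~{n_bytes / 1e9:.1f} GB)")
print(f"exceeds quoted limit: {n_bytes > QUOTED_LIMIT_BYTES}")
```

So a single CRM-level field is several times larger than the quoted limit, which is consistent with the "variable sizes violate format constraints" error.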

crjones-amath commented 5 years ago

Update: using the modifications to cam_pio_utils.F90 provided by @wlin7 solves the restart-writing problem (!).

I haven't verified that the model can read from these restart files yet, but will test that shortly. For what it's worth, at ne120 using 32 CRM columns, the cam restart file sizes are:

129G --- ne120_SP1_72L_32x1_1km_nc720.cam.r.2000-01-01-01200.nc
40G --- ne120_SP1_72L_32x1_1km_nc720.cam.rh0.2000-01-01-01200.nc
3G --- ne120_SP1_72L_32x1_1km_nc720.cam.rh1.2000-01-01-01200.nc
1.1G --- ne120_SP1_72L_32x1_1km_nc720.cam.rh2.2000-01-01-01200.nc
47M --- ne120_SP1_72L_32x1_1km_nc720.cam.rh3.2000-01-01-01200.nc

wlin7 commented 5 years ago

Great that you got it to work, @crjones-amath. We have every reason to expect that reading the restart files will also be fine. FYI, the cam.r file size for ne240 is nearly 430G.