E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

F2010_chemUCI-Linozv3-mam5 compset crash on pm-cpu #5798

Open wagmanbe opened 1 year ago

wagmanbe commented 1 year ago

Case on perlmutter: /global/cfs/cdirs/e3sm/emulate/E3SM_simulations/replicate.F2010.v3atm_on_master.pm-cpu/

Branch: v3atm/eam/master_MAM5_wetaero_chemdyg

e3sm log error:

896: PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 180) to file (./replicate.F2010.v3atm_on_master.pm-cpu.elm.h0.0001-04.nc, ncid=186) using PIO_IOTYPE_PNETCDF iotype failed. Non blocking write for variable (SNOWDP, varid=157) failed (Number of subarray requests/regions=1, Size of data local to this process = 2700). NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/u2/w/wagmanbe/E3SM/code/20230622/externals/scorpio/src/clib/pio_darray_int.c: 395)

This case is my attempt to replicate a simulation on pm-cpu by @wlin7, found here: /pscratch/sd/w/wlin/E3SMv3_dev/20230627.F2010.v3atm_on_master.pm-cpu/. His simulation was meant to be a pm-cpu version of this case: run.20230619b.v3alpha01.F2010.chrysalis.sh, which runs on chrysalis. @wlin7's pm-cpu case ran 30 days before timing out, as intended. I find that if the run goes longer (3 months here, 15 months when I initialize with a different atmosphere), it crashes with the ELM error above.

wagmanbe commented 1 year ago

I suspected (incorrectly; see subsequent comments) that the error was caused by writing simulation output to /cfs instead of /pscratch.

wlin7 commented 1 year ago

Hi @ndkeen, do you have any idea what could explain this behavior: a case created to run on PSCRATCH is OK, but fails prematurely (after several months) when run on cfs? I remember that when PSCRATCH was down for maintenance, you used to create cases to run on cfs.

Note that the issue @wagmanbe reported above was not a one-time thing. He reproduced the problem many times, always running on cfs. Benj, did all those runs fail at about the same time? Were they BFB?

wagmanbe commented 1 year ago

To complicate things, I just ran successfully on /cfs for the first time:

/global/cfs/cdirs/e3sm/emulate/E3SM_simulations/20230710.replicate.F2010.v3atm_on_master.pm-cpu.8.cfs

I'll re-run the failed case mentioned when I opened the issue:

/global/cfs/cdirs/e3sm/emulate/E3SM_simulations/replicate.F2010.v3atm_on_master.pm-cpu/

as

/global/cfs/cdirs/e3sm/emulate/E3SM_simulations/replicate.F2010.v3atm_on_master.pm-cpu.retry/

If it runs, then maybe it was a machine or cfs issue that has been resolved?

wlin7 commented 1 year ago

Thanks for the update, @wagmanbe . Good to see you can run it successfully now. Not sure if there was indeed a machine/cfs issue that was quietly resolved. Please feel free to put this aside if you do not plan to regularly run jobs on cfs.

wlin7 commented 1 year ago

@wagmanbe , unfortunately we still need to deal with this issue, and it probably has nothing to do with cfs.

I extended my F2010 simulation /pscratch/sd/w/wlin/E3SMv3_dev/20230627.F2010.v3atm_on_master.pm-cpu, aiming for 10 years. It failed during the 4th year in the same way you reported, on the SNOWDP conversion, presumably because of an invalid value. Such unpredictable behavior would be very cumbersome for your PPE simulations.

The same 5-year run is OK on chrysalis. If your pm-cpu run continues to have problems, try pm-cpu_intel. By default, pm-cpu_gnu is used.

(FYI: Error log from my run: /pscratch/sd/w/wlin/E3SMv3_dev/20230627.F2010.v3atm_on_master.pm-cpu/run/e3sm.log.11514244.230711-225702)
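
The "invalid value" hypothesis can be checked offline on any dump of the field. In netCDF, err=-60 is NC_ERANGE, which is raised when a value cannot be converted to the on-disk type. The sketch below assumes (this is not confirmed in the thread) that SNOWDP is held in double precision in memory and written as single precision. The function name `find_unrepresentable` is hypothetical. Non-finite values are flagged as well, since they usually point to an upstream problem, although whether they trigger this exact error depends on the I/O library.

```python
import math

# Largest finite IEEE 754 single-precision value.
FLT_MAX = 3.4028234663852886e38


def find_unrepresentable(values):
    """Return (index, value) pairs that would not fit in a float32.

    Hypothetical helper: flags NaN, +/-Inf, and finite values whose
    magnitude exceeds the float32 range.
    """
    bad = []
    for i, v in enumerate(values):
        if not math.isfinite(v) or abs(v) > FLT_MAX:
            bad.append((i, v))
    return bad


if __name__ == "__main__":
    # Made-up sample data, not taken from the failing run.
    sample = [0.0, 0.35, 1.0e39, -1.0e300, 1.0e36]
    for index, value in find_unrepresentable(sample):
        print(f"index {index}: {value!r} is not representable as float32")
```

Running a check like this on the SNOWDP field just before the history write (or on a restart file) would show whether the crash comes from a genuinely out-of-range value rather than from the filesystem.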

wagmanbe commented 1 year ago

@wlin7, thank you for confirming the error on pm-cpu on /pscratch. I will try pm-cpu_intel. I hope we can get to the bottom of this soon.

ndkeen commented 1 year ago

I think you are correct, Wuyin -- it should not matter which filesystem is being written to. There may be some performance differences or different types of rare errors between the two, but the above error does not look like an issue that would result from writing to CFS alone.

wagmanbe commented 1 year ago

I also replicated the SNOWDP error on pm-cpu writing to /pscratch with the gnu compiler. I was not able to set up a case using the intel compiler but will keep working on that. Here is the SNOWDP error again. This is holding up the autotuning simulations, so hopefully we can find a fix soon:

PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 180) to file (./20230710.v3alpha01.F2010.pmcpu.8N.elm.h0.0002-04.nc, ncid=219) using PIO_IOTYPE_PNETCDF iotype failed. Non blocking write for variable (SNOWDP, varid=157) failed (Number of subarray requests/regions=1, Size of data local to this process = 2700). NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/u2/w/wagmanbe/E3SM/code/20230710/externals/scorpio/src/clib/pio_darray_int.c: 395)

The case is in my /pscratch which is shared to another group, but @wlin7 's case should be just as good for debugging. /pscratch/sd/w/wlin/E3SMv3_dev/20230627.F2010.v3atm_on_master.pm-cpu/run/e3sm.log.11514244.230711-225702
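
When comparing many failed runs, it can help to pull the file, variable, and error code out of each e3sm.log automatically. The following is a minimal sketch that matches the message wording quoted in this thread. The pattern and the function name `parse_pio_write_error` are assumptions for illustration, and the wording may differ between SCORPIO versions.

```python
import re

# Matches the PIO abort message as quoted in this thread (assumed format).
PIO_WRITE_ERROR = re.compile(
    r"to file \((?P<file>[^,]+), ncid=(?P<ncid>\d+)\).*?"
    r"variable \((?P<var>\w+), varid=(?P<varid>\d+)\) failed.*?"
    r"\(err=(?P<err>-?\d+)\)",
    re.DOTALL,
)


def parse_pio_write_error(text):
    """Return a dict describing the first PIO write failure, or None."""
    match = PIO_WRITE_ERROR.search(text)
    if match is None:
        return None
    return {
        "file": match["file"],
        "ncid": int(match["ncid"]),
        "var": match["var"],
        "varid": int(match["varid"]),
        "err": int(match["err"]),
    }


if __name__ == "__main__":
    # Excerpt of the log message quoted above.
    log_line = (
        "PIO: FATAL ERROR: Aborting... An error occured, Writing variables "
        "(number of variables = 180) to file "
        "(./20230710.v3alpha01.F2010.pmcpu.8N.elm.h0.0002-04.nc, ncid=219) "
        "using PIO_IOTYPE_PNETCDF iotype failed. Non blocking write for "
        "variable (SNOWDP, varid=157) failed (Number of subarray "
        "requests/regions=1, Size of data local to this process = 2700). "
        "NetCDF: Numeric conversion not representable (err=-60)."
    )
    print(parse_pio_write_error(log_line))
```

Collecting the parsed fields across runs would show whether the failure always involves SNOWDP and the same history month, which bears on the earlier question of whether the failures are reproducible.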

quantheory commented 10 months ago

@wlin7 @wagmanbe Did either of you figure anything else out about this bug? This same SNOWDP write crash is occurring in the coupled run being used to evaluate the gustiness PR.

wlin7 commented 10 months ago

@quantheory, we talked about this issue re-emerging with the gustiness PR on a different machine/compiler. There was no further investigation of this pm-cpu_gnu issue after switching the default to intel. Indeed, we can revisit this issue as part of the effort to investigate the new crash with the gustiness PR.

quantheory commented 5 months ago

Has anyone seen this crash occur since #6034 was merged? I wonder if we can close this.

wagmanbe commented 5 months ago

I have not seen the crash since then, but I have not run enough simulations to answer whether it can be closed.