E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

PIO: FATAL ERROR with ne1024 during restart write -- identified as a memory leak issue #713

Closed by ndkeen 3 years ago

ndkeen commented 4 years ago

Using a scream repo of Oct 30th, I have encountered two errors with ne1024 using a setup that is similar to previous runs (but does include changes to source, etc.). The first error was a SIGTERM that occurred at timestep 197 and did not include much else useful for diagnosing the issue. It did write a complete set of restart files. I restarted from those and the simulation ran further (to step 384), but failed with the error below, which appears to occur during restart writing. This error has a clear stack trace, so documenting here per @jayeshkrishna's suggestion.

272: PIO: FATAL ERROR: Aborting... An error occured, Writing multiple variables to file (f.ne1024pg2tri.s33-oct30.F10DY2.n1536p12288t08xX11181.ne03.snow.12h.wr.st096M1.eam.r.2020-01-20-28800.nc, ncid=90) failed. Out of memory (Trying to allocate 1610612736 bytes for rearranged data for multiple variables with the same decomposition). err=-61. Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/cscratch1/sd/ndk/wacmy/s33-oct30/externals/scorpio/src/clib/pio_darray.c: 281)
  272: Obtained 10 stack frames.
  272: [0x217ed04]
  272: [0x21814c0]
  272: [0x21bbac2]
  272: [0x21c5464]
  272: [0x217629a]
  272: [0x20e042e]
  272: [0xa34685]
  272: [0x543c03]
  272: [0x4ffc24]
  272: [0x4eba5e]
  272: Rank 272 [Sun Nov  1 05:39:30 2020] [c2-1c0s11n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 272
  272: forrtl: error (76): Abort trap signal
  272: Image              PC                Routine            Line        Source
  272: e3sm.exe           0000000003739294  Unknown               Unknown  Unknown
  272: e3sm.exe           00000000033AB2B0  Unknown               Unknown  Unknown
  272: e3sm.exe           00000000037FEA50  Unknown               Unknown  Unknown
  272: e3sm.exe           0000000003C50421  Unknown               Unknown  Unknown
  272: e3sm.exe           00000000034DFD12  Unknown               Unknown  Unknown
  272: e3sm.exe           00000000034AD86B  Unknown               Unknown  Unknown
  272: e3sm.exe           000000000217ED7E  Unknown               Unknown  Unknown
  272: e3sm.exe           00000000021814C0  Unknown               Unknown  Unknown
  272: e3sm.exe           00000000021BBAC2  Unknown               Unknown  Unknown
  272: e3sm.exe           00000000021C5464  Unknown               Unknown  Unknown
  272: e3sm.exe           000000000217629A  Unknown               Unknown  Unknown
  272: e3sm.exe           00000000020E042E  piolib_mod_mp_fre         983  piolib_mod.F90
  272: e3sm.exe           0000000000A34685  restart_dynamics_         309  restart_dynamics.F90
  272: e3sm.exe           0000000000543C03  cam_restart_mp_ca         240  cam_restart.F90
  272: e3sm.exe           00000000004FFC24  cam_comp_mp_cam_r         389  cam_comp.F90
  272: e3sm.exe           00000000004EBA5E  atm_comp_mct_mp_a         565  atm_comp_mct.F90
  272: e3sm.exe           0000000000422DE9  component_mod_mp_         737  component_mod.F90
  272: e3sm.exe           000000000040443B  cime_comp_mod_mp_        2823  cime_comp_mod.F90
  272: e3sm.exe           00000000004229D3  MAIN__                    133  cime_driver.F90
  272: e3sm.exe           0000000000401F72  Unknown               Unknown  Unknown
  272: e3sm.exe           0000000003C469CF  Unknown               Unknown  Unknown
  272: e3sm.exe           0000000000401E5A  Unknown               Unknown  Unknown
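
The raw addresses in the "Obtained 10 stack frames" section above can in principle be resolved to source locations with binutils' addr2line, assuming the exact e3sm.exe binary that produced the trace is still available and was built with debug info (a sketch, not a step taken in this thread):

  # Map raw frame addresses back to function names and source lines.
  # Requires the same binary that produced the trace, built with -g.
  addr2line -f -e e3sm.exe 0x217ed04 0x21814c0 0x21bbac2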

/global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/s33-oct30/f.ne1024pg2tri.s33-oct30.F10DY2.n1536p12288t08xX11181.ne03.snow.12h.wr.st096M1
jayeshkrishna commented 4 years ago

@ndkeen : Do you also have a confluence page (or script) that includes all the settings (code, case-specific settings, etc.) that you use for the run?

(Scorpio is running out of memory here)

dqwu commented 4 years ago

@ndkeen Could you please try to rerun that case with 3072 KNL nodes (reducing the MPI tasks per node so that each task has twice the memory available)? Also, what is the value of PIO_BUFFER_SIZE_LIMIT for your case? The default value is 64 MB. Did you use "xmlchange PIO_BUFFER_SIZE_LIMIT=XXXX" to set a larger size instead? If so, you can try the default 64 MB when you use only 1536 KNL nodes.
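
For reference, the failed allocation in the trace above is a single request of 1610612736 bytes, i.e. 1.5 GiB, for the rearranged data. A minimal sketch of checking and adjusting the buffer limit, assuming a standard CIME case where PIO_BUFFER_SIZE_LIMIT is given in bytes (paths and values here are illustrative):

  # Check and adjust the Scorpio write buffer limit; run from the case
  # directory. Assumes the limit is specified in bytes:
  # 67108864 = 64 MB (default), 134217728 = 128 MB.
  ./xmlquery PIO_BUFFER_SIZE_LIMIT            # show the current setting
  ./xmlchange PIO_BUFFER_SIZE_LIMIT=67108864  # try the 64 MB default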

dqwu commented 4 years ago

@ndkeen There is a pending feature branch to be merged to SCORPIO, which improves the load balancing of the BOX rearranger. You can try that branch for your case if possible (this also helps us test it):

  cd externals/scorpio
  git checkout dqwu/fix_box_rearr
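
After switching the scorpio external to that branch, the case would need a clean rebuild before resubmitting; a sketch assuming a standard CIME case (flag spellings can vary between CIME versions):

  # Rebuild after changing the scorpio external; run from the case directory.
  ./case.build --clean-all
  ./case.build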

ndkeen commented 4 years ago

Yesterday I made a run with memory measurements, and we see the run is clearly running out of memory, which likely explains this and other issues. It is also clear that memory use is increasing during the simulation, which suggests a memory leak. Note that with previous attempts using the same configuration (i.e. same script and PE layout, just different source code), we were able to run for 1 day, which is why we did not expect a memory issue here. I will continue debugging this issue and report back, but it doesn't look like a PIO issue after all.
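
(The measurement method is not described here; one minimal way to watch for this kind of steady growth, assuming interactive access to a compute node while the model runs and not necessarily what was used, is to sample /proc/meminfo over time:)

  # Sample available node memory once a minute; a memory leak shows up
  # as a steady decline in MemAvailable across the run.
  while true; do
      echo "$(date +%T) $(grep MemAvailable /proc/meminfo)"
      sleep 60
  done >> mem_trace.log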

There is a confluence page describing the run scripts, but I don't think they have been finalized. It is essentially the same script we have been using.

I typically set PIO_BUFFER_SIZE_LIMIT to 128M (as in this case that failed). I have also used 64M in the past but have not noticed any performance difference.

Thanks for the suggestion of using the improved load-balancing rearranger -- I would certainly like to try it, but may have to wait until it's in master.

PeterCaldwell commented 3 years ago

Did we conclude that this was due to the aerosol optics memory leak? If so, can we close this issue? No need to close it if it might still be useful, but I am trying to reduce us below 78 open issues...

ndkeen commented 3 years ago

Yes, this was ultimately due to the memory leak and can be closed. I can't immediately find the PR to include here, but it's still a) not in E3SM master and b) not in SCREAM....

PeterCaldwell commented 3 years ago

Fixed by E3SM #3932. Fixed in the dyamond2 branch by #732.