@ndkeen : Do you also have a Confluence page (or script) that includes all the settings (code, case-specific settings, etc.) that you use for the run?
(Scorpio is running out of memory here)
@ndkeen Could you please try to rerun that case with 3072 KNL nodes (reduce the MPI tasks per node so that each task has twice the memory to use)? Also, what is the value of PIO_BUFFER_SIZE_LIMIT for your case? The default value is 64 MB. Did you use `xmlchange PIO_BUFFER_SIZE_LIMIT=XXXX` to set a larger size instead? If so, you can try the default 64 MB when you use only 1536 KNL nodes.
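For reference, a minimal sketch of how these settings could be changed from the case directory (the tasks-per-node and buffer-size values below are illustrative, not the exact values used in this run):

```shell
# Run from the case directory; values below are illustrative only.

# Check the current Scorpio buffer size limit.
./xmlquery PIO_BUFFER_SIZE_LIMIT

# Revert to the 64 MB default (value given in bytes here).
./xmlchange PIO_BUFFER_SIZE_LIMIT=67108864

# Halve the MPI tasks per node so each task gets roughly twice the memory;
# the same total task count is then spread over twice as many nodes.
./xmlchange MAX_MPITASKS_PER_NODE=32

# A PE-layout change generally requires re-running case.setup and rebuilding.
./case.setup --reset
./case.build
```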
@ndkeen There is a pending feature branch to be merged into SCORPIO, which improves the load balancing of the BOX rearranger. You can try that branch for your case if possible (this also helps us to test it):
`cd externals/scorpio`
`git checkout dqwu/fix_box_rearr`
Yesterday I made a run with memory measurements, and we can see that the run is clearly running out of memory, which likely explains this and other issues. It is also clear that memory use increases during the simulation, which suggests a memory leak. Note that with previous attempts using the same configuration (i.e., the same script and PE layout, just different source code), we were able to run for 1 day, which is why we did not expect a memory issue here. I will continue debugging this issue and report back, but it doesn't look like a PIO issue after all.
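As an aside, a minimal sketch of one way to watch for this kind of growth: sample available memory on a compute node at a fixed interval while the model runs, then compare the trend against the simulation log afterwards (the script name and interval here are hypothetical, not what was actually used for this run):

```shell
#!/bin/bash
# mem_watch.sh (hypothetical helper): sample available memory every 60 s
# on the node where it runs; steadily decreasing values over the course of
# the run point to a leak rather than a one-time spike at startup or I/O.
LOG=${1:-mem_watch.log}
while true; do
    awk -v ts="$(date +%s)" '/MemAvailable/ {print ts, $2 " kB"}' /proc/meminfo >> "$LOG"
    sleep 60
done
```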
There is a Confluence page describing the run scripts, but I don't think they have been finalized. It is essentially the same script we have been using.
I typically set PIO_BUFFER_SIZE_LIMIT to 128M (as in this case that failed). I have also used 64M in the past but have not noticed any performance difference.
Thanks for the suggestion of using the improved load-balancing rearranger -- I would certainly like to try it, but may have to wait until it's in master.
Did we conclude that this was due to the aerosol optics memory leak? If so, can we close this issue? No need to close it if it might still be useful, but I am trying to get us below 78 open issues...
Yes, this was ultimately due to the memory leak and can be closed. I can't immediately find the PR to include here, but it's still a) not in E3SM master and b) not in SCREAM....
Fixed by E3SM #3932. Fixed in dyamond2 branch by #732.
Using a scream repo from Oct 30th, I have encountered 2 errors with ne1024 using a setup that is similar to previous runs (but does include changes to the source, etc). The first error was a SIGTERM that occurred at timestep 197 and did not include much else useful for diagnosing the issue. It did write a complete set of restart files. I restarted from those and the simulation ran further (to step 384), but failed with this error, which appears to occur during restart writing. This error has a clear stack trace, so I am documenting it here per @jayeshkrishna's suggestion.
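For context, a minimal sketch of how such a restart is typically resumed from the case directory (assuming the standard CIME continue-run workflow; the exact steps and values for this ne1024 run may differ):

```shell
# Run from the case directory; a sketch of the usual CIME continue-run flow,
# not necessarily the exact commands used for this case.

# Tell the model to continue from the most recently written restart files.
./xmlchange CONTINUE_RUN=TRUE

# Optionally shorten the segment length while debugging the failure window.
./xmlchange STOP_OPTION=nsteps,STOP_N=200

# Resubmit the job.
./case.submit
```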