Hi @YanshunLi-washu, here are a few things to try:

Turn off the `WRITE_RESTART_BY_OSERVER` option in `GCHP.rc`. This should automatically get set to YES by `setCommonRunSettings.sh` for high core counts. To manually set it to NO you will need to disable that auto-update by commenting out the relevant section of the script (see the sketch below).

I discussed the issue with MAPL developers earlier today and they have not seen this issue. They also have access to Pleiades and could try to reproduce. Get back to me about the above suggestion and we might transfer this issue over to the GEOS-ESM/MAPL GitHub for them to look into further.
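For reference, a sketch of what the two changes look like, assuming the layout of a 14.x run directory (the exact lines, variable names, and core-count threshold may differ in your version):

```
# GCHP.rc (sketch): set the restart write o-server off
WRITE_RESTART_BY_OSERVER: NO
```

```bash
# setCommonRunSettings.sh (sketch): comment out the auto-update so the
# manual setting above is not overwritten at run time. This block is
# illustrative, not the script's exact contents.
#if [[ ${totalCores} -ge 1000 ]]; then
#    sed -i "s/WRITE_RESTART_BY_OSERVER:.*/WRITE_RESTART_BY_OSERVER: YES/" GCHP.rc
#fi
```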
I also learned a MAPL trick today in which you can run GCHP without actually stepping through time. This is useful for debugging restart write issues since you can bypass model computation. It takes a bit of work due to turning off auto-updates in GCHP. These are the steps:

1. Check whether `setCommonRunSettings.sh` is set up so that `gcchem_internal_checkpoint` is deleted. If it is deleted then you should comment out that section of the script.
2. Set `JOB_SGMT` to `00000000 000000` in file `CAP.rc` (see the sketch after this list).
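A minimal sketch of that `CAP.rc` edit, assuming the standard MAPL `YYYYMMDD HHMMSS` duration format:

```
# CAP.rc: a zero-length job segment makes the run initialize, write the
# checkpoint, and finalize without taking any model time steps
JOB_SGMT: 00000000 000000
```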
Hi @lizziel, thank you so much for these nice suggestions. Pleiades has been a bit busy, so I haven't had a good opportunity to work on these yet. I'll try later and get back to you.
Hi @lizziel, I set `WRITE_RESTART_BY_OSERVER` to NO and it worked! Thanks for the suggestion!
Excellent! I wonder if using that setting for high core counts is no longer valid with the newer MAPL we introduced in 14.0. I will investigate and update the model config and docs accordingly.
@tclune, the issue we discussed last week is resolved by turning off the restart write o-server. It was on by default for high core counts. It used to work with our older MAPL version (2.23 I think, possibly earlier) but now seems to no longer be needed, and actually makes the restart write hang. Is this expected behavior?
OK - I would still like @bena-nasa to look at this, just so we understand it and can maybe repair it.
@lizziel @YanshunLi-washu There should not be any reason that the original (non-oserver) code path should hit that wall and run out of memory at a paltry 1000 cores; all that code does is MPI gathers. I can take a look, though I don't have any theories right now. It's NAS, so I assume this is using MPT? I know MPT has some oddities that we have found and can be an adventure, but for the time being it sounds like the o-server path gets you past your issue so you can run. We have found that in general one code path or the other performs better depending on the MPI stack, but running out of memory in that particular code path is a new one. Out of curiosity, how big is this GEOS-Chem checkpoint at c360?
Hi @bena-nasa, one checkpoint file at c360 is 133 GB. I'm not very familiar with MPT; @lizziel may have more insights!
Hmm, that's not small. But clearly a single node has enough memory to fit it in since the write-by-oserver path worked. And the fact it worked with fewer cores is more puzzling but almost certainly an MPI issue as these things often are.
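For what it's worth, 133 GB is consistent with a quick back-of-envelope estimate, assuming roughly 300 double-precision 3-D fields on 72 levels in the internal state (the actual field count depends on the chemistry mechanism):

```bash
# Sketch: estimated checkpoint size at C360 under the assumptions above
# 6 faces x 360 x 360 columns x 72 levels x 8 bytes x ~300 fields
echo $(( 6 * 360 * 360 * 72 * 8 * 300 / 1000**3 ))   # prints 134 (~GB)
```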
It turns out this is a known issue from Liam that I forgot about: https://github.com/geoschem/GCHP/issues/117. See also this issue, which is why we added use of the o-server for high core counts in the first place: https://github.com/GEOS-ESM/MAPL/issues/548. Based on the issue history, the o-server should only need to be turned on when using OpenMPI (in theory, not tested), and it makes GCHP fail during checkpoint write on Pleiades. @bena-nasa, this all seems a bit strange. If you could look into whether there is a bug with the o-server that would be helpful.
I will update RTD and comments in config files to be more clear about this issue and the resolution.
Name: Yanshun Li
Institution: WashU
Dear Support Team,
I have recently been running GCHP 14.3.1 on the NASA Pleiades cluster at C360 resolution for a global simulation.
The model ran well with an average throughput of 3.5 when using 504 cores (21 nodes x 24 cores/node).
However, when I increased the number of cores to 1200 (50 nodes x 24 cores/node), the model stopped when writing the first checkpoint file. I have encountered the same issue several times; the program stops right at the line writing the first checkpoint file. The error message received by email is below:
"Your Pleiades job 19364662.pbspl1.nas.nasa.gov terminated due to one or more nodes running out of memory. Node r515i2n3 ran out of memory and rebooted; others may have run out of memory as well."
No other error outputs appear in the log. The last few lines in the log are:
"Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4 Using parallel NetCDF for file: Restarts/gcchem_internal_checkpoint.20211002_1200z.nc4"
Relevant files from the run directory are attached: gchp_debug.zip
As far as I know, my colleagues running GCHP 13.4 and 13.2 did not encounter similar bugs.
Based on the above info, could you kindly take a look at this issue?
Thanks, Yanshun