geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io

GCHP 14.3.1 out of memory when writing checkpoint files #413

Closed by YanshunLi-washu 1 month ago

YanshunLi-washu commented 2 months ago

Name: Yanshun Li Institution: Washu

Dear Support Team,

I have recently been running GCHP 14.3.1 on the NASA Pleiades cluster at C360 resolution for a global simulation.

The model ran well with an average throughput of 3.5 when using 504 cores (21 nodes x 24 cores/node).

However, when I increased the number of cores to 1200 (50 nodes x 24 cores/node), the model stopped while writing the first checkpoint file. I have encountered the same issue several times; the program stops right at the line where it writes the first checkpoint file. The error message received by email is listed below:

"Your Pleiades job 19364662.pbspl1.nas.nasa.gov terminated due to one or more nodes running out of memory. Node r515i2n3 ran out of memory and rebooted; others may have run out of memory as well."

There are no other error outputs in the log. The last few lines of the log are:

"Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4 Using parallel NetCDF for file: Restarts/gcchem_internal_checkpoint.20211002_1200z.nc4"

Relevant files from the run directory are attached: gchp_debug.zip

As far as I know, my colleague running GCHP 13.4 & 13.2 did not encounter similar bugs.

Based on the above info, could you kindly take a look at this issue?

Thanks, Yanshun

lizziel commented 2 months ago

Hi @YanshunLi-washu, here are a few things to try:

  1. Toggle the WRITE_RESTART_BY_OSERVER option in GCHP.rc. This is automatically set to YES by setCommonRunSettings.sh at high core counts, so to set it manually you will need to disable that auto-update by commenting out the relevant section of setCommonRunSettings.sh (see the illustrative snippet after this list).
  2. Try a short run and only output the final restart file. Do you see the same problem?
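For reference, here is a minimal, hedged sketch of what the manual override in item 1 looks like. The exact wording and location of the auto-update block in setCommonRunSettings.sh vary between versions, so treat the comments below as assumptions about the layout rather than the literal file contents.

```
# --- GCHP.rc (illustrative excerpt) ---
# Restart-write o-server toggle referenced in item 1; set to YES or NO:
WRITE_RESTART_BY_OSERVER: NO

# --- setCommonRunSettings.sh ---
# The section that auto-sets WRITE_RESTART_BY_OSERVER to YES at high core
# counts must be commented out, or it will overwrite the manual value above
# the next time the run script calls setCommonRunSettings.sh.
```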

I discussed the issue with the MAPL developers earlier today and they have not seen this problem. They also have access to Pleiades and could try to reproduce it. Get back to me about the above suggestions, and we may transfer this issue over to the GEOS-ESM/MAPL GitHub for them to look into further.

lizziel commented 2 months ago

I also learned a MAPL trick today that lets you run GCHP without actually stepping through time. This is useful for debugging restart-write issues since it bypasses model computation. It takes a bit of work because you have to turn off auto-updates in GCHP. These are the steps (a rough sketch of the config changes follows the list):

  1. Comment out this line in setCommonRunSettings.sh
  2. Check your GCHP run script to see whether the output file gcchem_internal_checkpoint is deleted. If it is, comment out that section of the script.
  3. Set JOB_SGMT to 00000000 000000 in CAP.rc.
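As a rough sketch of step 3 (and the run-script check in step 2), the relevant pieces look something like the following. The surrounding contents of CAP.rc and of your run script differ between setups, so this is illustrative only.

```
# --- CAP.rc (illustrative excerpt) ---
# A zero-length job segment (duration, given as YYYYMMDD HHMMSS) makes the
# run initialize, write the checkpoint, and exit without stepping in time:
JOB_SGMT: 00000000 000000

# --- GCHP run script (hypothetical excerpt) ---
# If your run script removes or renames gcchem_internal_checkpoint after the
# run completes, comment that section out so the checkpoint from the
# zero-step run is preserved for inspection.
```
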
YanshunLi-washu commented 2 months ago

Hi @lizziel, thank you so much for these suggestions. Pleiades has been a bit busy, so I haven't had a good opportunity to work on them yet. I'll try later and get back to you.

YanshunLi-washu commented 2 months ago

Hi @lizziel, I set WRITE_RESTART_BY_OSERVER to NO and it worked! Thanks for the suggestion!

lizziel commented 2 months ago

Excellent! I wonder if using that setting at high core counts is no longer valid with the newer MAPL we introduced in 14.0. I will investigate and update the model configuration and docs accordingly.

lizziel commented 2 months ago

@tclune, the issue we discussed last week is resolved by turning off the restart-write o-server. It was on by default for high core counts. It used to work with our older version of MAPL (2.23 I think, possibly earlier), but it seems to no longer be needed and actually makes the restart write hang. Is this expected behavior?

tclune commented 2 months ago

OK, I would still like @nasa-ben to look at this, just so we understand what is happening and can perhaps repair it.

bena-nasa commented 2 months ago

@lizziel @YanshunLi-washu There should not be any reason for the original (non-o-server) code path to hit that wall and run out of memory at a paltry 1000 cores; all that code does is MPI gathers. I can take a look, though I don't have any theories right now. This is NAS, so I assume you are using MPT? I know MPT has some oddities that we have found, and it can be an adventure, but for the time being it sounds like the o-server path gets you past your issue so you can run. We have found that, in general, one code path or the other performs better depending on the MPI stack, but running out of memory in that particular code path is a new one. Out of curiosity, how big is this GEOS-Chem checkpoint at c360?

YanshunLi-washu commented 2 months ago

Hi @bena-nasa, one checkpoint file at c360 is 133 GB. I'm not very familiar with MPT; @lizziel may have more insight!

bena-nasa commented 2 months ago

Hmm, that's not small. But clearly a single node has enough memory to fit it, since the write-by-o-server path worked. The fact that it worked with fewer cores is more puzzling, but this is almost certainly an MPI issue, as these things often are.

lizziel commented 2 months ago

It turns out this is a known issue from Liam that I forgot about: https://github.com/geoschem/GCHP/issues/117. See also the issue that prompted us to use the o-server for high core counts in the first place: https://github.com/GEOS-ESM/MAPL/issues/548. Based on the issue history, the o-server should only need to be turned on when using OpenMPI (in theory, not tested), and it makes GCHP fail during the checkpoint write on Pleiades. @bena-nasa, this all seems a bit strange. If you could look into whether there is a bug in the o-server, that would be helpful.

lizziel commented 1 month ago

I will update ReadTheDocs and the comments in the config files to be clearer about this issue and its resolution.