GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

Rome and 15-minute History Collection: Logging in IOServer? #819

Open mathomp4 opened 3 years ago

mathomp4 commented 3 years ago

This is a query for @weiyuan-jiang with a bit of pinging of @tclune as it's pFlogger related. (And maybe the rest of @GEOS-ESM/mapl-team ...)

The issue is something I've been seeing at NAS. I was able to run a C720 job at 15x360 for 1 day with the "regular" AGCM history on the AMD Rome nodes using:

--oserver_type multigroup --nodes_output_server 3 --npes_backend_pernode 20

at the suggestion of @weiyuan-jiang.

But, as a test, I decided to do something like what Dimitris has been running (@atrayano knows about this), which is a long job with 15-minute collections. So I took geosgcm_prog and copied it to make an additional geosgcm_prog_15mn collection, which outputs every 15 minutes.

I then tried a 5-model-day run with the 15-minute collection using:

--oserver_type multigroup --nodes_output_server 3 --npes_backend_pernode 20

(the same as above) and the job crashed "silently" about one model day in (at 2015/04/15 21z-ish). NAS Support said it looked like a node had rebooted.

So I resubmitted with 6 output-server nodes and 20 backend processes per node:

--oserver_type multigroup --nodes_output_server 6 --npes_backend_pernode 20

This time it ran for about 4 model days and then crashed at 2015/04/18 22z-ish. Per Johnny Chang of NAS:

That node, r202c4t6n2, crashed and rebooted around 21:53:52. There are no indications for why the node crashed. It is the second to last node for your job. If that is an I/O node and there is a high output pressure that fills up memory in the form of buffer cache, then that can cause the node to run out of memory. Again, this is just a guess. There are no clues for why the node crashed.

Could adding the 15-minute collection be causing a "pile up" on the IOserver nodes? So what I wondered is: is there a way to get pFlogger to write out something like:

File stock-gcm-2021Apr22-1day-c720-15x360-ROME.geosgcm_prog_15mn.20150415_2000z.nc4 started
...
File stock-gcm-2021Apr22-1day-c720-15x360-ROME.geosgcm_prog_15mn.20150415_2000z.nc4 finished

in a log file somewhere? It's hard to debug the I/O nodes, so maybe they could tell us what's happening. Then we could search the log to see whether a whole lot of files have started writing but haven't finished.
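
A minimal sketch of that search, in Python. It assumes the log contains exactly the "File <name> started" / "File <name> finished" lines proposed above; those messages are only the proposal here, not something the IOserver is known to emit, and the function name and sample file names are hypothetical.

```python
import re

# Assumed log format (from the proposal above, not an existing MAPL message):
#   File <name> started
#   File <name> finished
# search() is used so a leading timestamp or logger prefix does not matter.
LINE_RE = re.compile(r"File (?P<name>\S+) (?P<event>started|finished)\s*$")


def unfinished_files(lines):
    """Return files that logged 'started' but never 'finished', in start order."""
    pending = {}  # a dict keeps insertion order, so results stay chronological
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # unrelated log line
        name = match.group("name")
        if match.group("event") == "started":
            pending[name] = True
        else:
            pending.pop(name, None)  # ignore a 'finished' with no 'started'
    return list(pending)


# Made-up sample lines for illustration only.
sample = [
    "File exp.geosgcm_prog_15mn.20150415_2000z.nc4 started",
    "File exp.geosgcm_prog_15mn.20150415_2015z.nc4 started",
    "File exp.geosgcm_prog_15mn.20150415_2000z.nc4 finished",
]
print(unfinished_files(sample))
```

For a real log, the same function could be fed an open file object, since it only iterates over lines. A steadily growing list of unfinished files shortly before the crash would support the "pile up" guess; a short list would argue against it.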

Then again, if I look, I do see:

stock-gcm-2021Apr22-1day-c720-15x360-ROME.geosgcm_prog_15mn.20150418_2015z.nc4

in the scratch directory and it has actual data! So... 🤷🏼‍♂️

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

mathomp4 commented 3 years ago

@tclune @weiyuan-jiang Do you think this is possible with pFlogger?

tclune commented 3 years ago

It should be straightforward to use pflogger in that layer. Might be a bit harder to only emit the message on one process.
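
As a rough illustration of the "one process" part only, here is a sketch using Python's standard logging module as a stand-in for pFlogger (it is not pFlogger's Fortran API). The rank is passed in as a plain integer; in a real run it would come from the MPI communicator, and choosing which rank counts as the "one process" inside the multigroup server is exactly the part that might be harder. The names RootOnlyFilter and make_logger are hypothetical.

```python
import logging


class RootOnlyFilter(logging.Filter):
    """Let records through only on the designated root rank."""

    def __init__(self, rank, root=0):
        super().__init__()
        self.rank = rank
        self.root = root

    def filter(self, record):
        # Every rank calls the logger; only the root rank actually emits.
        return self.rank == self.root


def make_logger(rank, handler=None):
    """Build a per-rank logger that is silent everywhere except on rank 0."""
    logger = logging.getLogger(f"ioserver.rank{rank}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger
    logger.addFilter(RootOnlyFilter(rank))
    logger.addHandler(handler or logging.StreamHandler())
    return logger


# Hypothetical usage: the same call on every rank, one line of output in total.
for rank in range(3):
    make_logger(rank).info("File %s started", "example.nc4")
```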

stale[bot] commented 2 years ago

Closing due to inactivity

mathomp4 commented 2 years ago

Re-opening as I think this is still an issue. I'll mark it as long term.

mathomp4 commented 2 years ago

We probably need to see whether @weiyuan-jiang sees this at NAS as well.