Open mathomp4 opened 3 years ago
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.
@tclune @weiyuan-jiang Do you think this is possible with logger?
It should be straightforward to use pflogger in that layer. Might be a bit harder to only emit the message on one process.
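Since pFlogger is modeled on Python's `logging` module, the "only emit on one process" idea can be sketched with a rank-based filter. This is a hypothetical illustration, not the pFlogger API itself: the `RankFilter` class and `make_logger` helper are made-up names, and the MPI rank is passed in explicitly rather than queried from `MPI_Comm_rank`.

```python
import logging

class RankFilter(logging.Filter):
    """Drop log records unless this process is the designated rank.

    Hypothetical sketch: in a real MPI job the rank would come from
    the communicator; here it is supplied by the caller.
    """
    def __init__(self, rank, emit_rank=0):
        super().__init__()
        self.rank = rank
        self.emit_rank = emit_rank

    def filter(self, record):
        # Returning False suppresses the record on all other ranks.
        return self.rank == self.emit_rank

def make_logger(name, rank):
    """Build a logger that only emits on rank 0 (by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addFilter(RankFilter(rank))
    return logger
```

The same pattern would apply in Fortran: attach a filter (or guard the call) so that only one rank in the IO server's communicator actually writes the message.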
Closing due to inactivity
Re-opening as I think this is still an issue. I'll mark it long term.
Probably need to see if @weiyuan-jiang sees this as well at NAS
This is a query for @weiyuan-jiang with a bit of pinging of @tclune as it's pFlogger related. (And maybe the rest of @GEOS-ESM/mapl-team ...)
The issue is something I've been seeing at NAS. I was able to run a C720 job at 15x360 for 1 day with the "regular" AGCM history on the AMD Rome nodes using:
on a suggestion of @weiyuan-jiang
But, as a test, I decided to do something like what Dimitris has been running (@atrayano knows about this), which is a long job with 15-minute collections. So I took `geosgcm_prog` and copied it to make an additional `geosgcm_prog_15mn` collection which outputs at 15-minute intervals. I then tried a 5-model-day run with the 15-minute collection using:
(same as above) and the job crashed after about a model day (at 2015/04/15 21z-ish) in a "silent" way. NAS Support said it looked like a node had rebooted.
So, I resubmitted with 6 nodes, 20 backend:
This time it ran for about 4 model days and then crashed at 2015/04/18 22z-ish. Per Johnny Chang of NAS:
Maybe the added 15-minute collections are causing a "pile up" on the IO server nodes? So, what I wondered is: is there a way to get pflogger to write out something like:

in a log file somewhere? It's hard to debug the IO nodes, so maybe messages like that could tell us what's happening. Then we could search the log file and see whether a whole lot of files have started writing but never finished.
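If such "started writing" / "finished writing" messages existed, pairing them up would be a small scripting job. Here is a sketch that assumes a hypothetical log format (the `Started writing <file>` / `Finished writing <file>` wording is invented for illustration, not something pFlogger emits today):

```python
import re

# Hypothetical log lines, e.g.:
#   "pfio: Started writing geosgcm_prog_15mn.20150415_2100z.nc4"
#   "pfio: Finished writing geosgcm_prog_15mn.20150415_2100z.nc4"
START = re.compile(r"Started writing (\S+)")
FINISH = re.compile(r"Finished writing (\S+)")

def unfinished_writes(lines):
    """Return files whose write was started but never finished."""
    pending = set()
    for line in lines:
        if m := START.search(line):
            pending.add(m.group(1))
        elif m := FINISH.search(line):
            pending.discard(m.group(1))
    return sorted(pending)
```

Running that over the IO server's log after a crash would show whether many collections were mid-write when the node went down, which would support the "pile up" theory.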
Then again, if I look, I do see:

in the `scratch` directory and it has actual data! So... 🤷🏼‍♂️