I'm re-running the same case to see if it reproduces the error.
In Noel's case above, I notice that the simulation that ran for 1 day has the same bfbhashes as the one that crashes after 54 steps.
[terai@login14.frontier run]$ grep "bfbhash" e3sm.log.1449627.230927-213619
0: bfbhash> 0 3b4d8a9f54d920e0 (Hommexx)
0: bfbhash> 18 1f90c6b98a867e94 (Hommexx)
0: bfbhash> 36 687b6af3aaeb4937 (Hommexx)
0: bfbhash> 54 4b59e08e2b2b1e76 (Hommexx)
and the one that continues:
[terai@login14.frontier run]$ zgrep "bfbhash" e3sm.log.1447414.230925-191502.gz
0: bfbhash> 0 3b4d8a9f54d920e0 (Hommexx)
0: bfbhash> 18 1f90c6b98a867e94 (Hommexx)
0: bfbhash> 36 687b6af3aaeb4937 (Hommexx)
0: bfbhash> 54 4b59e08e2b2b1e76 (Hommexx)
0: bfbhash> 72 db54aedaa8e0318d (Hommexx)
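The hash sequences can be compared directly; a quick sketch using the two log names from the excerpts above (awk just keeps the step and hash columns):
# Pull "<step> <hash>" pairs out of each log and diff them; matching
# prefixes mean the runs were bit-for-bit identical up to the crash.
grep  "bfbhash" e3sm.log.1449627.230927-213619    | awk '{print $3, $4}' > hashes_crashed.txt
zgrep "bfbhash" e3sm.log.1447414.230925-191502.gz | awk '{print $3, $4}' > hashes_completed.txt
diff hashes_crashed.txt hashes_completed.txt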
Just tracking job info.
The two jobs that crashed for me:
jobid 1448414 failed with its first error message on rank 14424 on node frontier08656
jobid 1449627 failed with its first error message on rank 14968 on node frontier08656
Chris's job crash:
jobid 1451439 failed with its first error message on rank 14896 on node frontier08656
My most recent ne1024 job, 1451081, completed 1 day and also used frontier08656, but we know that Chris's job actually ran over 3 days before hitting the crash (a quick check of which jobs landed on that node is sketched below).
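For bookkeeping, a rough way to confirm which of these jobs actually landed on frontier08656 is to expand each job's Slurm node list -- a sketch, assuming sacct/scontrol on Frontier behave like standard Slurm:
# Expand each job's node list and check for the suspect node (1 = present, 0 = absent).
for j in 1448414 1449627 1451439 1451081; do
  nodes=$(sacct -j "$j" -X --noheader --format=NodeList%200 | tr -d ' ')
  printf '%s on frontier08656? ' "$j"
  scontrol show hostnames "$nodes" | grep -c '^frontier08656$'
done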
I did send an email to OLCF about this.
Ha. I was able to reproduce the issue using only 1 node with ne30. I requested to run on the "bad node".
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/cess-sep28/cess-v2-cntl.ne30pg2_ne30pg2.F2010-SCREAMv1.cess-sep28.n0001t8x111661.withfrontier08656
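For reference, targeting or avoiding a specific node is just the standard Slurm batch options; a minimal sketch (not necessarily how this case was actually submitted):
#SBATCH --nodes=1
#SBATCH --nodelist=frontier08656   # force the allocation onto the suspect node for reproduction
# ...or, for production runs, keep the job off of it instead:
#SBATCH --exclude=frontier08656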
I'd really like it if OLCF would take the bad node out of production while we perform our Cess runs - my recollection/understanding is that queue time is much longer if we are picky about what nodes we get. Has OLCF gotten back to us about this issue yet? Do we have a contact person about it?
I've only received an automated response email and see that ticket OLCFHELP-14845 was created. In general, requesting to avoid a certain node can only increase queue wait time, but I don't have a feel for how much -- maybe not significantly. I did get a reply requesting more information, which I provided. OLCF replied that they will remove the node from the pool, but asked if I could say anything more about what we were doing on the node that might be causing this -- which I don't really know. They ran some tests, but those do not show an issue on the node.
With a Sep28th checkout of the cess branch, Chris T sees the error below on ~8th day of an ne1024 case.
I had also seen this crazy-high T value with some previous ne1024 cases, which I noted to Chris, but because another case completed 1 day, I was hoping it was not repeatable.
With my Sep25th checkout of the cess branch, here are some notes:
This case completed 1 day (after the first attempt hit an MPICH error during init)
This case hits the outrageous-T error after 54 steps (again with 2 attempts hitting an MPICH error early on -- during init):
This case hits the outrageous-T error after 16 steps:
Statistics-wise: since the Sep15th cess branch, I've had 19 total days complete with ne1024, and 843 days complete with ne256.