E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Occasional outrageously high T with Cess-style ne1024 on frontier (possibly bad node frontier08656) #2560

Open ndkeen opened 1 year ago

ndkeen commented 1 year ago

With a Sep 28th checkout of the cess branch, Chris T sees the error below on the ~8th day of an ne1024 case.


14896: /autofs/nccs-svm1_home1/terai/SCREAM/code_with_shoc_energy_fix_230927/components/eamxx/src/share/atm_process/atmosphere_process.cpp:442
14896: Error! Failed post-condition property check (cannot be repaired).
14896:   - Atmosphere process name: homme
14896:   - Property check name: T_mid within interval [100, 500]
14896:   - Atmosphere process MPI Rank: 14896
14896:   - Message: Check failed.
14896:   - check name: T_mid within interval [100, 500]
14896:   - field id: T_mid[Physics PG2] <double:ncol,lev>(1536,128) [K]
14896:   - minimum:
14896:     - value: 195.1
14896:     - indices (w/ global column index): (11400566,38)
14896:     - lat/lon: (19.0612, 165.74)
14896:     - additional data (w/ local column index):
14896:
14896:      phis<ncol>(1536)
14896:
14896:   phis(1525)
14896:     0,
14896:
14896:      landfrac<ncol>(1536)
14896:
14896:   landfrac(1525)
14896:     0,
14896:
14896:     END OF ADDITIONAL DATA
14896:
14896:   - maximum:
14896:     - value: 9406.78
14896:     - indices (w/ global column index): (11281791,7)
14896:     - lat/lon: (16.6279, 165.872)
14896:     - additional data (w/ local column index):

/lustre/orion/cli115/proj-shared/terai/e3sm_scratch/cess-v2-test.shoc_energy_fix.lambda_high_tuning.ne1024pg2_ne1024pg2.F2010-SCREAMv1.20230928
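
As a quick triage sketch (the exact log file name is assumed here, since only the run directory is given above), the reporting rank can be pulled from the rank prefix on each log line and tallied:

    # Count property-check failures per MPI rank, using the "rank:" prefix
    # on every line of the e3sm log (log file name assumed, -h drops filenames).
    grep -h "Failed post-condition property check" e3sm.log.* \
      | awk -F: '{print $1}' | sort -n | uniq -c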

I had also seen this crazy high T value with some previous ne1024 cases, which I noted to Chris, but because another case completed 1 day, I was hoping it was not repeatable.

With my Sep 25th checkout of the cess branch, here are some notes:

This case completed 1 day (after the first attempt hit an MPICH error during init):

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/cess-sep25/fcess-v1-cntl.ne1024pg2_ne1024pg2.F2010-SCREAMv1.cess-sep25.n2048.fr.olim-noACI

This case hits the outrageous-T error after 54 steps (again with 2 attempts hitting an MPICH error early on, during init):

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/cess-sep25/fcess-v2-cntl.ne1024pg2_ne1024pg2.F2010-SCREAMv1.cess-sep25.n2048.fd.sd4

This case hits the outrageous-T error after 16 steps:

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/cess-sep25/fcess-v2-cntl.ne1024pg2_ne1024pg2.F2010-SCREAMv1.cess-sep25.n2048.fd.sd4b

Statistics-wise: since the Sep 15th cess branch, I've had 19 total days complete with ne1024, and 843 days complete with ne256.

crterai commented 1 year ago

I'm re-running the same case to see if it reproduces the error.

crterai commented 1 year ago

In Noel's case above, I notice that the simulation that ran for 1 day has the same bfbhashes as the one that crashes after 54 steps. The run that crashes after 54 steps:

[terai@login14.frontier run]$ grep "bfbhash" e3sm.log.1449627.230927-213619 
    0: bfbhash>              0 3b4d8a9f54d920e0 (Hommexx)
    0: bfbhash>             18 1f90c6b98a867e94 (Hommexx)
    0: bfbhash>             36 687b6af3aaeb4937 (Hommexx)
    0: bfbhash>             54 4b59e08e2b2b1e76 (Hommexx)

and the one that continues:

[terai@login14.frontier run]$ zgrep "bfbhash" e3sm.log.1447414.230925-191502.gz 
    0: bfbhash>              0 3b4d8a9f54d920e0 (Hommexx)
    0: bfbhash>             18 1f90c6b98a867e94 (Hommexx)
    0: bfbhash>             36 687b6af3aaeb4937 (Hommexx)
    0: bfbhash>             54 4b59e08e2b2b1e76 (Hommexx)
    0: bfbhash>             72 db54aedaa8e0318d (Hommexx)
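
A quick way to confirm the two runs agree bit-for-bit for as long as both emit hashes is to diff the hash lines directly (bash sketch, using the same two log files as above):

    # Compare the Hommexx bfbhash sequences of the crashing and continuing runs;
    # any divergence before the crash would show up as a changed line here.
    diff <(grep "bfbhash>" e3sm.log.1449627.230927-213619) \
         <(zgrep "bfbhash>" e3sm.log.1447414.230925-191502.gz)

Given the output above, the only expected difference is the extra step-72 line from the run that kept going.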

ndkeen commented 1 year ago

Just tracking job info.

The two jobs that crashed for me:
jobid 1448414 failed with the first error message on rank 14424 (node frontier08656)
jobid 1449627 failed with the first error message on rank 14968 (node frontier08656)

Chris's job crash:
jobid 1451439 failed with the first error message on rank 14896 (node frontier08656)

My most recent ne1024 job 1451081 completed 1 day and also used frontier08656, but we know that Chris's job actually ran for over 3 days before hitting the crash.
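
As a sketch for cross-checking the job-to-node mapping from the scheduler side (assuming standard Slurm accounting via sacct is available on Frontier):

    # List the node set each suspect job actually ran on,
    # to confirm whether frontier08656 shows up in every one.
    for j in 1448414 1449627 1451439 1451081; do
      sacct -j "$j" --format=JobID,State,NodeList%40 --noheader
    done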

I did send an email to OLCF about this.

ndkeen commented 1 year ago

Ha. I was able to reproduce the issue using only 1 node with ne30. I requested to run on the "bad node".

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/cess-sep28/cess-v2-cntl.ne30pg2_ne30pg2.F2010-SCREAMv1.cess-sep28.n0001t8x111661.withfrontier08656
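
For reference, pinning the 1-node reproducer to the suspect node is just a standard Slurm option in the batch directives (a sketch; the rest of the job setup is whatever the case normally generates):

    # Request exactly the suspect node for the 1-node ne30 reproducer.
    #SBATCH --nodes=1
    #SBATCH --nodelist=frontier08656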

PeterCaldwell commented 1 year ago

I'd really like it if OLCF would take the bad node out of production while we perform our Cess runs; my recollection/understanding is that queue time is much longer if we are picky about which nodes we get. Has OLCF gotten back to us about this issue yet? Do we have a contact person about it?

ndkeen commented 1 year ago

I've only received an automated response email and see that ticket OLCFHELP-14845 was created. In general, requesting to avoid a certain node can only increase queue wait time, but I don't have a feel for how much, maybe not significantly. I did get a reply requesting more information, which I provided. OLCF replied that they will remove the node from the pool, but asked if I could say anything more about what we were doing on the node that might be causing this, which I don't really know. They have some tests, but they do not show an issue on the node.
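
For completeness, the converse, keeping a job off the suspect node until OLCF pulls it from the pool, is also a single standard Slurm directive (sketch):

    # Ask the scheduler not to place this job on the suspect node.
    #SBATCH --exclude=frontier08656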