Open samsrabin opened 1 month ago
@samsrabin floating underflow is something that we should expect in CTSM as a natural occurrence: a value got so small that it was truncated to exactly zero. In some codes that could be a problem, but it's not something we manage. So the underflow isn't the real issue here.
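As a small illustration of the underflow described above, here is a sketch in Python (not CTSM code, which is Fortran). It assumes IEEE 754 double precision and shows a product that is too small to represent becoming exactly zero, without any error being raised.

```python
import sys

# Smallest positive normal double, roughly 2.2e-308.
tiny = sys.float_info.min

# The exact product (about 5e-616) is far below anything a double can hold,
# so the result underflows and is stored as exactly zero.
product = tiny * tiny

print(tiny > 0.0)       # the input itself is a valid nonzero number
print(product == 0.0)   # the result has underflowed to zero
```

The calculation carries on with zero in place of the tiny value, which is why an underflow warning on its own usually does not indicate a failure.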
I think you are talking about the cases that close with MPI timeout launcher errors, as discussed for example in #1317, right? My suspicion is an MPI race condition that only happens randomly. There are also cases where the MPI timeout launcher error is a legit issue in the code.
I wanted to make sure we are talking about the same thing; if so, I think we should change the title. We can also talk about this tomorrow, as it has "next" on it.
Yeah, that's what I'm talking about, although the messaging looks different. What I could do is just add the new messaging in a comment to that issue and close this one, so future searches will find it.
Sounds good. The tail of cesm.log for a case I just resubmitted looks like this:
[1] 208 at [0x000000000d143160], src/mpid/ch3/src/mpid_vc.c[110]
[1] 96 at [0x000000000d153480], src/util/procmap/local_proc.c[93]
[1] 96 at [0x000000000d143f10], src/util/procmap/local_proc.c[92]
[1] 208 at [0x000000000d153310], src/mpid/ch3/src/mpid_vc.c[110]
[1] 96 at [0x00000000084db040], src/util/procmap/local_proc.c[93]
[1] 96 at [0x00000000084daf40], src/util/procmap/local_proc.c[92]
[1] 504 at [0x0000000008212700], src/mpi/comm/commutil.c[328]
[1] 504 at [0x0000000008212460], src/mpi/comm/commutil.c[328]
[1] 504 at [0x0000000008211f20], src/mpi/comm/commutil.c[328]
[1] 208 at [0x000000000820f6b0], src/mpid/ch3/src/mpid_vc.c[110]
Warning: Floating underflow occurred
[mpiexec@i023.cgd.ucar.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[mpiexec@i023.cgd.ucar.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@i023.cgd.ucar.edu] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@i023.cgd.ucar.edu] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
If I search for issues or PRs with "launcher returned error waiting for completion" I find a few that cover it.
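For triaging a failed case, a minimal sketch along these lines could report which of the two messages from the log tail above appear in a log. The function name `triage_log` and the returned dictionary keys are hypothetical; only the two search strings come from the log shown here.

```python
# Message strings copied from the cesm.log tail in this thread.
UNDERFLOW_MSG = "Warning: Floating underflow occurred"
LAUNCHER_MSG = "launcher returned error waiting for completion"


def triage_log(text: str) -> dict:
    """Report which known messages appear anywhere in the log text.

    Hypothetical helper for illustration; not part of CTSM or CIME.
    """
    lines = text.splitlines()
    return {
        "underflow_warning": any(UNDERFLOW_MSG in line for line in lines),
        "launcher_error": any(LAUNCHER_MSG in line for line in lines),
    }


# Two lines taken from the log tail above, used as sample input.
sample = (
    "Warning: Floating underflow occurred\n"
    "[mpiexec@i023.cgd.ucar.edu] HYDT_bsci_wait_for_completion "
    "(tools/bootstrap/src/bsci_wait.c:23): "
    "launcher returned error waiting for completion\n"
)
result = triage_log(sample)
print(result)
```

To check a real case, the contents of its cesm.log could be read into a string and passed to `triage_log`. A log showing the launcher error is then a candidate for the random MPI failure discussed here rather than a problem caused by the underflow itself.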
Actually, I'm just going to leave this one open. The other issue looked like it was a consistent thing, whereas now it's random. I'll change the title.
I also encountered floating underflow in this test in my testing of PR #2806:
ERI_D_Ld9_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-reduceOutput
Brief summary of bug
Some Izumi nag tests sometimes fail in the run phase with "Warning: Floating underflow occurred" in cesm.log. Re-submitting usually fixes it.
General bug information
CTSM version you are using: ctsm5.3.002 (but this has happened to me before, not just with this tag)
Does this bug cause significantly incorrect results in the model's science? No?
Configurations affected: Izumi nag
Details of bug
Affected tests (today, that is):
Unfortunately there's no useful traceback, so I'm not sure what's going on. However, it always happens after the "CTSM: end of main integration loop" message is printed in lnd.log.