MPI timeout (?) sometimes in Izumi nag tests

ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)

http://www.cesm.ucar.edu/models/cesm2.0/land/

MPI timeout (?) sometimes in Izumi nag tests #2800

Open samsrabin opened 3 weeks ago

samsrabin commented 3 weeks ago

Brief summary of bug

Some Izumi nag tests sometimes fail in the run phase with Warning: Floating underflow occurred in cesm.log. Re-submitting usually fixes it.

General bug information

CTSM version you are using: ctsm5.3.002 (but this has happened to me before, not just with this tag)

Does this bug cause significantly incorrect results in the model's science? No?

Configurations affected: Izumi nag

Details of bug

Affected tests (today, that is):

ERI_D_Ld9_P48x1.f10_f10_mg37.I2000Clm50Sp.izumi_nag.clm-reduceOutput
ERP_D_Ld9.f10_f10_mg37.I1850Clm60BgcCrop.izumi_nag.clm-clm60cam7LndTuningModeLDust
ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream
SMS_D_Ld5.f45_f45_mg37.I2000Clm60Fates.izumi_nag.clm-FatesCold

Unfortunately there's no useful traceback, so I'm not sure what's going on. However, it always happens after the CTSM: end of main integration loop message is printed in lnd.log.

ekluzek commented 3 weeks ago

@samsrabin floating underflow is something we should expect in CTSM as a natural occurrence: something just got so small that it was truncated to exactly zero. In some codes that could be a problem, but it's not something we manage. So the underflow isn't the real issue here.
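To illustrate what "truncated to exactly zero" means, here is a minimal Python sketch of IEEE double-precision underflow. It is not CTSM code; the helper name `underflows_to_zero` and the sample values are made up for illustration. Python does not print a warning when this happens, whereas the warning in cesm.log presumably reflects the same kind of event being flagged by the nag build.

```python
import sys


def underflows_to_zero(x: float, y: float) -> bool:
    """Return True when x * y is exactly 0.0 although neither factor is zero.

    Hypothetical helper for illustration only; not part of CTSM.
    """
    return x != 0.0 and y != 0.0 and x * y == 0.0


# Smallest positive normal double (about 2.2e-308).
smallest_normal = sys.float_info.min
# Smallest positive subnormal double, 2**-1074 (about 4.9e-324).
smallest_subnormal = smallest_normal * sys.float_info.epsilon

# The true product 1e-400 is far below the representable range, so it becomes 0.0.
tiny_product = 1e-200 * 1e-200

# Half of the smallest subnormal cannot be represented and rounds to 0.0.
halved = smallest_subnormal / 2

print(tiny_product, halved, underflows_to_zero(1e-200, 1e-200))
```

The result is a harmless zero rather than a crash, which is consistent with the point that the underflow warning is not what kills the run.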

I think you are talking about the cases that close with MPI timeout launcher errors, as discussed for example in #1317, right? My suspicion is an MPI race condition that only happens randomly. There are also cases where the MPI timeout launcher error reflects a legitimate issue in the code.

I wanted to make sure we are talking about the same thing, because if so, I think we should change the title. We can also talk about this tomorrow, since it has "next" on it.

samsrabin commented 3 weeks ago

Yeah, that's what I'm talking about, although the messaging looks different. What I could do is just add the new messaging in a comment to that issue and close this one, so future searches will find it.

ekluzek commented 3 weeks ago

Sounds good. The tail of cesm.log for a case I just resubmitted looks like this:

[1] 208 at [0x000000000d143160], src/mpid/ch3/src/mpid_vc.c[110]
[1] 96 at [0x000000000d153480], src/util/procmap/local_proc.c[93]
[1] 96 at [0x000000000d143f10], src/util/procmap/local_proc.c[92]
[1] 208 at [0x000000000d153310], src/mpid/ch3/src/mpid_vc.c[110]
[1] 96 at [0x00000000084db040], src/util/procmap/local_proc.c[93]
[1] 96 at [0x00000000084daf40], src/util/procmap/local_proc.c[92]
[1] 504 at [0x0000000008212700], src/mpi/comm/commutil.c[328]
[1] 504 at [0x0000000008212460], src/mpi/comm/commutil.c[328]
[1] 504 at [0x0000000008211f20], src/mpi/comm/commutil.c[328]
[1] 208 at [0x000000000820f6b0], src/mpid/ch3/src/mpid_vc.c[110]
Warning: Floating underflow occurred
[mpiexec@i023.cgd.ucar.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[mpiexec@i023.cgd.ucar.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@i023.cgd.ucar.edu] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@i023.cgd.ucar.edu] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion

If I search for issues or PRs with "launcher returned error waiting for completion" I find a few that cover it.
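For anyone triaging these failures, the same search can be applied to the log itself. The following is a rough sketch, assuming the two message strings quoted in the log tail above are the signatures of interest; the function name `classify_log_tail` and its return labels are hypothetical and not part of CIME or CTSM.

```python
# Message strings copied from the cesm.log tail shown above.
LAUNCHER_ERROR = "launcher returned error waiting for completion"
UNDERFLOW_WARNING = "Warning: Floating underflow occurred"


def classify_log_tail(text: str) -> str:
    """Label a log tail by which known message it contains.

    The launcher error takes priority, since the underflow warning
    alone is expected and does not indicate a failure by itself.
    """
    if LAUNCHER_ERROR in text:
        return "launcher-error"
    if UNDERFLOW_WARNING in text:
        return "underflow-warning-only"
    return "no-known-signature"


# Two lines taken from the log tail above, used as sample input.
sample = (
    "Warning: Floating underflow occurred\n"
    "[mpiexec@i023.cgd.ucar.edu] HYDT_bsci_wait_for_completion "
    "(tools/bootstrap/src/bsci_wait.c:23): "
    "launcher returned error waiting for completion\n"
)

print(classify_log_tail(sample))
```

Checking for the launcher message first keeps a run that only shows the underflow warning from being counted as one of these failures.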

samsrabin commented 3 weeks ago

Actually, I'm just going to leave this one open. The other issue looked like it was a consistent thing, whereas now it's random. I'll change the title.

olyson commented 2 weeks ago

I also encountered floating underflow in this test while testing PR #2806: ERI_D_Ld9_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-reduceOutput