E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
351 stars 360 forks source link

get_proc_bounds called from threaded region #1369

Open amametjanov opened 7 years ago

amametjanov commented 7 years ago

If there is an error in components/clm/src/biogeophys/BalanceCheckMod.F90, there is a call GetGlobalIndex which is expected to be called from non-threaded regions: GetGlobalIndex calls get_proc_bounds, which throws an abort.

The full stack trace is https://gist.github.com/amametjanov/0b67cf3f9ccec5b67b986ca1d2e43bb6 .

Possible fixes are to either remove the check in get_proc_bounds_new or surround the call sites with !$omp single ... !$omp end single.

bishtgautam commented 7 years ago

@amametjanov: This is a good catch. I'm surprised it wasn't caught early. For the long term fix, I will have a discussion with NCAR CSEG. In the meantime, the easiest fix I can think of is to disable the call to GetGlobalIndex().

bishtgautam commented 7 years ago

This bug has been reported to NCAR CSEG as bug-2439.

amametjanov commented 7 years ago

Agreed about disabling the call to GetGlobalIndex. But acme.log also had the I/O recursion detected message:

13296:  WARNING:  snow balance error  
13296: XL Fortran (I/O initialization): I/O recursion detected.
2017-04-05 05:45:16.431 (WARN ) [0xfffb1aaca70] MIR-08000-3B7F1-4096:2236325:ibm.runjob.client.Job: terminated by signal 6
2017-04-05 05:45:16.431 (WARN ) [0xfffb1aaca70] MIR-08000-3B7F1-4096:2236325:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 13296

which @jayeshkrishna recommends to fix by surrounding the write statements with !$omp master ... !$omp end master. Could you push a fix to master? Needed by @PeterCaldwell.

BTW, any idea why an ne120-A_WCYCL run is producing 'WARNING: snow balance error ' and how to avoid it?

bishtgautam commented 7 years ago

A shortcut fix is now open https://github.com/ACME-Climate/ACME/pull/1370

amametjanov commented 7 years ago

Any news on a possible resolution? I'm seeing snow balance error messages in the runs that hang with 4x16 PE layouts for LND.

Only one thread can be doing IO on a given unit at a time. A proposed patch for review is here: https://gist.github.com/amametjanov/48c7f1d6eb9a97391a97cb0155c9604e.

amametjanov commented 7 years ago

One more related item from the same location: acme.log sometimes contains messages like

3625:  BalanceCheck: soil balance error (mm) 
3625:  nstep         =  420  
3625:  errsoi_col    =  -0.284474305140369060E-02
3625:  clm model is stopping

but the run continues because the call to endrun is commented out: https://github.com/ACME-Climate/ACME/blob/master/components/clm/src/biogeophys/BalanceCheckMod.F90#L673 . So this must be a warning instead of an error?

thorntonpe commented 7 years ago

I have some concerns about having this balance check endrun commented out. We should turn it back on, and if we are concerned about fatal errors for trivial balance problems then we should loosen the tolerance.

bishtgautam commented 7 years ago

@amametjanov: Regarding the proposed patch, can the !$omp critical be moved within the if (found) loop?

@thorntonpe: In clm4_5_1_r082 tag, the call endrun() on BalanceCheckMod.F90#L677 was commented out.

amametjanov commented 7 years ago

It's not allowed to branch into or out of a critical block: either all threads encounter it or none. It looks like the found condition might not be true for all threads and so had to surround the whole if-block with that.

Another alternative is to call a single write with a format statement like write(iulog,format_id): e.g. https://github.com/ACME-Climate/ACME/blob/master/components/cam/src/physics/cam/qneg3.F90#L121. But that is also error-prone, because, as the number of on-node MPIxOMP PEs increases, the chance of multiple PEs writing to the same iulog unit and causing an error or a hang goes up.

bishtgautam commented 7 years ago

I should have read the restrictions section for OMP Critical more carefully. Your propsed solution looks good to me. Do you want to issue a PR or should I?

amametjanov commented 7 years ago

Since balance check warning messages are relatively rare, let me test a lighter-weight approach of a single write call without any synchronization: patch here. If this + disabling all other diagnostic messages removes the hang/deadlock, will proceed with a PR. Thanks.

bishtgautam commented 7 years ago

Probably adding //NEW_LINE('')// would split the output into mulitple lines and make it easire to read.

amametjanov commented 7 years ago

Yes, but new_line is an intrinsic function added in Fortran 2003 standard. Didn't want to add a new constraint. Would //achar(10)// (F77) be okay? Or does it need to be //achar(10)//achar(13)//?

bishtgautam commented 7 years ago

//achar(10)// should be sufficient. Thanks.