Open amametjanov opened 7 years ago
@amametjanov: This is a good catch. I'm surprised it wasn't caught early. For the long term fix, I will have a discussion with NCAR CSEG. In the meantime, the easiest fix I can think of is to disable the call to GetGlobalIndex().
This bug has been reported to NCAR CSEG as bug-2439.
Agreed about disabling the call to GetGlobalIndex. But acme.log also had the I/O recursion detected
message:
13296: WARNING: snow balance error
13296: XL Fortran (I/O initialization): I/O recursion detected.
2017-04-05 05:45:16.431 (WARN ) [0xfffb1aaca70] MIR-08000-3B7F1-4096:2236325:ibm.runjob.client.Job: terminated by signal 6
2017-04-05 05:45:16.431 (WARN ) [0xfffb1aaca70] MIR-08000-3B7F1-4096:2236325:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 13296
which @jayeshkrishna recommends to fix by surrounding the write statements with !$omp master ... !$omp end master. Could you push a fix to master? Needed by @PeterCaldwell.
BTW, any idea why an ne120-A_WCYCL run is producing 'WARNING: snow balance error '
and how to avoid it?
A shortcut fix is now open https://github.com/ACME-Climate/ACME/pull/1370
Any news on a possible resolution? I'm seeing snow balance error
messages in the runs that hang with 4x16 PE layouts for LND.
Only one thread can be doing IO on a given unit at a time. A proposed patch for review is here: https://gist.github.com/amametjanov/48c7f1d6eb9a97391a97cb0155c9604e.
One more related item from the same location: acme.log sometimes contains messages like
3625: BalanceCheck: soil balance error (mm)
3625: nstep = 420
3625: errsoi_col = -0.284474305140369060E-02
3625: clm model is stopping
but the run continues because the call to endrun
is commented out: https://github.com/ACME-Climate/ACME/blob/master/components/clm/src/biogeophys/BalanceCheckMod.F90#L673 . So this must be a warning instead of an error?
I have some concerns about having this balance check endrun commented out. We should turn it back on, and if we are concerned about fatal errors for trivial balance problems then we should loosen the tolerance.
@amametjanov: Regarding the proposed patch, can the !$omp critical
be moved within the if (found)
loop?
@thorntonpe: In clm4_5_1_r082 tag, the call endrun()
on BalanceCheckMod.F90#L677 was commented out.
It's not allowed to branch into or out of a critical block: either all threads encounter it or none. It looks like the found
condition might not be true for all threads and so had to surround the whole if-block with that.
Another alternative is to call a single write with a format statement like write(iulog,format_id)
: e.g. https://github.com/ACME-Climate/ACME/blob/master/components/cam/src/physics/cam/qneg3.F90#L121. But that is also error-prone, because, as the number of on-node MPIxOMP PEs increases, the chance of multiple PEs writing to the same iulog unit and causing an error or a hang goes up.
I should have read the restrictions section for OMP Critical more carefully. Your propsed solution looks good to me. Do you want to issue a PR or should I?
Since balance check warning messages are relatively rare, let me test a lighter-weight approach of a single write
call without any synchronization: patch here. If this + disabling all other diagnostic messages removes the hang/deadlock, will proceed with a PR. Thanks.
Probably adding //NEW_LINE('')//
would split the output into mulitple lines and make it easire to read.
Yes, but new_line
is an intrinsic function added in Fortran 2003 standard. Didn't want to add a new constraint. Would //achar(10)//
(F77) be okay? Or does it need to be //achar(10)//achar(13)//
?
//achar(10)//
should be sufficient. Thanks.
If there is an error in components/clm/src/biogeophys/BalanceCheckMod.F90, there is a call
GetGlobalIndex
which is expected to be called from non-threaded regions: GetGlobalIndex calls get_proc_bounds, which throws an abort.The full stack trace is https://gist.github.com/amametjanov/0b67cf3f9ccec5b67b986ca1d2e43bb6 .
Possible fixes are to either remove the check in get_proc_bounds_new or surround the call sites with !$omp single ... !$omp end single.