This issue appears when running WRF in dm+sm mode. It was reported on aarch64 (Graviton3: neoverse-v1). The symptom is that WRF calls MPI_Abort but doesn't print any message. However, re-running the same input often succeeds, and failures only happen occasionally (typically on the first timestep).
Upon further investigation, it seems that a non-master thread is calling wrf_error_fatal from here: https://github.com/NCAR/noahmp/blob/release-v4.5-WRF/src/module_sf_noahmplsm.F#L1727. However, none of the messages are printed, because in wrf_message all output is guarded by an !$OMP MASTER block, and the error is being triggered from non-master threads.
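For illustration, here is a minimal standalone sketch of that guard pattern (fake_wrf_message is a hypothetical stand-in, not WRF's actual wrf_message): every thread hands a message to a routine whose output sits inside !$OMP MASTER, so only the master thread's text ever reaches stdout and error text from other threads is silently dropped.

```fortran
program master_guard_demo
  use omp_lib
  implicit none
  character(len=64) :: msg
  integer :: tid

  !$omp parallel private(tid, msg)
  tid = omp_get_thread_num()
  write(msg,'(a,i0)') 'fatal condition detected on thread ', tid
  call fake_wrf_message(msg)   ! only the master thread's text will appear
  !$omp end parallel

contains

  ! Hypothetical stand-in for the guard pattern described above: output
  ! happens only inside !$OMP MASTER, so text passed in from a non-master
  ! thread is silently dropped.
  subroutine fake_wrf_message(str)
    character(len=*), intent(in) :: str
    !$omp master
    write(*,'(a)') trim(str)
    !$omp end master
  end subroutine fake_wrf_message

end program master_guard_demo
```

Running this with several OpenMP threads prints a single line (from thread 0), which matches the observed silent abort when the fatal error originates on a non-master thread.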
With the print enabled, we found that a few grid points would occasionally lose water on the order of >0.1 but <1 kg/m^2 per timestep. Investigation into the error cause showed that the scalar terms contributing to the water balance were identical between failing and successful runs; the primary difference was in the soil moisture. Diffing the output dataset showed no corrupt-looking data, only small differences induced by the stochastic energy flux methods.
Eventually I discovered what I believe to be the root cause: calculate_soil is assigned twice within noahmplsm. First it is set to .false.; then, if a modulo test passes, it is set to .true.. However, the variable is scoped to the whole module, so all threads share the storage of calculate_soil. This leaves the potential for thread B to have already passed this initialization block and to read the value while thread A is between the .false. and .true. assignments, so thread B observes an inconsistent value of calculate_soil during the subroutine execution.
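To make the hazard concrete, here is a minimal standalone sketch, assuming nothing about the real noahmp code beyond the clear-then-conditionally-set pattern on a shared, module-scoped logical described above (soil_steps and the loop bounds are placeholders, not actual noahmp names). Because the flag is shared, one thread can observe .false. even though the condition always sets it to .true. within each iteration; whether the race actually fires on a given run depends on timing, mirroring the intermittent nature of the original failure.

```fortran
program shared_flag_race_demo
  use omp_lib
  implicit none
  ! Emulates the module-level logical in noahmplsm: a single shared storage
  ! location that every OpenMP thread reads and writes.
  logical :: calculate_soil
  integer, parameter :: soil_steps = 1   ! placeholder cadence; chosen so the
                                         ! flag should always end up .true.
  integer :: i
  integer :: observed_false

  observed_false = 0
  print '(a,i0)', 'threads: ', omp_get_max_threads()

  !$omp parallel do shared(calculate_soil) reduction(+:observed_false)
  do i = 1, 1000000
     ! The pattern described above: unconditionally clear, then maybe set.
     calculate_soil = .false.
     if (mod(i, soil_steps) == 0) calculate_soil = .true.
     ! Another thread may have just written .false. between our two
     ! assignments, so this read can observe a stale/inconsistent value.
     if (.not. calculate_soil) observed_false = observed_false + 1
  end do
  !$omp end parallel do

  print '(a,i0,a)', 'observed an inconsistent .false. ', observed_false, ' time(s)'
end program shared_flag_race_demo
```

Making the flag a local (or threadprivate) variable, or computing it once outside the threaded region, would remove the shared intermediate state that this sketch exercises.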