Closed ndkeen closed 3 years ago
In subroutine `calc_rime_density` of `micro_p3.F90`, I spotted the following:

```fortran
real(rtype) :: iTc = 0.0_rtype
real(rtype) :: Vt_qc = 0.0_rtype
real(rtype) :: D_c = 0.0_rtype
real(rtype) :: V_impact = 0.0_rtype
real(rtype) :: Ri = 0.0_rtype
```

where the compiler will assume an implicit `save` attribute. When I replace with simply:

```fortran
real(rtype) :: iTc
real(rtype) :: Vt_qc
real(rtype) :: D_c
real(rtype) :: V_impact
real(rtype) :: Ri

iTc = 0.0_rtype
Vt_qc = 0.0_rtype
D_c = 0.0_rtype
V_impact = 0.0_rtype
Ri = 0.0_rtype
```
I am able to get BFB behavior with several tests. I'm still not completely sure whether the non-BFB behavior is fixed by this change or not, as I need to think about what happens in the routine when those variables are saved. Though getting BFB results with ne4 and ne30 tests is hard to beat.
Nice! This is a bug because a `save` variable in a threaded region is shared among threads unless `!$omp threadprivate(...)` is used, and the decl assignments make these variables implicitly `save`.
Yeah, so if it were just an issue with the Fortran implicit save attribute alone, we would also see a problem without threading. It makes sense that it has to be something that only causes an issue when threads are enabled. Not sure why it always seemed to be incorrect with DEBUG but not without, as I would think the issue is still there.
Re: DEBUG vs not, I think the answer is similar to the one given here: https://github.com/E3SM-Project/E3SM/issues/3628#issuecomment-734003698. Basically, the OpenMP memory model is such that the optimizer pass of the compiler will choose to keep and use local copies of the `save` variables until the end of the computation, so the computation in this case gives the right answer.
I ran a non-DEBUG ne256 case for 24 steps with and without this change and indeed the results are BFB.
I ran some ne256 tests with DEBUG, SVML flag, and the implicit save fix on theta.
These 2 passed:
PEM_D_Ln48_P16384x2.ne256pg2_r0125_oRRS18to6v3.F2010-SCREAM-HR-DYAMOND2
REP_D_Ln48_P16384x2.ne256pg2_r0125_oRRS18to6v3.F2010-SCREAM-HR-DYAMOND2
The similar ERS test did not complete and it's not clear why; it is not obvious whether it ran out of time or hung. But this is further confidence that the implicit save fix is a good idea.
Most recent work has been on KNL (cori and theta), but I did verify this DEBUG/thread/scream test also fails on compy. I see the below tests fail with the dyamond2 branch as well as scream master as of March 3rd. Without DEBUG, I have not yet encountered non-BFB issues on cori-knl. With DEBUG and using threading, it looks like there is an issue. I can run tests using `ne4pg2_ne4pg2.FC5AV1C-L` that pass, so it's also contained to scream compsets. Using tests such as these will fail:
I don't think the particular compset matters (ie F2010/F2000 or LR/HR). The default PE layout on cori-knl for these ne4 problems is single-threaded 96 MPI's (as there are only 96 elements). Using `PMx2` forces threads (I also verified the same issue using fewer MPI's, ie 48 MPI's with 2 threads each). Note that I also see these same non-BFB issues on Theta. However, because we also see non-BFB results even when building without DEBUG (!) on Theta, I was going to treat that as a different issue for now. On Theta, the non-BFB behavior with DEBUG is also present (with the dyamond2 branch and scream master as of March 1st).
Here is a confluence page that I was using to document some details, but maybe it is better to use github: https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/2377121851/Tracking+errors+and+non-BFB+issues+with+dyamond2+branch
When I run 2 different cases and only change the number of MPI's (which is what PEM test should be doing), I see that the values printed in atm.log for the 2 cases diverge quickly -- at ATM step=1.
I am running the following 4 tests where I will compare results of 64 MPI's and 96 MPI's (with 2 threads) for only 2 days:
1) normal except with rrtmg instead of rrtmgp -- results are not BFB
2) normal except with clubb_sgs instead of shoc_sgs -- results not BFB
3) normal except with mg2 instead of p3 -- results ARE BFB
4) use all older options together: rrtmg, clubb, mg2 -- results ARE BFB
Therefore, because of (3), I think we can say that the source of non-bfb is within P3 as using shoc/mg2/rrtmgp seems OK.
Testing with older repos, I found that the same is true (ie results not BFB with P3 and DEBUG/threads) with scream repos of Dec 11th 2020, Oct 22nd 2020, and July 21st 2020. Note that it was only recently that we could run with DEBUG and threads due to the MPI_Waitall bug (fixed in December https://github.com/E3SM-Project/E3SM/pull/3974).
I have been running many experiments on cori-knl where I change the compiler flags. If I build all code with DEBUG=TRUE, but then use `-O2` or `-O1` instead of `-O0` (as well as remove `-fpe0`), the ne4 tests look BFB.

Again on cori-knl, using changes to `Depends.intel.cmake`, I then narrowed down that the `micro_p3.F90` source file seems to be where the non-BFB issue arises. If I build all code with "normal" DEBUG flags, but build `micro_p3.F90` with `-O1` or `-O2` instead of `-O0`, the results are the same (when comparing ne4 64x2 vs 96x2 after 2 days). I also need to remove the `-fpe0` flag. Still doesn't make sense.

I verified that when I build with DEBUG=TRUE, but change the flags for `micro_p3.F90` only (to use `-O1` and remove `-fpe0`), the test I've been using is BFB, as well as the following "standard" tests:

By using openmp critical sections, I can narrow down the issue a little further. If the main loop in subroutine `p3_main_part2` in `micro_p3.F90` is within a thread critical section, the results are BFB (at least with the ne4 tests I've been trying). I tried to narrow down further within the loop, but it may not be straightforward, as I suspect there is a thread-race-like issue there and any thread-related change can change the outcome.