Closed ndkeen closed 3 years ago
In subroutine `calc_rime_density` of `micro_p3.F90`, I spotted the following:

```fortran
real(rtype) :: iTc = 0.0_rtype
real(rtype) :: Vt_qc = 0.0_rtype
real(rtype) :: D_c = 0.0_rtype
real(rtype) :: V_impact = 0.0_rtype
real(rtype) :: Ri = 0.0_rtype
```

where the compiler will assume an implicit `save` attribute. When I replace with simply:

```fortran
real(rtype) :: iTc
real(rtype) :: Vt_qc
real(rtype) :: D_c
real(rtype) :: V_impact
real(rtype) :: Ri

iTc = 0.0_rtype
Vt_qc = 0.0_rtype
D_c = 0.0_rtype
V_impact = 0.0_rtype
Ri = 0.0_rtype
```
I am able to get BFB behavior with several tests. I'm still not completely sure whether the non-BFB behavior is fixed by this change or not, as I need to think about what happens in the routine when those variables are saved. Though getting BFB results with ne4 and ne30 tests is hard to beat.
Nice! This is a bug because a `save` variable in a threaded region is shared among threads unless `!$omp threadprivate(...)` is used, and the decl assignments make these variables implicitly `save`.
Yeah, so if it were just an issue with the Fortran implicit save attribute alone, we would also see a problem without threading. It makes sense that it has to be something that only causes an issue when threads are enabled. Not sure why it always seemed to be incorrect with DEBUG but not without, as I would think the issue is still there.
Re: DEBUG vs not, I think the answer is similar to the one given here: https://github.com/E3SM-Project/E3SM/issues/3628#issuecomment-734003698. Basically, the OpenMP memory model is such that the optimizer pass of the compiler will choose to keep and use local copies of the `save` variables until the end of the computation, so the computation in this case gives the right answer.
I ran a non-DEBUG ne256 case for 24 steps with and without this change and indeed the results are BFB.
I ran some ne256 tests with DEBUG, SVML flag, and the implicit save fix on theta.
These 2 passed:
PEM_D_Ln48_P16384x2.ne256pg2_r0125_oRRS18to6v3.F2010-SCREAM-HR-DYAMOND2
REP_D_Ln48_P16384x2.ne256pg2_r0125_oRRS18to6v3.F2010-SCREAM-HR-DYAMOND2
The similar ERS test did not complete and it's not clear why; it is not obvious whether it ran out of time or hung. But this is further confidence that the implicit save fix is a good idea.
Most recent work has been on KNL (cori and theta), but I did verify this DEBUG/thread/scream test also fails on compy. I see the below tests fail with the dyamond2 branch as well as scream master as of March 3rd. Without DEBUG, I have not yet encountered non-BFB issues on cori-knl. With DEBUG and using threading, it looks like there is an issue. I can run tests using `ne4pg2_ne4pg2.FC5AV1C-L` that pass, so it's also contained to scream compsets. Using tests such as these will fail:
I don't think the particular compset matters (ie F2010/F2000 or LR/HR). The default PE layout on cori-knl for these ne4 problems is single-threaded 96 MPI's (as there are only 96 elements). Using `PMx2` forces threads (I also verified the same issue using fewer MPI's, ie 48 MPI's with 2 threads each). Note that I also see these same non-BFB issues on Theta. However, because we also see non-BFB results even when building without DEBUG (!) on Theta, I was going to treat that as a different issue for now. On Theta, the non-BFB behavior with DEBUG is also present (with the dyamond2 branch and scream master as of March 1st).
Here is a confluence page that I was using to document some details, but maybe it is better to use github: https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/2377121851/Tracking+errors+and+non-BFB+issues+with+dyamond2+branch
When I run 2 different cases and only change the number of MPI's (which is what PEM test should be doing), I see that the values printed in atm.log for the 2 cases diverge quickly -- at ATM step=1.
I am running the following 4 tests where I will compare results of 64 MPI's and 96 MPI's (with 2 threads) for only 2 days:
1) normal except with rrtmg instead of rrtmgp -- results are not BFB
2) normal except with clubb_sgs instead of shoc_sgs -- results not BFB
3) normal except with mg2 instead of p3 -- results ARE BFB
4) use all older options together: rrtmg, clubb, mg2 -- results ARE BFB
Therefore, because of (3), I think we can say that the source of non-bfb is within P3 as using shoc/mg2/rrtmgp seems OK.
Testing with older repos, I found that the same is true (ie results not BFB with P3 and DEBUG/threads) with scream repos of Dec 11th 2020, Oct 22nd 2020, and July 21st 2020. Note that it was only recently that we could run with DEBUG and threads due to the MPI_Waitall bug (fixed in December https://github.com/E3SM-Project/E3SM/pull/3974).
I have been running many experiments on cori-knl where I change the compiler flags. If I build all code with DEBUG=TRUE, but then use `-O2` or `-O1` instead of `-O0` (as well as remove `-fpe0`), the ne4 tests look BFB.

Again on cori-knl, using changes to `Depends.intel.cmake`, I then narrowed down that the `micro_p3.F90` source file seems to be where the non-BFB issue arises. If I build all code with "normal" DEBUG flags, but build `micro_p3.F90` with `-O1` or `-O2` instead of `-O0`, the results are the same (when comparing ne4 64x2 vs 96x2 after 2 days). I also need to remove the `-fpe0` flag. Still doesn't make sense.

I verified that when I build with DEBUG=TRUE, but change the flags for `micro_p3.F90` only (to use `-O1` and remove `-fpe0`), the test I've been using is BFB, as well as the following "standard" tests:

By using openmp critical sections, I can narrow down the issue a little further. If the main loop in subroutine `p3_main_part2` in `micro_p3.F90` is within a thread critical section, the results are BFB (at least with the ne4 tests I've been trying). I tried to narrow down further within the loop, but it may not be straightforward, as I suspect there is a thread-race-like issue there and any thread-related change can change the outcome.