E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

BFBFLAG=True still gives NBFB results with different PElayout on compy #4038

Closed wlin7 closed 3 years ago

wlin7 commented 3 years ago

BFBFLAG=True is expected to give BFB reproducibility with different PE layouts. That is not the case with a recent master running on compy with the Intel compiler.
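
As general background on why the PE layout can affect bits at all, here is a minimal sketch in Python (not E3SM code). It assumes a hypothetical `chunked_sum` helper that splits a list across a number of "tasks", sums each piece, then combines the partial sums. Floating-point addition is not associative, so the grouping can change the result. This only illustrates the class of problem that reproducibility flags guard against; it is not the cause identified later in this thread.

```python
def ordered_sum(values):
    """Plain left-to-right floating-point sum (no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, nchunks):
    """Hypothetical stand-in for a reduction over `nchunks` tasks.

    Each task sums its own contiguous slice, then the partial sums are
    added together in task order.
    """
    size = -(-len(values) // nchunks)  # ceiling division
    partials = [ordered_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return ordered_sum(partials)


# Illustrative values chosen so that rounding depends on the grouping.
values = [1.0e16, 1.0, -1.0e16, 1.0]

one_task = chunked_sum(values, 1)
two_tasks = chunked_sum(values, 2)
print(one_task == two_tasks)
```

The same four numbers give different totals for the two groupings, which is why a sum whose grouping follows the decomposition is not BFB across layouts unless the order is fixed.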

The problem can be reproduced with the following code and configuration:

    master hash: c0b0c779bbf6728153765e02a37d6057f6a73cd8
    compset:     A_WCYCL1850S_CMIP6
    grid:        ne30pg2_r05_EC30to60E2r2-1900_ICG

The simulations were done using the following two run scripts (the 1st one uses 90 nodes -- PE=L, the 2nd 46 nodes -- PE=M, all MPI):

/qfs/people/linw288/E3SM/Cases/prod/scripts/run.20201225.alpha5_59_fallback.piControl.EC30to60E2r2.compy.csh
/qfs/people/linw288/E3SM/Cases/prod/scripts/run.20210114M.alpha5_59_fallback.piControl.EC30to60E2r2.compy.csh

singhbalwinder commented 3 years ago

So far my tests have confirmed what @worleyph mentioned earlier: it is the tpert variable that is causing this NBFB behavior. I can see in my tests that tpert takes different values in the zm_convr subroutine. For testing, I zeroed out tpert and the simulations were BFB. tpert itself is BFB if I add it to the history files and compare.

I am now trying to understand why tpert is different in the zm_convr subroutine for the same column in the phys_loadbalance = 0 vs. 2 cases. So far it seems to be related to the way ZM collects the columns to act on (ZM doesn't act on all the columns). I will report back as I learn more about it.
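
The comparison described above (a field written from two runs and checked for BFB) can be sketched as follows. This is a minimal Python illustration, not the tooling used in the thread; `first_bfb_mismatch` is a hypothetical helper, and the values are assumed to have already been read into plain lists of floats.

```python
import struct


def bits(x):
    """Return the 8 raw bytes of a double, so -0.0 and 0.0 compare as different."""
    return struct.pack("<d", x)


def first_bfb_mismatch(a, b):
    """Return the index of the first value that is not bit-for-bit equal.

    Returns None when the two sequences are BFB. Hypothetical helper
    for illustration only.
    """
    if len(a) != len(b):
        raise ValueError("sequences differ in length")
    for i, (x, y) in enumerate(zip(a, b)):
        if bits(x) != bits(y):
            return i
    return None


# Made-up values: the second entry differs only in the last bit.
run_a = [1.5, 0.1 + 0.2, 2.0]
run_b = [1.5, 0.3, 2.0]
print(first_bfb_mismatch(run_a, run_b))
```

Comparing raw bytes rather than using `==` is a design choice: it also flags sign-of-zero differences and treats identical NaN payloads as equal, which matches what "bit-for-bit" means.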

worleyph commented 3 years ago

Applying '-init=snan,arrays' just to EAM (not HOMME) resulted in a failure at the following location, but this again occurs whether user_nl_eam is modified or not. Looking at the code, this is due to sloppy coding, and is innocuous (I believe), but should be fixed.

 [0] forrtl: error (75): floating point exception
 [0] Image              PC                Routine            Line        Source
 [0] libpnetcdf.so.3.0  0000155550C657BC  for__signal_handl     Unknown  Unknown
 [0] libpthread-2.28.s  000015554D622DD0  Unknown               Unknown  Unknown
 [0] e3sm.exe           000000000230F03C  clubb_intr_mp_clu        2668  clubb_intr.F90
 [0] e3sm.exe           000000000285BF49  physpkg_mp_tphysb        2503  physpkg.F90
 [0] e3sm.exe           000000000282C48C  physpkg_mp_phys_r        1045  physpkg.F90
 [0] e3sm.exe           000000000089514F  cam_comp_mp_cam_r         250  cam_comp.F90
 [0] e3sm.exe           0000000000852B74  atm_comp_mct_mp_a         396  atm_comp_mct.F90
 [0] e3sm.exe           0000000000482E9B  component_mod_mp_         257  component_mod.F90
 [0] e3sm.exe           000000000043EB47  cime_comp_mod_mp_        2281  cime_comp_mod.F90
 [0] e3sm.exe           00000000004799B5  MAIN__                    122  cime_driver.F90

Code is

    where (kbfs .eq. -0.0_r8) kbfs = 0.0_r8

with kbfs declared as

   real(r8) :: kbfs(pcols)

but only initialized with

    do i=1,ncol
 ...
       call calc_obklen( th(i,pver), thv(i,pver), cam_in%cflx(i,1), cam_in%shf(i), rrho, ustar2(i), &
                         kinheat(i), kinwat(i), kbfs(i), obklen(i) )
    enddo
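
The pattern above can be sketched in Python (not the Fortran source). The array has `PCOLS` entries, only the first `ncol` are filled, and a whole-array operation then reads the uninitialized tail. NaN stands in for the signaling-NaN fill from '-init=snan,arrays'; Python does not trap on NaN, so the sketch only reports whether the tail was read. Restricting the operation to the first `ncol` entries is one possible remedy and is an assumption here; the thread does not say how this was fixed. All names are hypothetical.

```python
import math

PCOLS = 8  # hypothetical declared array size


def fill_kbfs(ncol):
    """Allocate PCOLS entries but initialize only the first ncol of them."""
    kbfs = [math.nan] * PCOLS  # stand-in for the snan fill
    for i in range(ncol):
        # Made-up values; every other column gets a negative zero.
        kbfs[i] = -0.0 if i % 2 == 0 else 0.5 * i
    return kbfs


def reads_uninitialized(kbfs, upper):
    """True if an operation over kbfs[0:upper] would read a NaN-filled entry."""
    return any(math.isnan(x) for x in kbfs[:upper])


def normalize_negative_zero(kbfs, ncol):
    """Replace -0.0 with 0.0, touching only the initialized columns."""
    for i in range(ncol):
        if kbfs[i] == 0.0:  # true for both 0.0 and -0.0
            kbfs[i] = 0.0
    return kbfs


ncol = 5
kbfs = fill_kbfs(ncol)
print(reads_uninitialized(kbfs, PCOLS))  # whole-array form
print(reads_uninitialized(kbfs, ncol))   # restricted form
kbfs = normalize_negative_zero(kbfs, ncol)
```

The whole-array form reads entries past `ncol`, which is what trips the floating point exception when those entries hold signaling NaNs; the restricted form leaves them alone.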

worleyph commented 3 years ago

Hopefully @singhbalwinder is making real progress :-). Perhaps we need to put together a campaign to identify and fix all failures identified by '-init=snan,arrays', but with this approach it is a serial, one-person job.

singhbalwinder commented 3 years ago

With help from @whannah1, I was able to locate what was causing this NBFB behavior. It was indeed related to the way ZM "gathers" or "collects" the columns to act on. My 10-time-step runs are now BFB with the namelist changes @worleyph suggested for the ne4 grid (phys_loadbalance = 0 vs. 2).

I will run some more tests and issue a PR soon so that others can test it as well.

wlin7 commented 3 years ago

Great work, @singhbalwinder. It is interesting that the issue only shows up with tpert, since ZM does this gather-to-act step for all the variables. Understandably, such gathering would interact with phys_loadbalance to some extent. Looking forward to your PR.

singhbalwinder commented 3 years ago

ZM gathers these columns for all variables, but tpert was a new addition and was missing from that loop. I have now added tpert to that loop.
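
To illustrate how a variable missing from a gather loop can make results depend on the column layout, here is a minimal Python sketch. It is not the ZM code: the data, the `ACTIVE` set, the chunk layouts, and the `temp + tpert` update are all made up for illustration. It assumes the scheme indexes every input by gathered position, so an ungathered array is read at the wrong column.

```python
# Hypothetical per-column inputs, keyed by global column id.
TEMP = {0: 280.0, 1: 290.0, 2: 285.0, 3: 300.0, 4: 295.0, 5: 288.0}
TPERT = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4, 4: 0.5, 5: 0.6}
ACTIVE = {1, 3, 4}  # columns the scheme acts on (made up)


def run_chunk(chunk, gather_tpert):
    """Process one chunk (a list of global column ids)."""
    temp = [TEMP[c] for c in chunk]
    tpert = [TPERT[c] for c in chunk]
    # Local indices of the active columns, i.e. the gather index list.
    ideep = [i for i, c in enumerate(chunk) if c in ACTIVE]
    temp_g = [temp[i] for i in ideep]
    # The bug being illustrated: tpert is used with gathered indices
    # but was never gathered itself.
    tpert_g = [tpert[i] for i in ideep] if gather_tpert else tpert
    return {chunk[i]: temp_g[k] + tpert_g[k] for k, i in enumerate(ideep)}


def run_layout(layout, gather_tpert):
    """Run every chunk of a layout and collect results by global column id."""
    out = {}
    for chunk in layout:
        out.update(run_chunk(chunk, gather_tpert))
    return out


# Two made-up ways of distributing the same six columns into chunks.
LAYOUT_A = [[0, 1, 2], [3, 4, 5]]
LAYOUT_B = [[1, 0, 3], [4, 2, 5]]

print(run_layout(LAYOUT_A, False) == run_layout(LAYOUT_B, False))
print(run_layout(LAYOUT_A, True) == run_layout(LAYOUT_B, True))
```

In this sketch the per-column input `TPERT` is identical in both layouts, yet the ungathered version picks up a neighbouring column's value, so the answer changes with the layout. Once tpert is gathered like the other inputs, both layouts agree.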

wlin7 commented 3 years ago

That makes sense. Thanks, @singhbalwinder.

worleyph commented 3 years ago

Great work, @singhbalwinder and @whannah1.

rljacob commented 3 years ago

No closing until this fix is merged to master.