E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

BFBFLAG=True still gives NBFB results with different PElayout on compy #4038

Closed: wlin7 closed this issue 3 years ago

wlin7 commented 3 years ago

BFBFLAG=True is expected to give BFB reproducibility across different PE layouts. That is not the case with a recent master running on compy with the intel compiler.

The problem can be reproduced with the following code and configuration:

          master hash: c0b0c779bbf6728153765e02a37d6057f6a73cd8
          compset:     A_WCYCL1850S_CMIP6
          grid:        ne30pg2_r05_EC30to60E2r2-1900_ICG

The simulations were done using the following two run scripts (the 1st uses 90 nodes -- PE=L, the 2nd 46 nodes -- PE=M; all MPI):

 /qfs/people/linw288/E3SM/Cases/prod/scripts/run.20201225.alpha5_59_fallback.piControl.EC30to60E2r2.compy.csh
/qfs/people/linw288/E3SM/Cases/prod/scripts/run.20210114M.alpha5_59_fallback.piControl.EC30to60E2r2.compy.csh
wlin7 commented 3 years ago

Hi @jonbob , can you please point me to the test that is designed to check BFB with different PE layouts? Thanks.

jonbob commented 3 years ago

@wlin7 - sure. Here's an example:

./create_test PEM_P480_Ld5.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.anvil_intel

The "PEM" prefix is described as:

PEM  modified pe counts mpi bfb test (seq tests)
     do an initial run with default pe layout (suffix: base)
     do another initial run with modified pes (NTASKS_XXX => NTASKS_XXX/2) (suffix: modpes)
     compare base and single_thread

The P480 just indicates the number of pes for the initial layout and is not required. The Ld5 specifies that the test will run for five days, but is also not required. The rest of the test name is the typical grid.compset.machine_compiler. But please let me know if I can help make more sense of it.

wlin7 commented 3 years ago

Thanks, @jonbob . It appears this test has been failing since at least last October.

jonbob commented 3 years ago

@wlin7 - that one was perhaps a bad example. I cut-and-pasted it because I had just run it while testing a PR that fixed that particular case, although those failures are due to the _ISMF coding in mpas-o and not compy or its compiler. I meant it only as an example of setting up a PEM test. You would want to try something more like:

PEM_P3600_Ld5.ne30pg2_r05_EC30to60E2r2-1900_ICG.A_WCYCL1850S_CMIP6.compy_intel

to test your exact issue

rljacob commented 3 years ago

Just noting that ERP tests will also change the task count. Here are the tests that have been passing (on sandiatoss3, but they also pass on the other machines we run them on: bebop, anvil, theta):

 ERP_Ld3.ne30_oECv3_ICG.A_WCYCL1850S.sandiatoss3_intel.allactive-pioroot1
 ERP_Ln9.ne4_ne4.FC5AV1C-L.sandiatoss3_intel
 ERP_Ln9.ne4_ne4.F-EAMv1-RCEMIP.sandiatoss3_intel
 ERP_Ln9.ne4_ne4.F-EAMv1-AQP1.sandiatoss3_intel

So it must have something to do with the resolution or the CMIP6 compset.

singhbalwinder commented 3 years ago

@wlin7 : How long does it take to show the differences? Do they show up right after the first time step?

wlin7 commented 3 years ago

@singhbalwinder , for my tests with PE=M vs PE=L on compy, global stats (in atm.log) start to differ from step 2.

worleyph commented 3 years ago

So this should be reproducible with an F case then? Might make diagnosis easier.

singhbalwinder commented 3 years ago

I was going to say the same thing. @wlin7 : Have you tried an F case yet to see if the problem is reproducible there?

wlin7 commented 3 years ago

Good idea, @worleyph , @singhbalwinder . I haven't tried, doing it now.

singhbalwinder commented 3 years ago

If the ne4 test is passing but ne30 is not, one of the differences between them is the time step. The time step also drives how many times radiation is called. If you can confirm that you see this with an F case as well, I can run a test to find out which parameterization is causing the difference.

wlin7 commented 3 years ago

Reporting back: NBFB also for F20TRC5-CMIP6 starting step 2, grid ne30pg2_r05_oECv3. One run using 4 nodes, the other 8 nodes.

wlin7 commented 3 years ago

I can run a test to find out which parameterization is causing the difference.

That would be great, Balwinder. Thanks.

rljacob commented 3 years ago

@wlin7 c0b0c779bbf67 is from Dec 24. You should confirm this happens with latest master.

rljacob commented 3 years ago

Also the BFBFLAG is not a make-everything-BFB flag. It only addresses interpolation in the coupler. Lots of other ways to break BFB when changing processor count.

worleyph commented 3 years ago

I don't see a problem (yet) with master using FC5AV1C-L and ne30pg2_ne30pg2 (160x1 compared to 80x1, and also comparing phys_loadbalance=0 and phys_loadbalance=2).

wlin7 commented 3 years ago

Also the BFBFLAG is not a make-everything-BFB flag. It only addresses interpolation in the coupler. Lots of other ways to break BFB when changing processor count.

That is true, Rob. Initially I thought the problem only occurred with the B case, and the first thing I would check in that case is BFBFLAG. The title now appears misleading.

Also, PEM_P480_Ld5.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.compy_intel failed in the latest report (Jan. 9).

rljacob commented 3 years ago

That test was failing for a while and the fix was just merged to master on Wednesday (PR #4025).

wlin7 commented 3 years ago

@rljacob , this issue emerged while testing for PR #4007, which branched off master from Dec. 24. I used existing baseline tests from it for comparison. We would be really lucky if the issue had gone away in current master. That said, I am going to run a pair of tests on current master with that particular compset and grid.

wlin7 commented 3 years ago

That test was failing for a while and the fix was just merged to master on Wed. PR #4025

Oh, thanks. This current issue is a completely different one, then.

rljacob commented 3 years ago

PEM_PL_Ld5.ne30pg2_r05_EC30to60E2r2-1900_ICG.A_WCYCL1850S_CMIP6.chrysalis_intel PASS using 0ad588d81 (Jan 7 master)

worleyph commented 3 years ago

Tried 1 day of

 -compset F20TRC5-CMIP6 -res ne30pg2_r05_oECv3

for 4 nodes (160x1) and 8 nodes (320x1), and they were BFB with respect to the atm.log. So, I can't reproduce this issue.

rljacob commented 3 years ago

@worleyph what hash, what machine?

PEM_Ld5.ne30pg2_r05_EC30to60E2r2-1900_ICG.A_WCYCL1850S_CMIP6 also passes on anvil (0ad588d)

I reset my local master to c0b0c77 and PEM_PL_Ld5.ne30pg2_r05_EC30to60E2r2-1900_ICG.A_WCYCL1850S_CMIP6.chrysalis_intel still PASSes, so this is at worst a compy problem.

worleyph commented 3 years ago

Compy, master (updated today), intel compiler

 -compset F20TRC5-CMIP6 -res ne30pg2_r05_oECv3 -project e3sm -compiler intel

 $ git describe
 v2.0.0-alpha.2-2079-g10c732f
wlin7 commented 3 years ago

Did 3 tests with the Jan. 15 master (f723ff4); the results are consistent with the earlier ones that used c0b0c77: NBFB between the 4-node and 8-node PE layouts, while BFB between the two hashes when using the same PE layout (4 or 8 nodes).

But there is another odd behavior, good or bad: The 3rd test with the Jan. 15 master used 2 nodes. It is BFB with the run using 4 nodes.

The 3 tests used F20TRC5-CMIP6 and ne30pg2_r05_oECv3. The run script mirrors that for alpha5_59, so the parameters have some differences from those in @worleyph 's runs. I didn't expect that a particular atm namelist setting could lead to such behavior, but I won't rule it out now given the BFB results from Pat.

worleyph commented 3 years ago

@wlin7 , just compared my experiments with yours. Beyond the modifications in user_nl_eam, you also built with -cosp and I did not. I'll try again with COSP specified.

worleyph commented 3 years ago

Just adding -cosp made no difference (still BFB), but after also adding the user_nl_eam from your case I finally see diffs. They start before "nstep, te 2" though:

  nstep, te        1   0.26280462287542272E+10   0.26280544394920521E+10   0.45402618135089546E-03   0.98530761010207716E+05
 chlorine_loading_advance: date, loading : 1850-01-01-01800,      0.457104
  nstep=          12  time=   3600.00000000000       [s]
   u     =  -0.692077522701391E+02 (  1)  0.152680773892293E+03 (  1)  0.513234954295678E+08
 ----
  nstep, te        1   0.26280462287542272E+10   0.26280544394920521E+10   0.45402618135089546E-03   0.98530761010207716E+05
 chlorine_loading_advance: date, loading : 1850-01-01-01800,      0.457104
  nstep=          12  time=   3600.00000000000       [s]
   u     =  -0.692077872626470E+02 (  1)  0.152680775525246E+03 (  1)  0.513236023100319E+08

(so in u).

So, something is being turned on by the user_nl_eam additions. I'll try this with other cases, just to see if a simpler case also shows this (F20TRC5-CMIP6 takes 10 minutes just reading in the input data; other F compsets don't take that long).

worleyph commented 3 years ago

Can reproduce the issue with

 -compset FC5AV1C-L -res ne30pg2_ne30pg2

using the user_nl_eam from @wlin7 's cases (after removing the history tape additions - not all fields are recognized for this compset). Next step is to determine which user_nl_eam modifications are relevant.

worleyph commented 3 years ago

On Compy (at least - haven't tried other systems) when using

 -compset FC5AV1C-L -res ne30pg2_ne30pg2 -compiler intel

320x1 and 160x1 PE layouts are not BFB (looking at atm.log) if user_nl_eam has the following namelist modifications:

  clubb_use_sgv          = .true.
  zmconv_tp_fac          = 2.0D0

(both are required).

I haven't yet tried looking at the code to figure out why.

worleyph commented 3 years ago

Neither 160x1 nor 80x1 are BFB with respect to changing load balancing (from 2 to 0) when

   clubb_use_sgv          = .true.
   zmconv_tp_fac          = 2.0D0

are added to user_nl_eam. And this is true for ne30 as well as ne30pg2. Trying ne4/ne4pg2 next.

worleyph commented 3 years ago

On Compy and using master, if add

    clubb_use_sgv          = .true.
    zmconv_tp_fac          = 2.0D0

to user_nl_eam, then both

 -compset FC5AV1C-L -res ne4_ne4 -project e3sm -compiler intel

and

 -compset FC5AV1C-L -res ne4pg2_ne4pg2 -project e3sm -compiler intel

for a 2x1 PE layout are not BFB with respect to changing phys_loadbalance from 2 to 0.

This may be easier to debug than the ne30/ne30pg2 case using 320 and 160 processes.

worleyph commented 3 years ago

The same nonreproducibility with respect to changing phys_loadbalance occurs on Chrysalis as well (with the indicated changes to user_nl_eam). So this is not just a Compy problem.

Details:

 -compset FC5AV1C-L -res ne4_ne4 -mach chrysalis -compiler intel

2x1 PE layout, and user_nl_eam contains either

 phys_loadbalance = 2
 clubb_use_sgv          = .true.
 zmconv_tp_fac          = 2.0D0

or

 phys_loadbalance = 0
 clubb_use_sgv          = .true.
 zmconv_tp_fac          = 2.0D0

Note: I also checked that it is BFB wrt changing phys_loadbalance if I do not include the other user_nl_eam modifications.

rljacob commented 3 years ago

Why did you start changing phys_loadbalance? I thought all you need was to add the clubb_use_sgv, zmconv_tp_fac flags?

worleyph commented 3 years ago

Phys_loadbalance was how I was demonstrating nonBFB behavior. Much easier experiment to run than comparing two PE layouts. Same underlying problem being exposed.


worleyph commented 3 years ago

And nonBFB behavior when changing phys_loadbalance (with the other two user_nl_eam mods fixed) can be demonstrated with ne4 and 2 processes. Changing PE layout from 80x1 to 160x1 for ne30 was still BFB; only 160x1 to 320x1 was not BFB. Changing phys_loadbalance always seems to demonstrate the problem.


rljacob commented 3 years ago

ok thanks.

PEM_PL_Ld5.ne30pg2_r05_EC30to60E2r2-1900_ICG.A_WCYCL1850S_CMIP6.chrysalis_intel.allactive-wcprod FAILs. That testmod adds many of the v2 settings including the 2 you found.

worleyph commented 3 years ago

Looking at the code, the default value of zmconv_tp_fac is zero, so setting it to

  zmconv_tp_fac          = 2.0D0

makes expressions such as

 tp_fac*tpert(i)

nonzero, and

 clubb_use_sgv          = .true.

controls how tpert(i) is calculated. So, the issue is likely in clubb_intr.F90. Looking at this code, there are some "style" issues that should be corrected (may or may not be related to the current issue), e.g. using real exponents unnecessarily?

  vmag(i)         = max(1.e-5_r8,sqrt( umb(i)**2._r8 + vmb(i)**2._r8)) 

(why not vmag(i) = max(1.e-5_r8, sqrt( umb(i)**2 + vmb(i)**2 )) ?)

and perhaps applying _r8 to exponents unintentionally?

 1.e-3_r8

I guess the compiler sees 1.e-3_r8 as the number 1.e-3 of kind r8 (and not the exponent as 3_r8)? Just looks funny at first glance.
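
For illustration only, a minimal standalone Fortran sketch (not E3SM code; it just assumes r8 is double precision) of the two points above about integer vs. real exponents and the kind suffix on literals:

 program exponent_demo
    ! Not E3SM code: illustrates the two style points discussed above.
    implicit none
    integer, parameter :: r8 = selected_real_kind(12)
    real(r8) :: x
    x = 3.0_r8
    ! An integer exponent (x**2) can be compiled as x*x; a real exponent
    ! such as 2._r8 generally goes through a pow-style evaluation.
    print *, x**2, x**2._r8
    ! The kind suffix applies to the whole literal: 1.e-3_r8 is the
    ! real(r8) value 0.001, not an exponent of 3_r8.
    print *, 1.e-3_r8
 end program exponent_demo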

rljacob commented 3 years ago

Changing load_balance and changing the task count will both change the horizontal decomposition. So it's odd that a piece of the non-decomposed column physics is triggering this.

If it was an optimization weirdness, I think you would see different numbers if you just compile and run twice without changing the decomposition.

worleyph commented 3 years ago

I agree, but I do get repeatable results from (many) identical cases. Haven't tried to examine this thoroughly though. Could be a real bug, but I haven't found one yet.

worleyph commented 3 years ago

I tried "expanding" the loop at line 2691 in clubb_intr.F90 into a separate loop for each variable, and then split

      tpert(i) = min(2._r8,(sqrt(thlp2(i,ktopi(i)))+(latvap/cpair)*state1%q(i,ktopi(i),ixcldliq)) &
                /max(state1%exner(i,ktopi(i)),1.e-3_r8)) !proxy for tpert                                                                                    

into many steps, each with its own pcols-indexed temporary, and the numerics did not change at all. I am less inclined to blame the compiler (for this loop). The next thing is to check whether any data used in this loop is already coming in "non-BFB".
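
For reference, a rough sketch of the kind of split described above (the temporary names are hypothetical, and it assumes the surrounding clubb_intr.F90 context: ncol, pcols, i, ktopi, thlp2, state1, latvap, cpair, ixcldliq):

   ! Sketch only, not the actual change tested: each intermediate term gets its
   ! own pcols-sized temporary so the pieces can be written out and compared
   ! between the two runs.
   real(r8) :: sqrt_thlp2(pcols), liq_term(pcols), exner_floor(pcols)

   do i = 1, ncol
      sqrt_thlp2(i)  = sqrt(thlp2(i,ktopi(i)))
      liq_term(i)    = (latvap/cpair)*state1%q(i,ktopi(i),ixcldliq)
      exner_floor(i) = max(state1%exner(i,ktopi(i)), 1.e-3_r8)
      tpert(i)       = min(2._r8, (sqrt_thlp2(i) + liq_term(i))/exner_floor(i)) ! proxy for tpert
   end do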

Don't know if @singhbalwinder 's tool would be useful for tracking this down from here.

singhbalwinder commented 3 years ago

It is great that Pat was able to reproduce it with the ne4 grid; it will make it much easier to debug. I ran the pergro test using Pat's reproducer on Compy using Intel:

 phys_loadbalance = 2
 clubb_use_sgv          = .true.
 zmconv_tp_fac          = 2.0D0

or

 phys_loadbalance = 0
 clubb_use_sgv          = .true.
 zmconv_tp_fac          = 2.0D0

The "nbfb" is coming from the ZM scheme (zm_convr physics update) and I can see it in temperature, static energy and water vapor. It starts to effect other variables (e.g. num_a4) after CLUBB is called. I will see if I can find the exact line causing the diff.

worleyph commented 3 years ago

Great, @singhbalwinder . Thanks. I'd run out of easy things to try. I'll leave this to you now.

wlin7 commented 3 years ago

Excellent sleuthing work, thanks Pat. I was not paying attention to this thread and also came down to these two parameters by bisecting the additional parameters that were added to my run (and would be used for v2). Thank you Balwinder for debugging this as well. The changes were introduced as part of the v1p tuning.

rljacob commented 3 years ago

We know which 2 parameters lead to non-BFB behavior, but we don't know why. It makes no sense that some setting in the column physics would cause different answers when you change horizontal decompositions. And we can't allow that kind of non-BFB behavior, so either those param settings have to be removed (a workaround) or we have to find/fix the root problem.

ambrad commented 3 years ago

A guess is that there is uninitialized memory somewhere; different decomps or physics column sets (from phys_loadbalance) lead to different answers based on the data in the uninit'ed memory.
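
A toy standalone Fortran sketch of that hypothesis (not E3SM code; the pcols/ncol names only mimic the EAM physics-chunk convention): an array sized for pcols is initialized only up to ncol, so any reduction that mistakenly runs over pcols picks up stale memory, and the answer then depends on how columns are packed into chunks:

 program uninit_chunk_demo
    ! Not E3SM code: demonstrates the bug pattern, not the actual bug.
    implicit none
    integer, parameter :: r8 = selected_real_kind(12)
    integer, parameter :: pcols = 16
    integer :: ncol
    real(r8) :: work(pcols)
    ncol = 12                      ! varies with decomposition / phys_loadbalance
    work(1:ncol) = 1.0_r8          ! only the valid columns are initialized
    ! Summing over pcols instead of ncol includes uninit'ed tail values,
    ! so the result can change when the column-to-chunk packing changes.
    print *, 'sum over ncol :', sum(work(1:ncol))
    print *, 'sum over pcols:', sum(work)
 end program uninit_chunk_demo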

worleyph commented 3 years ago

I ran the 2-process reproducer on Chrysalis built with DEBUG=TRUE; it completed successfully (2 nsteps), and phys_loadbalance=0 and phys_loadbalance=2 were NOT BFB. The compiler flags for DEBUG include ...

 -O0 -g -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created

so, the compiler did not find any uninitialized memory (if that is what '-check uninit' does).

ambrad commented 3 years ago

Based on this Intel page, it may be worth adding -init=snan,arrays to the flags list. It's also possible valgrind will see more than these compiler checks alone, although valgrind tends to produce false positives, as well, particularly when vector instructions are used.

worleyph commented 3 years ago

@singhbalwinder 's perturbation growth test infrastructure is pretty effective at chasing down these sorts of issues. I'll give this a try when I get the chance, but I expect Balwinder to find the source quickly.

worleyph commented 3 years ago

@ambrad , I added '-init=snan,arrays', and the run failed. Unfortunately, it appears to be due to an unrelated issue, in particular because it fails in an identical way whether I include the modifications to user_nl_eam or not.

 [0] [chr-0061:2700510:0:2700510] Caught signal 8 (Floating point exception: floating-point invalid operation)
 [0] ==== backtrace (tid:2700510) ====
 [0]  0 0x0000000000055799 ucs_debug_print_backtrace()  ???:0
 [0]  1 0x0000000000012dd0 .annobin_sigaction.c()  sigaction.c:0
 [0]  2 0x0000000008ec9f8a ice_grid_mp_makemask_()  /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/components/cice/src/source/ice_grid.F90:1674
 [0]  3 0x0000000008ebac57 ice_grid_mp_latlongrid_()  /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/components/cice/src/source/ice_grid.F90:1223
 [0]  4 0x0000000008e980a3 ice_grid_mp_init_grid2_()  /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/components/cice/src/source/ice_grid.F90:338
 [0]  5 0x00000000093d5b9a cice_initmod_mp_cice_init_()  /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/components/cice/src/drivers/cpl/CICE_InitMod.F90:109
 [0]  6 0x0000000008ad9490 ice_comp_mct_mp_ice_init_mct_()  /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/components/cice/src/drivers/cpl/ice_comp_mct.F90:240
 [0]  7 0x0000000000482f6b component_mod_mp_component_init_cc_()  /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/cime/src/drivers/mct/main/component_mod.F90:257
 [0]  8 0x000000000042f6af cime_comp_mod_mp_cime_init_()  /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/cime/src/drivers/mct/main/cime_comp_mod.F90:1439
 [0] =================================
 [0] forrtl: error (75): floating point exception
 [0] Image              PC                Routine            Line        Source
 [0] libpnetcdf.so.3.0  0000155550C657BC  for__signal_handl     Unknown  Unknown
 [0] libpthread-2.28.s  000015554D622DD0  Unknown               Unknown  Unknown
 [0] e3sm.exe           0000000008EC9F8A  ice_grid_mp_makem        1674  ice_grid.F90
 [0] e3sm.exe           0000000008EBAC57  ice_grid_mp_latlo        1223  ice_grid.F90
 [0] e3sm.exe           0000000008E980A3  ice_grid_mp_init_         338  ice_grid.F90
 [0] e3sm.exe           00000000093D5B9A  cice_initmod_mp_c         109  CICE_InitMod.F90
 [0] e3sm.exe           0000000008AD9490  ice_comp_mct_mp_i         240  ice_comp_mct.F90
 [0] e3sm.exe           0000000000482F6B  component_mod_mp_         257  component_mod.F90
 [0] e3sm.exe           000000000042F6AF  cime_comp_mod_mp_        1439  cime_comp_mod.F90
 [0] e3sm.exe           0000000000479A74  MAIN__                    122  cime_driver.F90
 [0] e3sm.exe           000000000041C722  Unknown               Unknown  Unknown
 [0] libc-2.28.so       000015554CCD76A3  __libc_start_main     Unknown  Unknown
 [0] e3sm.exe           000000000041C62E  Unknown               Unknown  Unknown
ambrad commented 3 years ago

This is in CICE since this is an F-compset, so it's probably not of much interest. I speculate that this is an invalid but also inert operation: the part of the mesh containing the evidently uninit'ed data is probably not used during time stepping. My approach to this sort of thing, when chasing down uninit'ed memory elsewhere, is to put NaN checks into the failing but irrelevant code to get past it. Another approach would be to add the '-init=snan,arrays' flag to just the relevant translation units, e.g., the atm. Finally, sometimes valgrind is more useful because it flags all uninit'ed memory and continues to run rather than halting on the first error.
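
As an aside, a minimal standalone Fortran sketch (not E3SM code) of the NaN-check pattern mentioned above: skip an operation whose input is NaN so that -init=snan,arrays does not trap in code whose result is never used:

 program nan_guard_demo
    ! Not E3SM code: the NaN stands in for uninitialized (snan-filled) grid data.
    use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
    implicit none
    integer, parameter :: r8 = selected_real_kind(12)
    real(r8) :: angle
    angle = ieee_value(angle, ieee_quiet_nan)
    if (.not. ieee_is_nan(angle)) then
       print *, 'cos(angle) = ', cos(angle)
    else
       print *, 'skipping inert computation on NaN input'
    end if
 end program nan_guard_demo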