Hi @jonbob, can you please point me to the test that is designed to check BFB with different PE layouts? Thanks.
@wlin7 - sure. Here's an example:
./create_test PEM_P480_Ld5.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.anvil_intel
The "PEM" prefix is described as:
PEM: modified pe counts mpi bfb test (seq tests)
do an initial run with the default pe layout (suffix: base)
do another initial run with modified pes (NTASKS_XXX => NTASKS_XXX/2) (suffix: modpes)
compare base and modpes
The P480 just indicates the number of pes for the initial layout and is not required. The Ld5 specifies that the test will run for five days, but is also not required. The rest of the test naming is the typical grid.compset.machine_compiler. But please let me know if I can help make more sense of it.
Thanks, @jonbob . It appears this test has been failing since at least last October.
@wlin7 - that one was perhaps a bad example. I cut-and-pasted it because I had just run it testing a PR that fixed the particular case, although those fails are due to the _ISMF coding in mpas-o and not compy or its compiler. I meant it only as an example of setting up a PEM test. You would want to try something more like:
PEM_P3600_Ld5.ne30pg2_r05_EC30to60E2r2-1900_ICG.A_WCYCL1850S_CMIP6.compy_intel
to test your exact issue
Just noting that ERP tests will also change the task count. Here are the tests that have been passing (on sandiatoss3, but they pass on the other machines we run them on: bebop, anvil, theta):
ERP_Ld3.ne30_oECv3_ICG.A_WCYCL1850S.sandiatoss3_intel.allactive-pioroot1
ERP_Ln9.ne4_ne4.FC5AV1C-L.sandiatoss3_intel
ERP_Ln9.ne4_ne4.F-EAMv1-RCEMIP.sandiatoss3_intel
ERP_Ln9.ne4_ne4.F-EAMv1-AQP1.sandiatoss3_intel
So it must have something to do with the resolution or the CMIP6 compset.
@wlin7 : How long does it take to show the differences? Do they show up right after the first time step?
@singhbalwinder, for my tests with PE=M vs PE=L on compy, the global stats (in atm.log) start to differ from step 2.
So this should be reproducible with an F case then? Might make diagnosis easier.
I was going to say the same thing. @wlin7 : Have you tried an F case yet to see if this is reproducible with an F case?
Good Idea, @worleyph , @singhbalwinder . I haven't tried, doing it now.
If the ne4 tests are passing but ne30 is not, one of the differences between them is the time step. The time step also drives how many times radiation is called. If you can confirm that you see this with an F case as well, I can run a test to find out which parameterization is causing the difference.
Reporting back: NBFB also for F20TRC5-CMIP6 starting step 2, grid ne30pg2_r05_oECv3. One run using 4 nodes, the other 8 nodes.
I can run a test to find out which parameterization is causing the difference.
That would be great, Balwinder. Thanks.
@wlin7 c0b0c779bbf67 is from Dec 24. You should confirm this happens with latest master.
Also the BFBFLAG is not a make-everything-BFB flag. It only addresses interpolation in the coupler. Lots of other ways to break BFB when changing processor count.
I don't see a problem (yet) with master using FC5AV1C-L and ne30pg2_ne30pg2 (160x1 compared to 80x1, and also comparing phys_loadbalance=0 and phys_loadbalance=2).
Also the BFBFLAG is not a make-everything-BFB flag. It only addresses interpolation in the coupler. Lots of other ways to break BFB when changing processor count.
That is true, Rob. Initially I thought the problem only occurred with a B case, and the first thing I would check in that case is BFBFLAG. The title now appears misleading.
Also, PEM_P480_Ld5.T62_oEC60to30v3wLI.GMPAS-DIB-IAF-ISMF.compy_intel failed in the latest report (Jan. 9).
That test was failing for a while and the fix was just merged to master on Wed. PR #4025
@rljacob, this issue emerged while testing for PR 4007, which branched off master of Dec. 24. I used the existing baseline tests from that point for comparison. We would be really lucky if the issue had gone away in current master. That said, I am going to run a pair of tests on current master with that particular compset and grid.
That test was failing for a while and the fix was just merged to master on Wed. PR #4025
Oh, thanks. This current issue is a completely different one, then.
PEM_PL_Ld5.ne30pg2_r05_EC30to60E2r2-1900_ICG.A_WCYCL1850S_CMIP6.chrysalis_intel PASS using 0ad588d81 (Jan 7 master)
Tried 1 day of
-compset F20TRC5-CMIP6 -res ne30pg2_r05_oECv3
for 4 nodes (160x1) and 8 nodes (320x1), and they were BFB with respect to the atm.log. So, I can't reproduce this issue.
@worleyph what hash, what machine?
PEM_Ld5.ne30pg2_r05_EC30to60E2r2-1900_ICG.A_WCYCL1850S_CMIP6 also passes on anvil (0ad588d)
I reset my local master to c0b0c77 and PEM_PL_Ld5.ne30pg2_r05_EC30to60E2r2-1900_ICG.A_WCYCL1850S_CMIP6.chrysalis_intel still PASSes, so this is at worst a compy problem.
Compy, master (updated today), intel compiler
-compset F20TRC5-CMIP6 -res ne30pg2_r05_oECv3 -project e3sm -compiler intel
$ git describe
v2.0.0-alpha.2-2079-g10c732f
Did 3 tests with the Jan. 15 master (f723ff4); the results are consistent with the earlier ones that used c0b0c77: non-BFB between the 4-node and 8-node PE layouts, while BFB between the two hashes when using the same PE layout (4 or 8 nodes).
But there is another odd behavior, good or bad: The 3rd test with the Jan. 15 master used 2 nodes. It is BFB with the run using 4 nodes.
The 3 tests used F20TRC5-CMIP6 and ne30pg2_r05_oECv3. The run script mirrors that for alpha5_59, so the parameters have some differences from those in @worleyph's runs. I didn't expect that a particular atm namelist setting could lead to such behavior, but it can't be ruled out now given the BFB tests from Pat.
@wlin7 , just compared my experiments with yours. Beyond the modifications in user_nl_eam, you also built with -cosp and I did not. I'll try again with COSP specified.
Just adding -cosp made no difference (still BFB), but after also adding the user_nl_eam from your case I finally see diffs. They start before "nstep, te 2" though:
nstep, te 1 0.26280462287542272E+10 0.26280544394920521E+10 0.45402618135089546E-03 0.98530761010207716E+05
chlorine_loading_advance: date, loading : 1850-01-01-01800, 0.457104
nstep= 12 time= 3600.00000000000 [s]
u = -0.692077522701391E+02 ( 1) 0.152680773892293E+03 ( 1) 0.513234954295678E+08
----
nstep, te 1 0.26280462287542272E+10 0.26280544394920521E+10 0.45402618135089546E-03 0.98530761010207716E+05
chlorine_loading_advance: date, loading : 1850-01-01-01800, 0.457104
nstep= 12 time= 3600.00000000000 [s]
u = -0.692077872626470E+02 ( 1) 0.152680775525246E+03 ( 1) 0.513236023100319E+08
(so in u).
So, something is being turned on by the user_nl_eam additions. I'll try this with other cases, just to see if a simpler case also shows this (F20TRC5-CMIP6 takes 10 minutes just reading in the input data; other F compsets don't take that long).
Can reproduce the issue with
-compset FC5AV1C-L -res ne30pg2_ne30pg2
using the user_nl_eam from @wlin7 's cases (after removing the history tape additions - not all fields are recognized for this compset). Next step is to determine which user_nl_eam modifications are relevant.
On Compy (at least - haven't tried other systems) when using
-compset FC5AV1C-L -res ne30pg2_ne30pg2 -compiler intel
320x1 and 160x1 PE layouts are not BFB (looking at atm.log) if user_nl_eam has the following namelist modifications:
clubb_use_sgv = .true.
zmconv_tp_fac = 2.0D0
(both are required).
I haven't yet tried looking at the code to figure out why.
Neither 160x1 nor 80x1 are BFB with respect to changing load balancing (from 2 to 0) when
clubb_use_sgv = .true.
zmconv_tp_fac = 2.0D0
are added to user_nl_eam. This is true for ne30 as well as ne30pg2. Trying ne4/ne4pg2 next.
On Compy and using master, if I add
clubb_use_sgv = .true.
zmconv_tp_fac = 2.0D0
to user_nl_eam, then both
-compset FC5AV1C-L -res ne4_ne4 -project e3sm -compiler intel
and
-compset FC5AV1C-L -res ne4pg2_ne4pg2 -project e3sm -compiler intel
for a 2x1 PE layout are not BFB with respect to changing phys_loadbalance from 2 to 0.
This may be easier to debug with than the case using ne30 or ne30pg2 and 320 and 160 processes.
The same nonreproducibility wrt changing phys_loadbalance occurs on Chrysalis as well (with the indicated changes to user_nl_eam), so this is not just a Compy problem.
Details:
-compset FC5AV1C-L -res ne4_ne4 -mach chrysalis -compiler intel
2x1 PE layout, and user_nl_eam contains either
phys_loadbalance = 2
clubb_use_sgv = .true.
zmconv_tp_fac = 2.0D0
or
phys_loadbalance = 0
clubb_use_sgv = .true.
zmconv_tp_fac = 2.0D0
Note: I also checked that it is BFB wrt changing phys_loadbalance if I do not include the other user_nl_eam modifications.
Why did you start changing phys_loadbalance? I thought all you need was to add the clubb_use_sgv, zmconv_tp_fac flags?
Phys_loadbalance was how I was demonstrating nonBFB behavior. Much easier experiment to run than comparing two PE layouts. Same underlying problem being exposed.
And nonBFB behavior when changing phys_loadbalance (with the other two user_nl_eam mods fixed) can be demonstrated with ne4 and 2 processes. Changing PE layout from 80x1 to 160x1 for ne30 was still BFB; only 160x1 to 320x1 was not BFB. Changing phys_loadbalance always seems to demonstrate the problem.
ok thanks.
PEM_PL_Ld5.ne30pg2_r05_EC30to60E2r2-1900_ICG.A_WCYCL1850S_CMIP6.chrysalis_intel.allactive-wcprod FAILs. That testmod adds many of the v2 settings including the 2 you found.
Looking at the code, the default value of zmconv_tp_fac is zero, so setting it to
zmconv_tp_fac = 2.0D0
makes expressions such as
tp_fac*tpert(i)
nonzero, and
clubb_use_sgv = .true.
controls how tpert(i) is calculated. So, the issue is likely in clubb_intr.F90. Looking at this code, there are some "style" issues that should be corrected (may or may not be related to the current issue), e.g. using real exponents unnecessarily?
vmag(i) = max(1.e-5_r8,sqrt( umb(i)**2._r8 + vmb(i)**2._r8))
(why not vmag(i) = max(1.e-5_r8,sqrt( umb(i)**2 + vmb(i)**2 )) ?)
and perhaps applying _r8 to exponents unintentionally?
1.e-3_r8
Guess the compiler sees 1.e-3_r8 as the number 1.e-3 with kind r8 (and not with a 3_r8 exponent)? It just looks funny at first glance.
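For what it's worth, here is a quick standalone check of both points (a sketch only, not E3SM code; r8, u, and v are just placeholders):
program exponent_check
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: u, v
  u = 3.0_r8
  v = 4.0_r8
  ! The _r8 is a kind parameter on the whole literal, so this is the number
  ! 0.001 with kind r8; the exponent is still just -3.
  print *, 1.e-3_r8
  ! Integer exponents can be strength-reduced to multiplies; real exponents
  ! imply a real-valued power, which is slower and one more place answers can vary.
  print *, max(1.e-5_r8, sqrt(u**2 + v**2))          ! integer exponents
  print *, max(1.e-5_r8, sqrt(u**2._r8 + v**2._r8))  ! real exponents, same value here
end program exponent_check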
Changing load_balance and changing the task count will both change the horizontal decomposition. So it's odd that a piece of the non-decomposed column physics is triggering this.
If it was an optimization weirdness, I think you would see different numbers if you just compile and run twice without changing the decomposition.
I agree, but I do get repeatable results from (many) identical cases. Haven't tried to examine this thoroughly though. Could be a real bug, but I haven't found one yet.
I tried "expanding" the loop at line 2691 in clubb_intr.F90 into a separate loop for each variable, and then split
tpert(i) = min(2._r8,(sqrt(thlp2(i,ktopi(i)))+(latvap/cpair)*state1%q(i,ktopi(i),ixcldliq)) &
/max(state1%exner(i,ktopi(i)),1.e-3_r8)) !proxy for tpert
into many steps, each with its own pcols-indexed temporary, and the numerics did not change at all. I am less inclined to blame the compiler (for this loop). Next thing is to check whether there is data that is being used in this loop that is coming in "non-BFB".
Don't know if @singhbalwinder 's tool would be useful for tracking this down from here.
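For reference, the splitting was along these lines (an illustrative, self-contained sketch with made-up inputs and temporary names, not the actual change to clubb_intr.F90):
program split_expression_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: pcols = 4
  real(r8), parameter :: latvap = 2.501e6_r8, cpair = 1004.64_r8
  real(r8) :: thlp2(pcols), q(pcols), exner(pcols)
  real(r8) :: t1(pcols), t2(pcols), t3(pcols)
  real(r8) :: tpert_orig(pcols), tpert_split(pcols)
  integer :: i

  ! made-up inputs, just to exercise the expression
  thlp2 = (/ 0.5_r8, 1.0_r8, 2.0_r8, 4.0_r8 /)
  q     = (/ 1.e-5_r8, 2.e-5_r8, 5.e-6_r8, 0._r8 /)
  exner = (/ 0.90_r8, 0.95_r8, 1.00_r8, 1.05_r8 /)

  ! the compound expression, shaped like the tpert calculation quoted above
  do i = 1, pcols
     tpert_orig(i) = min(2._r8, (sqrt(thlp2(i)) + (latvap/cpair)*q(i)) &
                                 / max(exner(i), 1.e-3_r8))
  end do

  ! the same calculation split into steps, each with its own pcols-indexed temporary
  do i = 1, pcols
     t1(i) = sqrt(thlp2(i))
  end do
  do i = 1, pcols
     t2(i) = (latvap/cpair)*q(i)
  end do
  do i = 1, pcols
     t3(i) = max(exner(i), 1.e-3_r8)
  end do
  do i = 1, pcols
     tpert_split(i) = min(2._r8, (t1(i) + t2(i)) / t3(i))
  end do

  ! zero here unless the compiler evaluates the compound form differently
  ! (e.g., via fused multiply-add); in the real code the numerics did not change
  print *, maxval(abs(tpert_orig - tpert_split))
end program split_expression_sketch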
It is great that Pat was able to reproduce it with the ne4 grid; it will make it much easier to debug. I ran the pergro test using Pat's reproducer on Compy using Intel, with either
phys_loadbalance = 2
clubb_use_sgv = .true.
zmconv_tp_fac = 2.0D0
or
phys_loadbalance = 0
clubb_use_sgv = .true.
zmconv_tp_fac = 2.0D0
The "nbfb" is coming from the ZM scheme (zm_convr physics update) and I can see it in temperature, static energy and water vapor. It starts to effect other variables (e.g. num_a4) after CLUBB is called. I will see if I can find the exact line causing the diff.
Great, @singhbalwinder . Thanks. I'd run out of easy things to try. I'll leave this to you now.
Excellent sleuthing work, thanks Pat. I was not paying attention to this thread and also came down to these two parameters by bisecting the additional parameters that were added to my run (and that would be used for v2). Thank you, Balwinder, for debugging this as well. The changes were introduced as part of the v1p tuning.
We know what 2 parameters lead to non-BFB behavior but we don't know why. It makes no sense that some setting in the column physics would cause different answers when you change horizontal decompositions. And we can't allow that kind of non-BFB behavior so either those param settings have to be removed (a workaround) or we have to find/fix the root problem.
A guess is that there is uninitialized memory somewhere; different decomps or physics column sets (from phys_loadbalance) lead to different answers based on the data in the uninit'ed memory.
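For example (an illustrative sketch only, not code from E3SM), the classic pattern would be a pcols-sized local that is only filled for the active columns but read over all of them:
program uninit_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: pcols = 16
  integer :: ncol, i
  real(r8) :: work(pcols), colsum

  ! active columns in this chunk; this is what changes with the PE layout
  ! and with phys_loadbalance
  ncol = 10

  ! only the active columns are filled in
  do i = 1, ncol
     work(i) = real(i, r8)
  end do

  ! bug: the reduction reads work(ncol+1:pcols), i.e. uninitialized memory,
  ! so the answer depends on whatever happens to be there
  colsum = sum(work)
  print *, colsum             ! the correct form would be sum(work(1:ncol))
end program uninit_sketch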
I ran the 2-process reproducer on Chrysalis built with DEBUG=TRUE; it completed successfully (2 nsteps), and phys_loadbalance=0 and phys_loadbalance=2 were NOT BFB. The compiler flags for DEBUG include ...
-O0 -g -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created
so, the compiler did not find any uninitialized memory (if that is what '-check uninit' does).
Based on this Intel page, it may be worth adding -init=snan,arrays
to the flags list. It's also possible valgrind will see more than these compiler checks alone, although valgrind tends to produce false positives, as well, particularly when vector instructions are used.
@singhbalwinder's perturbation growth test infrastructure is pretty effective at chasing down these sorts of issues. I'll give this a try when I get the chance, but I expect Balwinder to find the source quickly.
@ambrad , I added '-init=snan,arrays', and the run failed. Unfortunately, it appears to be due to an unrelated issue, in particular because it fails in an identical way whether I include the modifications to user_nl_eam or not.
[0] [chr-0061:2700510:0:2700510] Caught signal 8 (Floating point exception: floating-point invalid operation)
[0] ==== backtrace (tid:2700510) ====
[0] 0 0x0000000000055799 ucs_debug_print_backtrace() ???:0
[0] 1 0x0000000000012dd0 .annobin_sigaction.c() sigaction.c:0
[0] 2 0x0000000008ec9f8a ice_grid_mp_makemask_() /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/components/cice/src/source/ice_grid.F90:1674
[0] 3 0x0000000008ebac57 ice_grid_mp_latlongrid_() /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/components/cice/src/source/ice_grid.F90:1223
[0] 4 0x0000000008e980a3 ice_grid_mp_init_grid2_() /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/components/cice/src/source/ice_grid.F90:338
[0] 5 0x00000000093d5b9a cice_initmod_mp_cice_init_() /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/components/cice/src/drivers/cpl/CICE_InitMod.F90:109
[0] 6 0x0000000008ad9490 ice_comp_mct_mp_ice_init_mct_() /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/components/cice/src/drivers/cpl/ice_comp_mct.F90:240
[0] 7 0x0000000000482f6b component_mod_mp_component_init_cc_() /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/cime/src/drivers/mct/main/component_mod.F90:257
[0] 8 0x000000000042f6af cime_comp_mod_mp_cime_init_() /gpfs/fs1/home/ac.worleyph/E3SM/master/E3SM/cime/src/drivers/mct/main/cime_comp_mod.F90:1439
[0] =================================
[0] forrtl: error (75): floating point exception
[0] Image PC Routine Line Source
[0] libpnetcdf.so.3.0 0000155550C657BC for__signal_handl Unknown Unknown
[0] libpthread-2.28.s 000015554D622DD0 Unknown Unknown Unknown
[0] e3sm.exe 0000000008EC9F8A ice_grid_mp_makem 1674 ice_grid.F90
[0] e3sm.exe 0000000008EBAC57 ice_grid_mp_latlo 1223 ice_grid.F90
[0] e3sm.exe 0000000008E980A3 ice_grid_mp_init_ 338 ice_grid.F90
[0] e3sm.exe 00000000093D5B9A cice_initmod_mp_c 109 CICE_InitMod.F90
[0] e3sm.exe 0000000008AD9490 ice_comp_mct_mp_i 240 ice_comp_mct.F90
[0] e3sm.exe 0000000000482F6B component_mod_mp_ 257 component_mod.F90
[0] e3sm.exe 000000000042F6AF cime_comp_mod_mp_ 1439 cime_comp_mod.F90
[0] e3sm.exe 0000000000479A74 MAIN__ 122 cime_driver.F90
[0] e3sm.exe 000000000041C722 Unknown Unknown Unknown
[0] libc-2.28.so 000015554CCD76A3 __libc_start_main Unknown Unknown
[0] e3sm.exe 000000000041C62E Unknown Unknown Unknown
This is in CICE since this is an F-compset, so it's probably not of much interest. I speculate that this is an invalid but also inert operation: the part of the mesh containing the evidently uninit'ed data is probably not used during time stepping. My approach to this sort of thing, when chasing down uninit'ed memory elsewhere, is to put NaN checks into the failing but irrelevant code to get past it. Another approach would be to add the '-init=snan,arrays' flag to just the relevant translation units, e.g., the atm. Finally, sometimes valgrind is more useful because it flags all uninit'ed memory and continues to run rather than halting on the first error.
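As a sketch of that kind of NaN guard (illustrative only; the variable name is made up, and the real check would go at whatever line in ice_grid.F90 trips the exception):
program nan_guard_sketch
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none
  integer, parameter :: dbl_kind = selected_real_kind(12)
  real(dbl_kind) :: ulat

  ! stand-in for a grid value that was never filled in
  ulat = ieee_value(ulat, ieee_quiet_nan)

  ! guard the operation that would otherwise raise a floating-point exception,
  ! so the run can get past code that fails but is never actually used later
  if (ieee_is_nan(ulat)) then
     print *, 'skipping NaN-initialized grid point'
  else
     print *, cos(ulat)
  end if
end program nan_guard_sketch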
BFBFLAG=True is expected to give BFB reproducibility with different PE layouts. That is not the case with a recent master running on compy with the intel compiler.
The problem can be reproduced with the following code and configuration.
The simulations were done using the following two run scripts (the first uses 90 nodes -- PE=L, the second 46 nodes -- PE=M, all MPI).