E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Fatal error in PMPI_Waitall in DEBUG (with threads) using ne11 F-case with Eulerian transport and theta-l in H mode (cori-knl) #3628

Closed: ndkeen closed this issue 3 years ago

ndkeen commented 3 years ago

This issue started with a problem I encountered in scream, but we eventually ruled out scream code and can reproduce the problem in E3SM. It only occurs in DEBUG builds and only with threading. There were some PE layouts that did work, but most fail in the same way. The case uses theta-l in hydrostatic mode with Eulerian transport. So far, I've only tried this on cori-knl.

I'm using master from June 1st.

res=ne11_ne11
compset=FC5AV1C-L

./xmlchange CAM_TARGET=theta-l
./xmlchange DEBUG=true

cat <<EOF >> user_nl_cam                                                                                             
  semi_lagrange_cdr_alg = 3                                                                                          
  theta_hydrostatic_mode = .true.                                                                                    
  tstep_type = 5                                                                                                     
  state_debug_checks = .true.                                                                                        
  dtime = 300                                                                                                        
  se_tstep = 50                                                                                                      
  transport_alg = 0                                                                                                  
  dt_tracer_factor = 1                                                                                               
  hypervis_subcycle_q = 1                                                                                            
EOF                                                                                                                  

I'm running with stacked PE layouts and 64 MPI's per node. It has always run OK with 1 thread.

 56: Rank 56 [Mon Jun  8 11:14:22 2020] [c1-1c1s15n1] Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code
 56: forrtl: error (76): Abort trap signal
 56: Image              PC                Routine            Line        Source             
 56: e3sm.exe           000000000CE8EB0E  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000C6B4C20  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000CE61F50  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000CF8CB41  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000C85B552  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000C83FE00  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000C83FF46  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000C82E3A5  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000C82F453  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000382640F  bndry_mod_base_mp         142  bndry_mod_base.F90
 56: e3sm.exe           0000000003830151  bndry_mod_base_mp         533  bndry_mod_base.F90
 56: e3sm.exe           0000000003C9DB22  prim_advance_mod_        1034  prim_advance_mod.F90
 56: e3sm.exe           0000000003C6ACD1  prim_advance_mod_         394  prim_advance_mod.F90
 56: e3sm.exe           0000000003AB3029  prim_driver_base_        1222  prim_driver_base.F90
 56: e3sm.exe           0000000003AB28B7  prim_driver_base_        1082  prim_driver_base.F90
 56: e3sm.exe           0000000004DECBF9  dyn_comp_mp_dyn_r         375  dyn_comp.F90
 56: e3sm.exe           000000000B45B643  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000B412F20  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000B414514  Unknown               Unknown  Unknown
 56: e3sm.exe           000000000B3E5E34  Unknown               Unknown  Unknown
 56: e3sm.exe           0000000004DEC610  dyn_comp_mp_dyn_r         360  dyn_comp.F90
 56: e3sm.exe           0000000003824464  stepon_mp_stepon_         522  stepon.F90
 56: e3sm.exe           0000000000850F5D  cam_comp_mp_cam_r         329  cam_comp.F90
 56: e3sm.exe           0000000000820489  atm_comp_mct_mp_a         558  atm_comp_mct.F90

example case:
/global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/m34-jun1/f11.FC5AV1C-L.ne11_ne11.m34-jun1.n002p0128t64x2.1d.nr.pio2.intel.DEBUG.eul.hydro5

I can reproduce this with ne16 and ne30, but it seems to run OK with ne4 (using otherwise the same parameters).

I also tried using more MPI's and nodes -- 363 MPI's on 6 nodes still stopped in the same way.

Any suggestions on how to simplify further to rule something out?

PeterCaldwell commented 3 years ago

Is #3629 related to this?

ndkeen commented 3 years ago

Well, I am also trying these cases with GNU DEBUG to see if I can reproduce this (that PR only affects GNU DEBUG builds). I have been removing the -Og flag as a workaround when I do those tests, and thought I should try getting that change into the repo to make it easier for others.

I did just try the same case with GNU DEBUG (after removing -Og) and it does not stop (it keeps running ATM steps until it runs out of time). However, when I update GNU to the latest version (9.2), I get a failure, but it's not the same failure, and it even happens with 1 thread.


  1: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  1: 
  1: Backtrace for this error:
  1: #0  0x425132f in ???
  1:    at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
  1: #1  0x1f72769 in __micro_mg2_0_MOD_micro_mg_tend
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/physics/cam/micro_mg2_0.F90:1673
  1: #2  0x1bbcdbb in __micro_mg_cam_MOD_micro_mg_cam_tend
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/physics/cam/micro_mg_cam.F90:2247
  1: #3  0x10d98fa in __microp_driver_MOD_microp_driver_tend
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/physics/cam/microp_driver.F90:183
  1: #4  0x12100d9 in tphysbc
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/physics/cam/physpkg.F90:2620
  1: #5  0x1223984 in __physpkg_MOD_phys_run1
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/physics/cam/physpkg.F90:1054
  1: #6  0x5eaecf in __cam_comp_MOD_cam_run1
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/control/cam_comp.F90:250
  1: #7  0x5de425 in __atm_comp_mct_MOD_atm_init_mct
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/cpl/atm_comp_mct.F90:394
  1: #8  0x441062 in __component_mod_MOD_component_init_cc
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/cime/src/drivers/mct/main/component_mod.F90:248
  1: #9  0x42d527 in __cime_comp_mod_MOD_cime_init
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/cime/src/drivers/mct/main/cime_comp_mod.F90:2205
  1: #10  0x4392d7 in cime_driver
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/cime/src/drivers/mct/main/cime_driver.F90:122
  1: #11  0x4393a1 in main
  1:    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/cime/src/drivers/mct/main/cime_driver.F90:23
ndkeen commented 3 years ago

I also tried with impi instead of mpich. It times out, but it looks hung in the same Waitall as the original problem. Asking the program where it is, this is what I see:

#0  0x00002aaab96563a9 in _gnix_vc_nic_progress () from /global/common/software/m3169/cori/libfabric/1.6.2/lib/libfabric.so.1
#1  0x00002aaab963c76b in _gnix_nic_progress () from /global/common/software/m3169/cori/libfabric/1.6.2/lib/libfabric.so.1
#2  0x00002aaab963fe32 in _gnix_prog_progress () from /global/common/software/m3169/cori/libfabric/1.6.2/lib/libfabric.so.1
#3  0x00002aaab960a736 in __gnix_cq_sreadfrom.isra.0 () from /global/common/software/m3169/cori/libfabric/1.6.2/lib/libfabric.so.1
#4  0x00002aaab3b707b3 in fi_cq_read (cq=0x135127a8, buf=<optimized out>, count=<optimized out>) at /usr/include/rdma/fi_eq.h:385
#5  MPIDI_OFI_progress (vci=324085672, blocking=8325) at ../../src/mpid/ch4/netmod/ofi/ofi_progress.c:33
#6  0x00002aaab369490d in MPIDI_Progress_test (flags=324085672, req=0x2085) at ../../src/mpid/ch4/src/ch4_progress.c:179
#7  0x00002aaab3694696 in MPID_Progress_test (req=<optimized out>) at ../../src/mpid/ch4/src/ch4_progress.c:231
#8  MPID_Progress_wait (state=0x135127a8, req=0x2085) at ../../src/mpid/ch4/src/ch4_progress.c:293
#9  0x00002aaab3d1411a in PMPI_Waitall (count=324085672, array_of_requests=0x2085, array_of_statuses=0x1) at ../../src/mpi/request/waitall.c:46
#10 0x00002aaab320cb10 in pmpi_waitall_ (v1=0x135127a8, v2=0x2085, v3=0x1, ierr=0x10000000000007d) at ../../src/binding/fortran/mpif_h/waitallf.c:275
#11 0x000000000388bfa7 in bndry_mod_base::bndry_exchangev_core (par=..., ithr=0, buffer=...) at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/homme/src/share/bndry_mod_base.F90:142
#12 0x0000000003895ce9 in bndry_mod_base::bndry_exchangev_threaded (hybrid=..., buffer=...) at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/homme/src/share/bndry_mod_base.F90:533
#13 0x0000000003d0c576 in prim_advance_mod::advance_hypervis (elem=<error reading variable: value requires 11470464 bytes, which is more than max-value-size>, hvcoord=..., hybrid=..., deriv=..., nt=2, nets=1, nete=3, dt2=50, eta_ave_w=1)
    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/homme/src/theta-l/prim_advance_mod.F90:1034
#14 0x0000000003cd9725 in prim_advance_mod::prim_advance_exp (elem=<error reading variable: value requires 11470464 bytes, which is more than max-value-size>, deriv=..., hvcoord=..., hybrid=..., dt=50, tl=..., nets=1, nete=3, compute_diagnostics=4294967295)
    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/homme/src/theta-l/prim_advance_mod.F90:394
#15 0x0000000003b1c0e8 in prim_driver_base::prim_step (elem=<error reading variable: value requires 11470464 bytes, which is more than max-value-size>, hybrid=..., nets=1, nete=3, dt=50, tl=..., hvcoord=..., compute_diagnostics=4294967295)
    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/homme/src/share/prim_driver_base.F90:1222
#16 0x0000000003b1b976 in prim_driver_base::prim_run_subcycle (elem=<error reading variable: value requires 11470464 bytes, which is more than max-value-size>, hybrid=..., nets=1, nete=3, dt=50, single_column=.FALSE., tl=..., hvcoord=..., nsubstep=2)
    at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/homme/src/share/prim_driver_base.F90:1082
#17 0x0000000004b4649b in dyn_comp::L_dyn_comp_mp_dyn_run__360__par_region0_2_4 () at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/dynamics/se/dyn_comp.F90:375
#18 0x00002aaab281f9f3 in __kmp_invoke_microtask () from /opt/intel/compilers_and_libraries_2019.3.199/linux/compiler/lib/intel64/libiomp5.so
#19 0x00002aaab27dff96 in __kmp_invoke_task_func (gtid=324085672) at ../../src/kmp_runtime.cpp:7540
#20 0x00002aaab27e188b in __kmp_fork_call (loc=0x135127a8, gtid=8325, call_context=fork_context_intel, argc=125, microtask=0x2aaab990d574 <GNII_debug_level>, invoker=0xc, ap=0x7fffffdc1710) at ../../src/kmp_runtime.cpp:2529
#21 0x00002aaab27a0d60 in __kmpc_fork_call (loc=0x135127a8, argc=8325, microtask=0x1) at ../../src/kmp_csupport.cpp:365
#22 0x0000000004b45d9e in dyn_comp::dyn_run (dyn_state=..., rc=0) at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/dynamics/se/dyn_comp.F90:360
#23 0x0000000003889ffd in stepon::stepon_run3 (dtime=300, cam_out=..., phys_state=..., dyn_in=..., dyn_out=...) at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/dynamics/se/stepon.F90:522
#24 0x00000000008759cd in cam_comp::cam_run3 (cam_out=...) at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/control/cam_comp.F90:329
#25 0x00000000008447de in atm_comp_mct::atm_run_mct (eclock=..., cdata_a=..., x2a_a=..., a2x_a=...) at /global/cscratch1/sd/ndk/wacmy/m34-jun1/components/cam/src/cpl/atm_comp_mct.F90:558
#26 0x000000000047b6a5 in component_mod::component_run (eclock=..., comp=..., infodata=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., comp_prognostic=4294967295, comp_num=1, timer_barrier=..., timer_comp_run=..., run_barriers=.FALSE., ymd=10101, tod=7200, comp_layout=..., .tmp.SEQ_FLDS_X2C_FLUXES.len_V$31b4=4096, 
    .tmp.SEQ_FLDS_C2X_FLUXES.len_V$31b7=4096, .tmp.TIMER_BARRIER.len_V$31bc=19, .tmp.TIMER_COMP_RUN.len_V$31bf=11, .tmp.COMP_LAYOUT.len_V$31c5=32) at /global/cscratch1/sd/ndk/wacmy/m34-jun1/cime/src/drivers/mct/main/component_mod.F90:737
#27 0x0000000000444215 in cime_comp_mod::cime_run () at /global/cscratch1/sd/ndk/wacmy/m34-jun1/cime/src/drivers/mct/main/cime_comp_mod.F90:2819
#28 0x000000000046305d in cime_driver () at /global/cscratch1/sd/ndk/wacmy/m34-jun1/cime/src/drivers/mct/main/cime_driver.F90:133
#29 0x000000000041af12 in main ()
ndkeen commented 3 years ago

Update: I tried a few more cases in an attempt to get a failure with even fewer MPI's. These are all with DEBUG and the Intel compiler, using the ne11 case settings above. Since the Waitall error happens during init, when a job does not fail it runs several ATM steps before timing out (so I assume it's fine).

Layout (total MPI's, MPI's/node x threads)    Waitall error
128 MPI's, 64x2                               Yes
64 MPI's, 64x2                                Yes
48 MPI's, 32x2                                Yes
36 MPI's, 36x2                                Yes
32 MPI's, 32x2                                No
32 MPI's, 32x4                                Yes
8 MPI's, 64x2                                 No
4 MPI's, 4x2                                  No
2 MPI's, 1x2                                  No
1 MPI, 1x2                                    No
oksanaguba commented 3 years ago

Do you also see this in your logs?

33:  bad state in EOS, called from: prim_state_mod
33:  bad i,j,k=           1           1           2
33:  vertical column: dphi,dp3d,vtheta_dp
33:   1   -31288.9767        4.5819    17827.6988
33:   2   -30799.0665       -4.9258    23107.6401
33:   3   -30501.7801       -1.9789    30101.0371
33:   4   -30400.4504       14.1629    39593.8737

This is what I got; I would think this is the issue, and the waitall may be a distraction?

Except that I may not be running it the same way you do; I get this error and the waitall message. If it is the EOS error, there are ways to deal with it.

ndkeen commented 3 years ago

Yes, I do occasionally. But I currently think the waitall error is happening first. It would be good to rule that out, though, using barriers.

This is one directory where I was running many different attempts, changing the code to add more output or try different things:

/global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/m36-jun25/f11.FC5AV1C-L.ne11_ne11.m36-jun25.n002p036t32x2.nr.pio2.eul.hydro5.nc01.DEBUG.intel.mpt7712.RQ64

There are 43 runs, and only 14 of them have the EOS error.

Also, occasionally some of these runs do get past the point where the failure happens. When the waitall failure occurs, it is always at the same location (I think on the 18th call of that exchange function). So it looks like this is an error that does not always happen.

oksanaguba commented 3 years ago

I am trying to run with a small dynamics timestep; the code crashes with 2 threads but not with 1. I will try standalone homme.

oksanaguba commented 3 years ago

I was able to almost reproduce this with standalone homme and cime debug flags. The code crashes at the same spot, but only every other time.

oksanaguba commented 3 years ago

More runs: all 20 runs with 1 thread, ne11, and this exec line

srun -n 36 -c 8   --cpu_bind=cores   -m plane=32 ../src/theta-l/theta-l < inp.nl &> out-nt${nt}-i${I}

did not crash. With 2 threads, only 1 out of 20 runs did not crash. With 4 threads, all 20 crashed.

oksanaguba commented 3 years ago

I got this to crash with standalone homme, ne4, 2 MPI ranks, and 4 threads. I put 2 OMP barriers in the print_state code (not in bndry_exchange in TOM HV), and the error went away. I will try the F case again.

oksanaguba commented 3 years ago

While using intel17 I am not getting this crash with or without debug flags. With intel19 it crashes only with debug flags. I will clean up my code and re-run with master. I will try gnu as well. These were homme no-io runs.

oksanaguba commented 3 years ago

I tried to run this again today with standalone homme from the Aug 10th master, with debug flags and intel19, and saw no issues (I used the same scripts + 4 threads as before). I then submitted Noel's script that failed for me before, for an F case, and it ran for a few timesteps. Before, the crash was in the 1st homme timestep. So, I consider this resolved.

ndkeen commented 3 years ago

Indeed I just now tried to run some tests with a master (also from Aug 10th) and it is no longer failing.

Does it make sense to find the PR that "fixed it" ?

oksanaguba commented 3 years ago

I don't know; it could be Cori as well?

rljacob commented 3 years ago

No don't bother.

ndkeen commented 3 years ago

I still get the error in the scream repo.

I can also verify that a master from July 31st still exhibits this behavior.

ndkeen commented 3 years ago

Actually, when I run the test again, I do get failures. So it's not failing every time with 36 MPI's and 32x2. With 128 MPI's and 64x2, it fails every time for me.

oksanaguba commented 3 years ago

I ran 20 runs in standalone homme yesterday, but I will do it again.

oksanaguba commented 3 years ago

I repeated my runs with the old commit -- first with 2 ranks only (because that setup was failing before), intel19, and 4 threads, and they all passed. I changed the number of ranks to 24 and saw failures. I tried the same 24 ranks with intel17 and there were failures. I checked out master from Aug 10th and again there are failures for intel17 and intel19. The failures are nondeterministic.

oksanaguba commented 3 years ago

I should try one of the KNL testbeds. Also, there is Cori maintenance this week; we can wait until after that.

oksanaguba commented 3 years ago

On a testbed with intel17, standalone homme completes all 20 runs, while on Cori 18 out of 20 runs fail.

ndkeen commented 3 years ago

I wanted to give this problem another shot. There were 2 recent cases where this happened to others, one of them being a ne4 case on cori-knl. Using the ne11 case from above, I tried again on cori-knl with Intel MPI (including a newer version) and ran into the same error. I haven't managed to reproduce the error on cori-haswell, but I haven't tried many cases. On Anvil, my first few attempts did not hit an error, but then using more MPI's, I found the following. Because I haven't seen this before during many other DEBUG attempts, I'm not sure it's real or related to the issue above, but it should be straightforward to verify the code is OK (tagging @worleyph as I think he might have worked on it).

 67: forrtl: severe (408): fort: (2): Subscript #1 of the array LCHUNKS has value 999 which is greater than the upper bound of 998
 67:
 67: Image              PC                Routine            Line        Source
 67: e3sm.exe           000000000D4DCB90  Unknown               Unknown  Unknown
 67: e3sm.exe           000000000393C8B1  phys_grid_mp_phys        1015  phys_grid.F90
 67: e3sm.exe           000000000426DCA1  inital_mp_cam_ini          40  inital.F90
 67: e3sm.exe           000000000088064F  cam_comp_mp_cam_i         155  cam_comp.F90
 67: e3sm.exe           000000000083B745  atm_comp_mct_mp_a         315  atm_comp_mct.F90
 67: e3sm.exe           00000000004685ED  component_mod_mp_         257  component_mod.F90
 67: e3sm.exe           000000000042CF61  cime_comp_mod_mp_        1399  cime_comp_mod.F90
 67: e3sm.exe           000000000045F0E1  MAIN__                    122  cime_driver.F90
 67: e3sm.exe           000000000041A11E  Unknown               Unknown  Unknown
 67: libc-2.17.so       00002B590BBC03D5  __libc_start_main     Unknown  Unknown
 67: e3sm.exe           000000000041A029  Unknown               Unknown  Unknown

/lcrc/group/acme/ac.ndkeen/acme_scratch/anvil/m08-nov17/f11.ne11_ne11.m08-nov17.intel18.n008p144t18x2.DEBUG

This was with res=ne11_ne11, compset=FC5AV1C-L, compiler=intel18, and with 144 MPI's, 18 MPI's per node and 2 threads.

./xmlchange CAM_TARGET=theta-l

cat <<EOF >> user_nl_eam                                                                                                                                                                  
  semi_lagrange_cdr_alg = 3                                                                                                                                                               
  theta_hydrostatic_mode = .true.                                                                                                                                                         
  tstep_type = 5                                                                                                                                                                          
  state_debug_checks = .true.                                                                                                                                                             
EOF                                                                                                                                                                                       

Same error with 198 MPI's and 18x2; however, no error with 9 MPI's and 9x4, 18 MPI's and 9x4, or 72 MPI's and 18x2. All with DEBUG=TRUE.

worleyph commented 3 years ago

@ndkeen , I've duplicated this error on Anvil and am looking at it now.

worleyph commented 3 years ago

Still investigating ... I added some debug prints and it looked like a real bug, but I reran without DEBUG and that output is "correct". Somewhere there is some integer arithmetic that is doing something different when DEBUG is set. I'll keep looking.

ndkeen commented 3 years ago

OK. I wanted to add that I also get the same DEBUG fail with --compiler=intel/intel18. And it completes without threads (and DEBUG). It's possible that the root cause is further back, and then odd things start to happen afterwards.

worleyph commented 3 years ago

And I get the same error without using CAM_TARGET=theta-l .

ndkeen commented 3 years ago

OK, and then I just tried commenting out ALL of that extra stuff as well (an empty user_nl_eam), and it also stopped with the same error.

worleyph commented 3 years ago

I identified a line where an error shows up when compiled with DEBUG, but I have no idea why it is occurring. Subroutine proc_heap_adjust in phys_grid.F90 contains code to remove a process from the heap of processes that can accept new chunks:

    if (remove) then
       ! Swap process with the end of the heap and                                                                                   
       ! decrement heap length by 1                                                                                                  
       cost_last    = cost(heap_len)
       proc_last    = heap(heap_len)
       cost_updated = cost(updated)
       proc_updated = heap(updated)
 !                                                                                                                                   
       heap(updated)          = proc_last
       cost(updated)          = cost_last
       inv_heap(proc_last)    = updated
 !                                                                                                                                   
       heap(heap_len)         = proc_updated
       cost(heap_len)         = cost_updated
       inv_heap(proc_updated) = heap_len
 !                                                                                                                                   
       heap_len = heap_len - 1
    endif

The process being "removed" is swapped with the process currently at the end of the heap within the heap data structure. The heap length is then decremented by one, putting the removed process outside of the heap.

In particular,

             heap(updated)          = proc_last

is executed correctly by both DEBUG and non-DEBUG builds, but by the end of this if block, 'heap(updated)' contains the original value ('proc_updated') in the DEBUG build and not the correct value ('proc_last'). This occurs immediately after

            inv_heap(proc_updated) = heap_len

which might imply memory overwriting, but 'inv_heap(proc_updated)' is correct both before and after this line is executed, and is not the same value as heap(updated) in any case.

So it just stumps me that 'heap(updated)' is stuck at the original value (not even a garbage value, as you would expect from a misaligned memory overwrite), while 'heap(heap_len)' is updated correctly (in both DEBUG and non-DEBUG builds). On the off chance that some OpenMP region was added that included this part of the code, I added an OMP critical section around the above code, and it changed nothing, so it is not a threading issue as far as I can tell.

Note that this error is very persistent. I added write statements after many of the lines in this short section of code, and it changed nothing. The result is also deterministic.

I could try examining the impact of the different compiler flags that are set by DEBUG, but I am not sure how to proceed otherwise. I see no way that this could be a code error, but it would be a very strange compiler error (if that is what it is).

jayeshkrishna commented 3 years ago

I haven't looked at the code in detail, but could the issue be that the last proc in the heap is being updated (updated == heap_len)?

worleyph commented 3 years ago

No - I checked that. I deleted all DEBUG-specific compiler flags except for -O0, and nothing changed. I then deleted -O0 (so the default -O2), and it worked. Next I am adding back all of the original DEBUG-specific flags but keeping -O2.

ndkeen commented 3 years ago

Thanks, Pat. A minor update from me: I see an error with the GNU compiler and SMS_D_Ln5.ne4_ne4.FC5AV1C-L on cori-knl at the same MPI_Waitall as documented above. I started trying to use valgrind to help with this on Anvil, but I am running into other issues (too slow, stops at different places with valgrind).

worleyph commented 3 years ago

@ndkeen , I'm only looking at the Anvil case, which has a failure that has nothing to do with MPI_Waitall and occurs for all dycores (I assume).

ndkeen commented 3 years ago

I understand. But it may be that the root cause is something bad happening earlier that then manifests in different ways.

oksanaguba commented 3 years ago

@ndkeen is the error from mpi_waitall only on Cori?

worleyph commented 3 years ago

For the error on Anvil, it is eliminated for the one reproducer:

 res=ne11_ne11, compset=FC5AV1C-L, compiler=intel, and with 144 MPI's, 18 MPI's per node and 2 threads.

by changing -O0 to -O1 (or -O2) in

 set(FFLAGS "${FFLAGS}  -O0 -g -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created")

e.g.

 set(FFLAGS "${FFLAGS}  -O1 -g -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created")

in Macros.cmake. This doesn't explain why, but I only see the error when using -O0.

worleyph commented 3 years ago

The failure on Anvil using the default dycore is not reproducible on Cori-KNL (I tried 144x2 and 64x2 with DEBUG==true). Cori-KNL uses intel19 and Anvil uses intel17, but I don't know whether that is the reason for the difference, or simply that the very different processor architectures mean the compiler backends are very different. So the MPI_WAITALL issue with theta-l on Cori-KNL seems to be a separate issue (as far as I can tell).

worleyph commented 3 years ago

@ndkeen , what would you suggest as a (reliable) reproducer for the failure on Cori-KNL? My first attempt:

 res=ne11_ne11, compset=FC5AV1C-L, compiler=intel

 ./xmlchange CAM_TARGET=theta-l
 ./xmlchange DEBUG=true

 cat <<EOF >> user_nl_eam                                                                                                                                                                  
   semi_lagrange_cdr_alg = 3                                                                                                                                                               
   theta_hydrostatic_mode = .true.                                                                                                                                                         
   tstep_type = 5                                                                                                                                                                          
   state_debug_checks = .true.                                                                                                                                                             
 EOF                                                                                                                                                                                       

with a 64x2 PE layout worked fine.

ndkeen commented 3 years ago

I think Cori KNL is the only machine where I've seen the waitall error, but the only other machines I've tried are cori-haswell and anvil, which hit other errors with DEBUG.

Pat: It looks like the waitall error on cori-knl will not always happen, but it seems more likely with more MPI's and more threads. I just ran 36 MPI's with 32 MPI's per node (so in this case using 2 nodes). With 2 threads it completed 3 steps, but with 4 threads it failed with the waitall error.

Note that I may have uncovered use of uninitialized data here https://github.com/E3SM-Project/E3SM/issues/3956.

And I thought I might try GNU version 10, but hit https://github.com/E3SM-Project/E3SM/issues/3954

I've been trying to run the problem with valgrind: a) 1 MPI and 1 thread, b) 1 MPI and 2 threads, c) more MPI's, d) intel and gnu, e) anvil and cori-haswell. In one case the use of valgrind causes the code to fail (it runs without it). I am still trying to find something more useful for this issue.

worleyph commented 3 years ago

@ndkeen , thanks. I was able to reproduce the error using a 64x4 decomposition. I replaced

     call MPI_Waitall(nRecvCycles,buffer%Rrequest,buffer%status,ierr)  

in bndry_exchangeV_core with

     do icycle=1,nRecvCycles
        write(6,*) "ICYCLE:",icycle
        call flush(6)
        call MPI_Wait(buffer%Rrequest(icycle),buffer%status(1,icycle),ierr)
     enddo

and got more informative error messages. I haven't seen anything interesting in the icycle output, but the abort messages now look like:

 41: Rank 41 [Mon Nov 23 18:50:16 2020] [c5-1c1s8n1] Fatal error in PMPI_Wait: Message truncated, error stack:
 41: PMPI_Wait(207).....: MPI_Wait(request=0x2aaaaae32420, status=0x2aaaaae13680) failed
 41: MPIR_Wait_impl(100):
 41: do_cts(626)........: Message truncated; 9248 bytes received but buffer size is 3264

and

  3: Rank 3 [Mon Nov 23 18:50:16 2020] [c5-1c1s8n1] Fatal error in PMPI_Wait: Message truncated, error stack:
  3: PMPI_Wait(207)....................: MPI_Wait(request=0x2aaaaae3222c, status=0x2aaaaadffd1c) failed
  3: MPIR_Wait_impl(100)...............:
  3: MPIDI_CH3U_Receive_data_found(144): Message from rank 38 and tag 4 truncated; 1600 bytes received but buffer size is 960

I'll keep poking, but maybe someone else can diagnose this immediately based on the error messages. Of course, mismatches between send and receive message sizes shouldn't have anything to do with DEBUG compiler flags, so it is probably not anything simple.
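
For reference, this failure class is just a posted receive that is smaller than the matching send. A standalone two-rank sketch (not E3SM code, only to illustrate the message class) produces the same kind of "Message truncated" abort:

     program truncate_demo
       ! rank 0 sends 200 doubles (1600 bytes); rank 1 posts a receive for only
       ! 120 doubles (960 bytes), so the wait fails with a truncation error
       use mpi
       implicit none
       integer :: ierr, rank, req
       integer :: status(MPI_STATUS_SIZE)
       double precision :: sendbuf(200), recvbuf(120)

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       if (rank == 0) then
          sendbuf = 1.0d0
          call MPI_Send(sendbuf, 200, MPI_DOUBLE_PRECISION, 1, 4, MPI_COMM_WORLD, ierr)
       else if (rank == 1) then
          call MPI_Irecv(recvbuf, 120, MPI_DOUBLE_PRECISION, 0, 4, MPI_COMM_WORLD, req, ierr)
          call MPI_Wait(req, status, ierr)   ! aborts here: message truncated
       end if
       call MPI_Finalize(ierr)
     end program truncate_demo

Run on 2 ranks; with the default MPI_ERRORS_ARE_FATAL handler, rank 1 aborts inside MPI_Wait with a truncation message much like the ones above.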

ambrad commented 3 years ago

The message "1600 bytes received but buffer size is 960" intrigues me. If you go higher in the routine to where the MPI_Irecv calls are made, what do you see if you print icycle and length?

ambrad commented 3 years ago

Actually, it looks like you can just set Debug=.true. at the top of the routine and get the full send/recv schedule. It would be interesting if there is a scheduled message longer than expected.

worleyph commented 3 years ago

I tried my "trick" from Anvil and changed -O0 to -O1 in the DEBUG flags, but the error persists. Next I tried again with DEBUG==false (just to verify), and this does eliminate the error.

worleyph commented 3 years ago

Just repeating what was noted above - many tasks die with MPI errors, but some also die in

 60:  bad state in EOS, called from: prim_state_mod
 60:  bad i,j,k=           1           1           1
 60:  vertical column: dphi,dp3d,vtheta_dp
 ...
 60: e3sm.exe           00000000051483CC  shr_abort_mod_mp_         114  shr_abort_mod.F90
 60: e3sm.exe           000000000514822A  shr_abort_mod_mp_          61  shr_abort_mod.F90
 60: e3sm.exe           00000000026F68B3  eos_mp_pnh_and_ex         140  eos.F90

but maybe this is due to bad boundary exchange data.

worleyph commented 3 years ago

@ambrad

Actually, looks like you can just set Debug=.true. at the top of the routine and get the full send/recv schedule. It would be interesting if there's a scheduled message longer than expected.

Agreed. I'm poking at other stuff first (trying to determine which compiler flags are associated with the error appearing/disappearing), but will try this once I've identified the compiler flag sensitivities, so that I can compare working/not working output for the otherwise most similar cases.

ndkeen commented 3 years ago

When I first started looking at this, I also added several debug writes and tried to learn more, but after a few days I just got frustrated and needed to give it a break.

You can look here to see some of the debug writes: /global/cscratch1/sd/ndk/wacmy/m36-jun25/components/homme/src/share/bndry_mod_base.F90

I found that the error only occurs on certain calls to bndry_exchangeV(). I added writes to all places in the code that call this function, and I found:

34:  ndk calling bndry_exchangeV(par,edge) in mass_matrix_mod.F90
34:  ndk calling bndry_exchangeV(par,edge) #2 in mass_matrix_mod.F90
34:  ndk calling bndry_exchangeV(par,edge) in mass_matrix_mod.F90
34:  ndk calling bndry_exchangeV(par,edge) #2 in mass_matrix_mod.F90
34: ndk calling bndry_exchangeV in initdat.F90 edge_g%tag=         4 edge_g%id=         4 edge_g%nlyr=      3098 size(edge_g%status)=        30
34:  calling bndry_exchangeV(hybrid,edge_g) in gravity_waves_sources.F90
34:  calling bndry_exchangeV(hybrid,edge_g) in gravity_waves_sources.F90
34:  calling bndry_exchangeV(par, edge_g) in stepon.F90

Another thing I tried was actually looking at the error code that the error message indicates. I'm still not sure exactly what is stored in the buffer%status array, but I printed all of it. I found that some of these values were very wild, so I thought this might be a lead. However, the odd values printed from this array were not always there. I am leaning towards either a) the values in the error buffer are nonsense, b) I'm grabbing or writing them incorrectly, or c) something is already wrong before the code gets here.

    call MPI_Waitall(nRecvCycles,buffer%Rrequest,buffer%status,ierr)
    if(ierr .ne. MPI_SUCCESS) then
       print*, " ndk MPI_Waitall for recvs had bad ierr"
    endif
    do i=1,nRecvCycles
       !errorcode=buffer%status(2,i)                                                                                                                                                                                                                                
       !call MPI_Error_String(errorcode,errorstring,errorlen,ierr)                                                                                                                                                                                                  
       !write(*,'(a,i8,a,i10,1x,a80,a,10(i12,1x))') " ncall=", ncall, " i=", i, trim(adjustl(errorstring)), " recv buffer%status(i)=", buffer%status(1,i), buffer%status(2,i), buffer%status(3,i), buffer%status(4,i), buffer%status(5,i)                           
       write(*,'(a,i8,a,i10,1x,a,10(i12,1x))') " ncall=", ncall, " i=", i, " recv buffer%status(i)=", &
            buffer%status(1,i), buffer%status(2,i), buffer%status(3,i), buffer%status(4,i), buffer%status(5,i)
    enddo
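
(One caveat when interpreting those values: per the MPI standard, the error field of each status is only defined when MPI_Waitall itself returns MPI_ERR_IN_STATUS, and with the default fatal error handler the call aborts before returning at all, which would be consistent with option (a) above. A guard of roughly this form, reusing the same variables, restricts the printing to the meaningful case:)

    ! the per-request error fields are defined only when Waitall returns
    ! MPI_ERR_IN_STATUS (which requires an error handler that returns)
    if (ierr == MPI_ERR_IN_STATUS) then
       do i=1,nRecvCycles
          write(*,*) " recv request ", i, " status error field = ", buffer%status(MPI_ERROR,i)
       enddo
    endif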

Another thing I tried was using MPI_STATUS_IGNORE like so:

    call MPI_Waitall(nRecvCycles,buffer%Rrequest,MPI_STATUS_IGNORE,ierr) ! will yield diff error                                                                                                                                                                   

And this gave me a different error, so I wasn't sure how much to pursue it.
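
(For what it's worth, the ignore constant for the array-of-statuses argument of MPI_Waitall is the plural MPI_STATUSES_IGNORE; MPI_STATUS_IGNORE is only defined for the single-request calls such as MPI_Wait, so a different error from that variant is not too surprising. The conforming form, again reusing the same variables, would be:)

    ! MPI_Waitall takes MPI_STATUSES_IGNORE (plural) when statuses are not needed
    call MPI_Waitall(nRecvCycles, buffer%Rrequest, MPI_STATUSES_IGNORE, ierr)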

I would also see the same "bad state in EOS" error under certain situations (I can't recall whether it came from changing compilers, other options, or just every now and then), but it was not clear whether this was a cause or a symptom.

ambrad commented 3 years ago

Has anyone shown this error does not happen with preqx? This issue started with some SCREAM runs, and thus the focus on theta-l, but it would be interesting to determine whether this issue is independent of dycore.

worleyph commented 3 years ago

I tried without using theta-l, and it worked fine through 10 steps.

I tried with all DEBUG settings in Macros.cmake and Macros.make identical to the non-DEBUG settings, and it still died, now with a divide-by-zero. DEBUG is set to false in the CMAKE command line, but I do not know how or where it is propagated inside. Apparently it does more than just set the compiler flags. I am not sure how to determine what else it does (@jgfouca ?)

ambrad commented 3 years ago

I tried without using theta-l, and it worked fine through 10 steps.

Just to be clear since this problem is I believe nondeterministic: Multiple runs with preqx never fail?

One other thing worth reviewing: A complicated part of theta-l in our usual configuration is SL transport; SLT has complicated comm, including in initialization. But in the summer we ruled out SL transport by demonstrating this error occurs when these additional user_nl_eam settings are used to switch back to Eulerian transport:

  transport_alg = 0  
  dt_tracer_factor = 1 
  hypervis_subcycle_q = 1 

This, combined with an assertion that the failure does not occur with preqx, implies a small set of possible causes, which is interesting.

worleyph commented 3 years ago

Just to be clear since this problem is I believe nondeterministic: Multiple runs with preqx never fail?

My apologies ... I tested this before getting the deterministic error case (for me), using 64x4. I'll verify this.

worleyph commented 3 years ago

64x4 decomposition of

  ./create_newcase --case XXX -compset FC5AV1C-L -res ne11_ne11 -mach cori-knl -compiler intel -project e3sm

  ./xmlchange CAM_TARGET=theta-l
  ./xmlchange DEBUG=true

  cat <<EOF >> user_nl_eam                                                                                                                                                                  
    semi_lagrange_cdr_alg = 3                                                                                                                                                               
    theta_hydrostatic_mode = .true.                                                                                                                                                         
    tstep_type = 5                                                                                                                                                                          
    state_debug_checks = .true.                                                                                                                                                             
  EOF                                                                                                                                                                                       

fails for me every time. As I change the definition of DEBUG, the particulars change, but it always fails.

I've run without

  ./xmlchange CAM_TARGET=theta-l

and without the changes to user_nl_eam twice now, both times successfully (at least until time ran out, around 8 timesteps in 15 minutes). The failures occurred in the first timestep (after the output of 'nstep, te 0' but before any other timestep output).