E3SM-Project / scream

Fork of E3SM used to develop an exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Seg fault -- invalid memory reference with SMS ne4pg2_ne4pg2.F2010-SCREAM-LR tests using GNU v9 and threads #1317

Open ndkeen opened 2 years ago

ndkeen commented 2 years ago

Using Perlmutter (CPU nodes only), I have been hitting some runtime errors. I can work around them by using only 1 thread. Note that PM only has GNU v9 and higher. Using the scream repo from Nov 16th.

Here are the 2 types of errors:

0:  nstep, te        8   0.25847765579133606E+10   0.25847736088010263E+10  -0.81554198863645263E-04   0.98511186977129837E+05
0:  nstep, te        9   0.25847977836494746E+10   0.25847941738245683E+10  -0.99825133346045537E-04   0.98511470124235551E+05
0:
0: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
0:
0: Backtrace for this error:
0: #0  0x7f468703f49f in ???
0: #1  0x16b38df in __micro_p3_interface_MOD_micro_p3_tend
0:      at /pscratch/sd/n/ndk/wacmy/crterai_ne256_working/components/eam/src/physics/cam/micro_p3_interface.F90:1311
0: #2  0xc4f909 in __microp_driver_MOD_microp_driver_tend
0:      at /pscratch/sd/n/ndk/wacmy/crterai_ne256_working/components/eam/src/physics/cam/microp_driver.F90:209
0: #3  0xdcea05 in tphysbc
0:      at /pscratch/sd/n/ndk/wacmy/crterai_ne256_working/components/eam/src/physics/cam/physpkg.F90:2608
0: #4  0xdef7f8 in __physpkg_MOD_phys_run1._omp_fn.0
0:      at /pscratch/sd/n/ndk/wacmy/crterai_ne256_working/components/eam/src/physics/cam/physpkg.F90:1092
0: #5  0x7f468783b3a5 in gomp_thread_start
0:      at ../../../cray-gcc-9.3.0-202103112153.1accb32b2394c/libgomp/team.c:123
0: #6  0x7f4687d964f8 in ???
0: #7  0x7f4687101ece in ???
0: #8  0xffffffffffffffff in ???
srun: error: nid003964: task 0: Segmentation fault
3: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
13: 
13: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
13: 
13: Backtrace for this error:
13: #0  0x7f8374f8949f in ???
13: #1  0x93c797 in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
13:     at /pscratch/sd/n/ndk/wacmy/crterai_ne256_working/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:280
13: #2  0x881eba in __aero_model_MOD_aero_model_wetdep
13:     at /pscratch/sd/n/ndk/wacmy/crterai_ne256_working/components/eam/src/chemistry/bulk_aero/aero_model.F90:632
13: #3  0xdd2786 in tphysbc
13:     at /pscratch/sd/n/ndk/wacmy/crterai_ne256_working/components/eam/src/physics/cam/physpkg.F90:2673
13: #4  0xdef7f8 in __physpkg_MOD_phys_run1._omp_fn.0
13:     at /pscratch/sd/n/ndk/wacmy/crterai_ne256_working/components/eam/src/physics/cam/physpkg.F90:1092
13: #5  0x7f83757853a5 in gomp_thread_start
13:     at ../../../cray-gcc-9.3.0-202103112153.1accb32b2394c/libgomp/team.c:123

These are the tests that have failed so far with one of the two errors above:

SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P1x16.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P1x4.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P32x4.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P8x2.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_P1x8.ne4pg2_ne4pg2.F2010-SCREAM-LR

These tests do not fail and complete 5 days:

SMS_D_P16x2.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P1x1.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P1x2.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P1x8.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P2x1.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P2x16.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P4x2.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P64x1.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR

The SMS tests do fail with MEMLEAK, which may or may not be an issue (https://github.com/E3SM-Project/scream/issues/1318).

ndkeen commented 2 years ago

Using a scream checkout of Jan 27th, I can verify that at least the following tests still have the same issue:

SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P32x4.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu

10: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
10: 
10: Backtrace for this error:
10: #0  0x7f96204e249f in ???
10: #1  0x93d613 in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
10:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:280
10: #2  0x882d36 in __aero_model_MOD_aero_model_wetdep
10:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/chemistry/bulk_aero/aero_model.F90:632
10: #3  0xd1e712 in tphysbc
10:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/physics/cam/physpkg.F90:2678
10: #4  0xd3b7b1 in __physpkg_MOD_phys_run1._omp_fn.0
10:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/physics/cam/physpkg.F90:1092
10: #5  0x7f9620ce0a55 in gomp_thread_start
10:     at ../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libgomp/team.c:125

And trying with rrtmgpxx, same issue: SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx

But some of the tests above that were passing now fail with:

SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu
SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx
SMS_D_P16x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu

24: *** Error in `/pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s08-jan27/SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx.20220201_113350_u7jy6p/bld/e3sm.exe': corrupted size vs. prev_size: 0x00000000123a0e00 ***
24:
24: Program received signal SIGABRT: Process abort signal.
24:
24: Backtrace for this error:
24: #0  0x7fd3a4da449f in ???
24: #1  0x7fd3a4da4420 in ???
24: #2  0x7fd3a4da5a00 in ???
24: #3  0x7fd3a4de7876 in ???
24: #4  0x7fd3a4dee092 in ???
24: #5  0x7fd3a4dee571 in ???
24: #6  0x7fd3a4df14cc in ???
24: #7  0x7fd3a4df2ef6 in ???
24: #8  0x24fb576 in phasechange_beta
24:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/elm/src/biogeophys/SoilTemperatureMod.F90:1300
24: #9  0x252e6e7 in __soiltemperaturemod_MOD_soiltemperature
24:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/elm/src/biogeophys/SoilTemperatureMod.F90:634
24: #10  0x1be0ddc in __elm_driver_MOD_elm_drv._omp_fn.4
24:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/elm/src/main/elm_driver.F90:1311
24: #11  0x7fd3a559a295 in GOMP_parallel
24:     at ../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libgomp/parallel.c:178
24: #12  0x1bd9c64 in __elm_driver_MOD_elm_drv
24:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/elm/src/main/elm_driver.F90:1311
24: #13  0x1baf17b in __lnd_comp_mct_MOD_lnd_run_mct
24:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/elm/src/cpl/lnd_comp_mct.F90:512
24: #14  0x44024b in __component_mod_MOD_component_run
24:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/driver-mct/main/component_mod.F90:728
24: #15  0x42446f in __cime_comp_mod_MOD_cime_run
24:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/driver-mct/main/cime_comp_mod.F90:2881
24: #16  0x43d904 in cime_driver
24:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/driver-mct/main/cime_driver.F90:153
24: #17  0x43d96b in main
24:     at /pscratch/sd/n/ndk/wacmy/s08-jan27/driver-mct/main/cime_driver.F90:23

A few other tests complete but hit the MEMLEAK tolerance. We know there is memory growth with rrtmgp and GNU v9, but at least no failures. These 2 tests pass:

SMS_P1x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx
SMS_D_P8x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx

Note I also tried using OMP_STACKSIZE=256M (the default is 128M) and see the same issue for at least one of these cases.

ndkeen commented 2 years ago

When I try on cori-knl with GNU (version 8 or version 9), I see some problems as well. The error is different enough to merit a separate issue, which I created: https://github.com/E3SM-Project/scream/issues/1393. I see the same errors with the following tests:

SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9
SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9.eam-rrtmgpxx
SMS_D_P32x4.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9
SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9
SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9.eam-rrtmgpxx
 8: At line 60 of file /global/cscratch1/sd/ndk/wacmy/s46-feb1/components/eam/src/physics/cam/physics_utils.F90
 8: Fortran runtime error: Dimension 1 of array 'drymmr' has extent 3 instead of 4
 8: 

The following test completes 5 days (though still shows memory growth). Because this test does not hit the above error, it makes me think the error above might not be real (the code looks OK?) and something else is happening:

SMS_D_P8x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9

It's still true that the no-threads cases seem OK, i.e.:

SMS_D_P64x1.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu
SMS_P64x1.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu

And I tried the following standard E3SM ne4 tests, which were also OK:

SMS_D_P64x2.ne4pg2_ne4pg2.F2010-CICE.cori-knl_gnu
SMS_P64x1.ne4pg2_ne4pg2.F2010-CICE.cori-knl_gnu
SMS_P64x2.ne4pg2_ne4pg2.F2010-CICE.cori-knl_gnu

PeterCaldwell commented 2 years ago

Looking at the lines of code the error messages pertain to... I don't see how anything could be wrong. The last error you mention is in code @singhbalwinder just added, though, so it could be related to that. Line 60 of physics_utils.F90 on master as of 2/1/22 looks to me to be "end function calculate_drymmr_from_wetmmr" rather than a computation of drymmr. The drymmr calculation on line 58 seems correct, though...

singhbalwinder commented 2 years ago

@noel: Is the runtime error Dimension 1 of array 'drymmr' has extent 3 instead of 4 reproducible with a specific PE layout? I don't see anything wrong just by reading the code. I am guessing it may be some corner case where ncols is being treated inconsistently. If it is reproducible, we may catch these corner cases by printing out the size of the drymmr array and ncols, and fix the code accordingly.
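
A minimal sketch of that kind of printout (a hypothetical helper, not actual SCREAM code), to be called just before the assignment that GNU flags:

   ! Hypothetical debug helper: dump the extents involved so a chunk where
   ! ncol /= pcols shows up in the log before the runtime check aborts.
   subroutine debug_drymmr_extents(qv_dry, ncol, pcols)
      implicit none
      real,    intent(in) :: qv_dry(:,:)
      integer, intent(in) :: ncol, pcols
      print *, 'qv_dry extents:', size(qv_dry,1), size(qv_dry,2), &
               ' ncol =', ncol, ' pcols =', pcols
   end subroutine debug_drymmr_extents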

ambrad commented 2 years ago

@singhbalwinder is it possible that the issue is that in, e.g., this line:

qv_dry      = calculate_drymmr_from_wetmmr(ncol, pver, qv_wet_in,              qv_wet_in)

qv_dry is declared as qv_dry(pcols,pver) but calculate_drymmr_from_wetmmr returns an automatic array of size (ncol,pver), with ncol /= pcols for the final chunk (for example)?
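
If that's the mechanism, a tiny standalone program can reproduce the pattern (a sketch with hypothetical names, not SCREAM source). Built with gfortran -fcheck=bounds, the nonconforming assignment should trip a runtime shape check much like the one in the log above:

   ! Sketch of the suspected bug: a pure function returns an automatic
   ! (ncol,nlev) result that is assigned to an array declared with the
   ! larger pcols extent, so the shapes do not conform when ncol < pcols.
   program shape_mismatch
      implicit none
      integer, parameter :: pcols = 4, nlev = 2
      integer :: ncol
      real :: qv_wet(pcols, nlev), qv_dry(pcols, nlev)

      qv_wet = 1.0
      ncol   = 3   ! e.g., the final chunk has fewer active columns
      qv_dry = drymmr_from_wetmmr(ncol, nlev, qv_wet(1:ncol,:))   ! LHS (4,2) vs RHS (3,2): nonconforming
      print *, qv_dry

   contains

      pure function drymmr_from_wetmmr(ncol, nlev, wetmmr) result(drymmr)
         integer, intent(in) :: ncol, nlev
         real,    intent(in) :: wetmmr(ncol, nlev)
         real                :: drymmr(ncol, nlev)   ! automatic result sized by ncol, not pcols
         drymmr = wetmmr   ! stand-in for the actual wet->dry conversion
      end function drymmr_from_wetmmr

   end program shape_mismatch

Restricting the left-hand side to the active columns, qv_dry(1:ncol,:) = ..., makes both sides conform, which matches the fix proposed below.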

ndkeen commented 2 years ago

Nice catch, Andrew. Is it possible that, with the function being pure, the line number in the error message isn't super helpful?

ambrad commented 2 years ago

I think the line number is the function's end, suggesting the error occurs in the handoff to the calling function, which is consistent with my guess above.

singhbalwinder commented 2 years ago

Thanks @ambrad. That may be it. Given that ncols <= pcols, I would have assumed the existing code would be okay. Perhaps it breaks the Fortran language standard, as GNU is complaining about it. I only tested it with the Intel compiler.

I think one way to fix this would be:

qv_dry(1:ncols,:)      = calculate_drymmr_from_wetmmr(ncol, pver, qv_wet_in,              qv_wet_in)

ambrad commented 2 years ago

I agree that's the fix.

ndkeen commented 2 years ago

With Intel, this test fails: SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_intel. I created a separate issue here: https://github.com/E3SM-Project/scream/issues/1394

There are almost 7000 P3 warnings.

59: forrtl: error (76): Abort trap signal
59: Image              PC                Routine            Line        Source
59: e3sm.exe           000000000A95EC24  Unknown               Unknown  Unknown
59: e3sm.exe           000000000A59AAB0  Unknown               Unknown  Unknown
59: e3sm.exe           000000000AA24130  Unknown               Unknown  Unknown
59: e3sm.exe           000000000AE514B1  Unknown               Unknown  Unknown
59: e3sm.exe           000000000A6CF922  Unknown               Unknown  Unknown
59: e3sm.exe           000000000A69D47B  Unknown               Unknown  Unknown
59: e3sm.exe           000000000A6A41D5  Unknown               Unknown  Unknown
59: e3sm.exe           0000000003C701D2  scream_abortutils          44  scream_abortutils.F90
59: e3sm.exe           0000000002F58873  shoc_mp_shoc_assu        2402  shoc.F90
59: e3sm.exe           0000000002F2DE66  shoc_mp_shoc_main         529  shoc.F90
59: e3sm.exe           000000000198ABD1  shoc_intr_mp_shoc         854  shoc_intr.F90
59: e3sm.exe           000000000187B966  physpkg_mp_tphysb        2531  physpkg.F90
59: e3sm.exe           00000000018483DF  physpkg_mp_phys_r        1085  physpkg.F90
59: e3sm.exe           000000000A4A1013  Unknown               Unknown  Unknown
59: e3sm.exe           000000000A449BAA  Unknown               Unknown  Unknown
59: e3sm.exe           000000000A44B456  Unknown               Unknown  Unknown
59: e3sm.exe           000000000A416BA5  Unknown               Unknown  Unknown
59: e3sm.exe           00000000018467CF  physpkg_mp_phys_r        1070  physpkg.F90
59: e3sm.exe           0000000000896C7C  cam_comp_mp_cam_r         258  cam_comp.F90
59: e3sm.exe           000000000085082E  atm_comp_mct_mp_a         410  atm_comp_mct.F90
59: e3sm.exe           000000000046BD1A  component_mod_mp_         257  component_mod.F90
59: e3sm.exe           0000000000424A2D  cime_comp_mod_mp_        2291  cime_comp_mod.F90
ambrad commented 2 years ago

Note that the stack trace above points to an endrun call in shoc.F90 at this point:

      ! Check to ensure Tl1_1 and Tl1_2 are not negative. endrun otherwise
      if (Tl1_1 .le. 0._rtype) then
         write(err_msg,*)'ERROR: Tl1_1 is .le. 0 before shoc_assumed_pdf_compute_qs in shoc. Tl1_1 is:',Tl1_1
         call endscreamrun(err_msg)
      endif

So this is a case we've seen a number of times before: P3 warnings about physical quantities followed by negative T in shoc.

PeterCaldwell commented 2 years ago

@singhbalwinder and @ndkeen - have you two tried the qv_dry fix that AMB suggested and checked whether it fixes the problem? It seems like that's definitely a bug that could easily explain our problems, so we should check the impact of fixing it ASAP, before trying more runs with other compilers etc...

ndkeen commented 2 years ago

Balwinder noted that he had only tried with the Intel compiler. So I was pointing out that running DEBUG Intel also shows an issue -- i.e., it's not a GNU-specific problem.

singhbalwinder commented 2 years ago

I will issue a PR soon to fix the bug identified by @ambrad .

PeterCaldwell commented 2 years ago

Awesome, thanks Balwinder! I'm a bit surprised our tests didn't catch this. Do any of you understand why?

ndkeen commented 2 years ago

As it looks like I've hit 2 different issues here, I created another one here https://github.com/E3SM-Project/scream/issues/1393

The original issue where I see an error with GNU v9 on perlmutter, is still present.

singhbalwinder commented 2 years ago

The bug concerning qv_dry (ncols .ne. pcols) can be compiler specific. I have seen in the past that GNU compilers are generally pretty strict about array lengths; at that time it was about character arrays. Intel is quite forgiving in that respect. Maybe we don't have tests with the GNU compiler, or with this particular GNU version. I think the compilers that are not strict about it would interpret the code correctly.

ndkeen commented 2 years ago

Are you suggesting the fail I see with Intel DEBUG is then unrelated to the Fortran runtime error: Dimension 1 of array 'drymmr' has extent 3 instead of 4? That's possible. I made 2 new issues to describe them.

Note the change suggested above does not seem to have fixed the error which I noted in the other issue.

singhbalwinder commented 2 years ago

@ndk: I think these two issues may be unrelated but I am not 100% sure. I have created a branch: "fix-wet-dry-ncol-pcols", which fixes the qv_dry bug. I have tested it with the intel compiler and the SMS_D.ne4pg2_ne4pg2.F2010-SCREAM-LR.compy_intel test passes on Compy. Would you please check to see if it fixes it for the GNU compiler?

ndkeen commented 2 years ago

After making changes from the branch fix-wet-dry-ncol-pcols, the 2 most recent fails reported above with GNU DEBUG on cori-knl (Fortran runtime error: Dimension 1 of array 'drymmr' has extent 3 instead of 4) and the error with Intel DEBUG are now corrected.

However, I do still see the original error -- even on cori-knl with GNU (i.e., not just Perlmutter and not just GNU v9):

SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu

10: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
10: 
10: Backtrace for this error:
10: #0  0x42f408f in ???
10:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
10: #1  0x981bdd in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
10:     at /global/cscratch1/sd/ndk/wacmy/s46-feb1/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:280
10: #2  0x8c1b28 in __aero_model_MOD_aero_model_wetdep
10:     at /global/cscratch1/sd/ndk/wacmy/s46-feb1/components/eam/src/chemistry/bulk_aero/aero_model.F90:632
10: #3  0xd8c5fa in tphysbc
10:     at /global/cscratch1/sd/ndk/wacmy/s46-feb1/components/eam/src/physics/cam/physpkg.F90:2678
10: #4  0xda9e84 in __physpkg_MOD_phys_run1._omp_fn.2
10:     at /global/cscratch1/sd/ndk/wacmy/s46-feb1/components/eam/src/physics/cam/physpkg.F90:1092
10: #5  0x78e244d in gomp_thread_start
10:     at ../../../cray-gcc-8.3.0-201903122028.16ea96cb84a9a/libgomp/team.c:120
10: #6  0x42ef088 in start_thread
10:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/pthread_create.c:465

I also still see the same issues on PM.

Note that, as posted above, not all threaded GNU cases are failing, suggesting there may be a sporadic threading issue. For example, SMS_D_P8x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx completes.

ndk commented 2 years ago

@ndk: I think these two issues may be unrelated but I am not 100% sure. I have created a branch: "fix-wet-dry-ncol-pcols", which fixes the qv_dry bug. I have tested it with the intel compiler and the SMS_D.ne4pg2_ne4pg2.F2010-SCREAM-LR.compy_intel test passes on Compy. Would you please check to see if it fixes it for the GNU compiler?

I'm not @ndkeen :)

ambrad commented 2 years ago

I looked at the modal_aero_wateruptake.F90 code. One thing I noticed, but almost certainly unrelated, is that in the allocation phase, one sees

   !$OMP PARALLEL
   allocate(maer(pcols,pver,nmodes),stat=istat)
   if (istat .ne. 0) call endrun("Unable to allocate maer:       "//errmsg(__FILE__,__LINE__) )
   allocate(hygro(pcols,pver,nmodes),   stat=istat)
   if (istat .ne. 0) call endrun("Unable to allocate hygro:      "//errmsg(__FILE__,__LINE__) )

The allocations look correct, as the module arrays are declared as threadprivate. But istat needs to be declared as private(istat) in the !$OMP PARALLEL line to be fully correct. That said, istat is almost certainly always 0.
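
For illustration, here is the snippet above with the directive adjusted as described (only the !$OMP line changes; a sketch, not the full routine):

   ! istat is shared by default inside the parallel region, so concurrent
   ! writes to it are a race; making it private gives each thread its own
   ! allocation status flag.
   !$OMP PARALLEL PRIVATE(istat)
   allocate(maer(pcols,pver,nmodes),stat=istat)
   if (istat .ne. 0) call endrun("Unable to allocate maer:       "//errmsg(__FILE__,__LINE__) )
   allocate(hygro(pcols,pver,nmodes),   stat=istat)
   if (istat .ne. 0) call endrun("Unable to allocate hygro:      "//errmsg(__FILE__,__LINE__) )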

PeterCaldwell commented 2 years ago

I'm a bit surprised we're calling modal aerosol code. I thought we were using SPA instead. I guess we're using MAM because this is a test, is v0, and is LR? Can we run with SPA instead? If the problem really is with MAM, the SPA run should complete without problems...

ndkeen commented 2 years ago

Wasn't there a recent change to make all "standard tests" use SPA?

singhbalwinder commented 2 years ago

This compset (F2010-SCREAM-LR) is using spa (-chem spa). The code seems to be calling routines for calculating particle size and the change in size due to water uptake. These routines were part of the design for the "prescribed" aerosols (for computing aerosol optics). I am not sure if they are needed for spa. @hassanbeydoun might know more about it.

ndkeen commented 2 years ago

Thanks for checking Balwinder. In the ne4 DEBUG case that is failing, I do see: CAM_CONFIG_OPTS: -mach perlmutter -phys default -phys default -shoc_sgs -microphys p3 -chem spa -nlev 72 -rad rrtmgp -bc_dep_to_snow_updates -cppdefs '-DSCREAM' -rad rrtmgp -rrtmgpxx

I also tried a ne30 DEBUG case on PM to see if there was a similar problem. I tried: SMS_D_P512x2.ne30pg2_ne30pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx

which also has: CAM_CONFIG_OPTS: -mach perlmutter -phys default -phys default -shoc_sgs -microphys p3 -chem spa -nlev 72 -rad rrtmgp -bc_dep_to_snow_updates -cppdefs '-DSCREAM' -rad rrtmgp -rrtmgpxx

and failed with a different error:

/pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s08-jan27/SMS_D_P512x2.ne30pg2_ne30pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx.20220202_194435_nat2my

256:  ERROR:  WARNING: radiation_tend aer_ssa_bnd_sw:          391  values above threshold ; max =    1742899.5496155308
256:  ERROR:  WARNING: radiation_tend aer_ssa_bnd_sw:          359  values above threshold ; max =    1040325.2202597444
256: #0  0x38b230e in __shr_abort_mod_MOD_shr_abort_backtrace
256:    at /pscratch/sd/n/ndk/wacmy/s08-jan27/share/util/shr_abort_mod.F90:104
256: #1  0x38b24c0 in __shr_abort_mod_MOD_shr_abort_abort
256:    at /pscratch/sd/n/ndk/wacmy/s08-jan27/share/util/shr_abort_mod.F90:61
256: #2  0x6fbfa3 in __cam_abortutils_MOD_endrun
256:    at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/utils/cam_abortutils.F90:59
256: #3  0xa04b83 in __radiation_utils_MOD_handle_error
256:    at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/physics/rrtmgp/radiation_utils.F90:563
256: #4  0x9ef6eb in __radiation_MOD_radiation_tend
256:    at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/physics/rrtmgp/radiation.F90:1396
256: #5  0xd1e9eb in tphysbc
256:    at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/physics/cam/physpkg.F90:2734
256: #6  0xd3b7b1 in __physpkg_MOD_phys_run1._omp_fn.0

Andrew: I also tried just commenting out the OMP PARALLEL directives around that section (why even do allocations in parallel?) and got the same error.

ndkeen commented 2 years ago

@brhillman just hit an error on pm-cpu trying to run scream v0 at ne1024. The error message looked familiar, and I found it here. I was going to try again on pm-cpu, but it's having some issues.

I just checked out scream on chrysalis and tried with GNU (v9.3). Same error.

SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.chrysalis_gnu

 0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
 0: 
 0: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
 0: 
 0: Backtrace for this error:
 0: #0  0x15554ed4087f in ???
 0: #1  0x971631 in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
 0:     at /lcrc/group/e3sm/ac.ndkeen/wacmy/s05-may18/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:280
 0: #2  0x8b6659 in __aero_model_MOD_aero_model_wetdep
 0:     at /lcrc/group/e3sm/ac.ndkeen/wacmy/s05-may18/components/eam/src/chemistry/bulk_aero/aero_model.F90:632
 0: #3  0xd6d808 in tphysbc
 0:     at /lcrc/group/e3sm/ac.ndkeen/wacmy/s05-may18/components/eam/src/physics/cam/physpkg.F90:2721
 0: #4  0xd8c078 in __physpkg_MOD_phys_run1._omp_fn.0
 0:     at /lcrc/group/e3sm/ac.ndkeen/wacmy/s05-may18/components/eam/src/physics/cam/physpkg.F90:1096
 0: #5  0x15554f546715 in gomp_thread_start
 0:     at /tmp/svcbuilder/spack-stage-gcc-9.2.0-ugetvbp5jl5kgy7jwjloyf73vnhhw7db/spack-src/libgomp/team.c:123

Tests do work with Intel.

PeterCaldwell commented 2 years ago

Why are we running modal_aero anything anyways? Doesn’t spa bypass that stuff?

ndkeen commented 2 years ago

Ben asked the same question. So maybe the issue is that SPA is somehow not getting used here?

hassanbeydoun commented 2 years ago

Why are we running modal_aero anything anyways? Doesn’t spa bypass that stuff?

Unfortunately SPA doesn't bypass all MAM calculations in v0. At some point we decided it wasn't worth doing all the bypassing for v0 with v1 on the horizon.

ndkeen commented 2 years ago

I ran into this error on pm-cpu with an RRM case, and verified the fail above is still repeatable with the July 5th scream repo.

bartgol commented 2 years ago

I don't think Hommexx supports RRM grids. Currently, for each 2d element, the C++ implementation of the halo exchange assumes that there is at most 1 element that shares only a corner with it. That's clearly not true for RRM grids.

Adding support for RRM grids is on our todo list.

ambrad commented 2 years ago

@bartgol this is v0 (EAM), which does support RRM.

bartgol commented 2 years ago

Whoops, sorry.

ndkeen commented 2 years ago

I see the same error with scream master of Sep 15th: SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.chrysalis_gnu

ndkeen commented 1 year ago

Same issue with the Oct 28th scream repo on pm-cpu: SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.pm-cpu_gnu

I think this is an example where we have problems (that might not show up in the same places) when using more total threads than elements.
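
For reference, an ne4 cubed-sphere mesh has 6 x 4^2 = 96 spectral elements, so a layout like SMS_D_P16x8 (16 ranks x 8 threads = 128 total threads) uses more threads than elements, while the passing P64x1 layout (64 threads total) does not.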