ndkeen opened 2 years ago
Using a scream checkout of Jan 27th, I can verify that at least the following tests still have the same issue:
SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR
SMS_D_P32x4.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu
10: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
10:
10: Backtrace for this error:
10: #0 0x7f96204e249f in ???
10: #1 0x93d613 in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
10: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:280
10: #2 0x882d36 in __aero_model_MOD_aero_model_wetdep
10: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/chemistry/bulk_aero/aero_model.F90:632
10: #3 0xd1e712 in tphysbc
10: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/physics/cam/physpkg.F90:2678
10: #4 0xd3b7b1 in __physpkg_MOD_phys_run1._omp_fn.0
10: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/physics/cam/physpkg.F90:1092
10: #5 0x7f9620ce0a55 in gomp_thread_start
10: at ../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libgomp/team.c:125
And trying with rrtmgpxx, same issue:
SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx
But some of the above tests that were passing now fail with:
SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu
SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx
SMS_D_P16x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu
24: *** Error in `/pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s08-jan27/SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx.20220201_113350_u7jy6p/bld/e3sm.exe': corrupted size vs. prev_size: 0x00000000123a0e00 ***
24:
24: Program received signal SIGABRT: Process abort signal.
24:
24: Backtrace for this error:
24: #0 0x7fd3a4da449f in ???
24: #1 0x7fd3a4da4420 in ???
24: #2 0x7fd3a4da5a00 in ???
24: #3 0x7fd3a4de7876 in ???
24: #4 0x7fd3a4dee092 in ???
24: #5 0x7fd3a4dee571 in ???
24: #6 0x7fd3a4df14cc in ???
24: #7 0x7fd3a4df2ef6 in ???
24: #8 0x24fb576 in phasechange_beta
24: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/elm/src/biogeophys/SoilTemperatureMod.F90:1300
24: #9 0x252e6e7 in __soiltemperaturemod_MOD_soiltemperature
24: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/elm/src/biogeophys/SoilTemperatureMod.F90:634
24: #10 0x1be0ddc in __elm_driver_MOD_elm_drv._omp_fn.4
24: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/elm/src/main/elm_driver.F90:1311
24: #11 0x7fd3a559a295 in GOMP_parallel
24: at ../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libgomp/parallel.c:178
24: #12 0x1bd9c64 in __elm_driver_MOD_elm_drv
24: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/elm/src/main/elm_driver.F90:1311
24: #13 0x1baf17b in __lnd_comp_mct_MOD_lnd_run_mct
24: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/elm/src/cpl/lnd_comp_mct.F90:512
24: #14 0x44024b in __component_mod_MOD_component_run
24: at /pscratch/sd/n/ndk/wacmy/s08-jan27/driver-mct/main/component_mod.F90:728
24: #15 0x42446f in __cime_comp_mod_MOD_cime_run
24: at /pscratch/sd/n/ndk/wacmy/s08-jan27/driver-mct/main/cime_comp_mod.F90:2881
24: #16 0x43d904 in cime_driver
24: at /pscratch/sd/n/ndk/wacmy/s08-jan27/driver-mct/main/cime_driver.F90:153
24: #17 0x43d96b in main
24: at /pscratch/sd/n/ndk/wacmy/s08-jan27/driver-mct/main/cime_driver.F90:23
A few other tests complete, but hit the MEMLEAK tolerance. We know there is mem growth with rrtmgp and gnu v9, but at least no fails. These 2 tests pass:
SMS_P1x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx
SMS_D_P8x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx
Note I also tried using OMP_STACKSIZE=256M (default is 128M) and saw the same issue for at least one of these cases.
When I try on cori-knl with GNU (version 8 or version 9), I see some problems as well. The error is different and could warrant a separate issue (which I created: https://github.com/E3SM-Project/scream/issues/1393). I see the same errors with the following tests:
SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9
SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9.eam-rrtmgpxx
SMS_D_P32x4.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9
SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9
SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9.eam-rrtmgpxx
8: At line 60 of file /global/cscratch1/sd/ndk/wacmy/s46-feb1/components/eam/src/physics/cam/physics_utils.F90
8: Fortran runtime error: Dimension 1 of array 'drymmr' has extent 3 instead of 4
8:
The following test completes 5 days (though still shows mem growth). Because this test does not hit the above error, it makes me think the error above might not be real (the code looks OK?) and something else is happening.
SMS_D_P8x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu9
It's still true that no-threads seems OK, i.e.:
SMS_D_P64x1.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu
SMS_P64x1.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu
And I tried the following standard e3sm ne4 tests that were also OK:
SMS_D_P64x2.ne4pg2_ne4pg2.F2010-CICE.cori-knl_gnu
SMS_P64x1.ne4pg2_ne4pg2.F2010-CICE.cori-knl_gnu
SMS_P64x2.ne4pg2_ne4pg2.F2010-CICE.cori-knl_gnu
Looking at the lines of code the error messages pertain to, I don't see how anything could be wrong. The last error you mention is in code @singhbalwinder just added, though, so it could be related to that. Line 60 of physics_utils.F90 on master as of 2/1/22 looks to me to be "end function calculate_drymmr_from_wetmmr" rather than a computation of drymmr. The drymmr calculation on line 58 seems correct, though.
@noel: Is the runtime error "Dimension 1 of array 'drymmr' has extent 3 instead of 4" reproducible with a specific PE layout? I don't see anything wrong with the code just by reading it. I am guessing it may be some corner case where ncols is being treated inconsistently. If it is reproducible, we may catch these corner cases by printing out the size of the drymmr array and ncols, and fix the code accordingly.
@singhbalwinder is it possible that the issue is that in, e.g., this line:
qv_dry = calculate_drymmr_from_wetmmr(ncol, pver, qv_wet_in, qv_wet_in)
qv_dry is declared as qv_dry(pcols,pver), but calculate_drymmr_from_wetmmr returns an automatic array of size (ncol,pver), with ncol /= pcols for the final chunk (for example)?
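If that guess is right, the mismatch can be reproduced with a minimal sketch (names and shapes here are hypothetical, simplified from the actual SCREAM routine): the function result is an automatic array of shape (ncol,pver), and assigning it to the full qv_dry(pcols,pver) is nonconforming when ncol /= pcols, which gfortran flags at runtime under -fcheck=bounds.

```fortran
! Sketch of the suspected shape mismatch (hypothetical names, not the
! actual SCREAM code). Compile with: gfortran -fcheck=bounds demo.f90
module wetdry_sketch
  implicit none
  integer, parameter :: pcols = 4, pver = 2
contains
  pure function calc_dry(ncol, nlev, wet) result(dry)
    integer, intent(in) :: ncol, nlev
    real, intent(in)    :: wet(:,:)
    real                :: dry(ncol, nlev)   ! automatic result, extent ncol
    dry = wet(1:ncol, 1:nlev)
  end function calc_dry
end module wetdry_sketch

program demo
  use wetdry_sketch
  implicit none
  real    :: qv_wet(pcols, pver), qv_dry(pcols, pver)
  integer :: ncol
  ncol = 3                 ! final chunk: ncol < pcols
  qv_wet = 1.0
  ! Nonconforming when ncol /= pcols; with bounds checking this is what
  ! produces "Dimension 1 of array ... has extent 3 instead of 4":
  ! qv_dry = calc_dry(ncol, pver, qv_wet)
  ! Conforming version, matching the fix suggested in this thread:
  qv_dry(1:ncol, :) = calc_dry(ncol, pver, qv_wet)
end program demo
```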
Nice catch, Andrew. Is it possible that, with the function being pure, the line number in the error message isn't super helpful?
I think the line number is the function's end, suggesting that it's in the hand off to the calling function that the error has occurred, which is consistent with my guess above.
Thanks @ambrad. That may be it. Given that ncols <= pcols, I would have assumed the existing code would be okay. Perhaps it breaks the Fortran language standard, and GNU is complaining about it. I only tested it with the Intel compiler.
I think one way to fix this would be:
qv_dry(1:ncols,:) = calculate_drymmr_from_wetmmr(ncol, pver, qv_wet_in, qv_wet_in)
I agree that's the fix.
With Intel, this test fails: SMS_D_P64x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_intel
I created a separate issue here: https://github.com/E3SM-Project/scream/issues/1394
There are almost 7000 P3 warnings.
59: forrtl: error (76): Abort trap signal
59: Image PC Routine Line Source
59: e3sm.exe 000000000A95EC24 Unknown Unknown Unknown
59: e3sm.exe 000000000A59AAB0 Unknown Unknown Unknown
59: e3sm.exe 000000000AA24130 Unknown Unknown Unknown
59: e3sm.exe 000000000AE514B1 Unknown Unknown Unknown
59: e3sm.exe 000000000A6CF922 Unknown Unknown Unknown
59: e3sm.exe 000000000A69D47B Unknown Unknown Unknown
59: e3sm.exe 000000000A6A41D5 Unknown Unknown Unknown
59: e3sm.exe 0000000003C701D2 scream_abortutils 44 scream_abortutils.F90
59: e3sm.exe 0000000002F58873 shoc_mp_shoc_assu 2402 shoc.F90
59: e3sm.exe 0000000002F2DE66 shoc_mp_shoc_main 529 shoc.F90
59: e3sm.exe 000000000198ABD1 shoc_intr_mp_shoc 854 shoc_intr.F90
59: e3sm.exe 000000000187B966 physpkg_mp_tphysb 2531 physpkg.F90
59: e3sm.exe 00000000018483DF physpkg_mp_phys_r 1085 physpkg.F90
59: e3sm.exe 000000000A4A1013 Unknown Unknown Unknown
59: e3sm.exe 000000000A449BAA Unknown Unknown Unknown
59: e3sm.exe 000000000A44B456 Unknown Unknown Unknown
59: e3sm.exe 000000000A416BA5 Unknown Unknown Unknown
59: e3sm.exe 00000000018467CF physpkg_mp_phys_r 1070 physpkg.F90
59: e3sm.exe 0000000000896C7C cam_comp_mp_cam_r 258 cam_comp.F90
59: e3sm.exe 000000000085082E atm_comp_mct_mp_a 410 atm_comp_mct.F90
59: e3sm.exe 000000000046BD1A component_mod_mp_ 257 component_mod.F90
59: e3sm.exe 0000000000424A2D cime_comp_mod_mp_ 2291 cime_comp_mod.F90
Note that the stack trace above points to an endrun call in shoc.F90 at this point:
! Check to ensure Tl1_1 and Tl1_2 are not negative. endrun otherwise
if (Tl1_1 .le. 0._rtype) then
write(err_msg,*)'ERROR: Tl1_1 is .le. 0 before shoc_assumed_pdf_compute_qs in shoc. Tl1_1 is:',Tl1_1
call endscreamrun(err_msg)
endif
So this is a case we've seen a number of times before: P3 warnings about physical quantities followed by negative T in shoc.
@singhbalwinder and @ndkeen - have you two tried the qv_dry fix that AMB suggested and checked whether it fixes the problem? It seems like that's definitely a bug which could easily explain our problems, so we should check the impact of fixing it ASAP and before trying more runs with other compilers etc...
Balwinder noted that he had only tried with the Intel compiler. So I was pointing out that running DEBUG Intel also shows an issue -- i.e., it's not a GNU-specific problem.
I will issue a PR soon to fix the bug identified by @ambrad .
Awesome, thanks Balwinder! I'm a bit surprised our tests didn't catch this. Do any of you understand why?
As it looks like I've hit 2 different issues here, I created another one here https://github.com/E3SM-Project/scream/issues/1393
The original issue where I see an error with GNU v9 on perlmutter, is still present.
The bug concerning qv_dry (ncols .ne. pcols) can be compiler specific. I have seen in the past that GNU compilers are generally pretty strict about array lengths; at that time, it was about character arrays. Intel is quite forgiving in that respect. Maybe we don't have tests with the GNU compiler, or with this particular GNU version. I think the compilers which are not strict about it would interpret the code correctly.
Are you suggesting the fail I see with Intel DEBUG is then unrelated to the Fortran runtime error "Dimension 1 of array 'drymmr' has extent 3 instead of 4"? That's possible. I made 2 new issues to describe them.
Note the change suggested above does not seem to have fixed the error which I noted in the other issue.
@ndk: I think these two issues may be unrelated, but I am not 100% sure. I have created a branch, "fix-wet-dry-ncol-pcols", which fixes the qv_dry bug. I have tested it with the Intel compiler, and the SMS_D.ne4pg2_ne4pg2.F2010-SCREAM-LR.compy_intel test passes on Compy. Would you please check to see if it fixes it for the GNU compiler?
After making changes from the branch fix-wet-dry-ncol-pcols, the 2 most recent fails reported above with GNU DEBUG on cori-knl (Fortran runtime error "Dimension 1 of array 'drymmr' has extent 3 instead of 4") and the error with Intel DEBUG are now corrected.
However, I do still see the original error -- even on cori-knl with GNU (i.e., not just perlmutter and not just gnu v9):
SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.cori-knl_gnu
10: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
10:
10: Backtrace for this error:
10: #0 0x42f408f in ???
10: at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
10: #1 0x981bdd in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
10: at /global/cscratch1/sd/ndk/wacmy/s46-feb1/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:280
10: #2 0x8c1b28 in __aero_model_MOD_aero_model_wetdep
10: at /global/cscratch1/sd/ndk/wacmy/s46-feb1/components/eam/src/chemistry/bulk_aero/aero_model.F90:632
10: #3 0xd8c5fa in tphysbc
10: at /global/cscratch1/sd/ndk/wacmy/s46-feb1/components/eam/src/physics/cam/physpkg.F90:2678
10: #4 0xda9e84 in __physpkg_MOD_phys_run1._omp_fn.2
10: at /global/cscratch1/sd/ndk/wacmy/s46-feb1/components/eam/src/physics/cam/physpkg.F90:1092
10: #5 0x78e244d in gomp_thread_start
10: at ../../../cray-gcc-8.3.0-201903122028.16ea96cb84a9a/libgomp/team.c:120
10: #6 0x42ef088 in start_thread
10: at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/pthread_create.c:465
I also still see same issues on PM.
Note that, as posted above, not all threaded GNU cases are failing, suggesting there may be a sporadic threading issue. For example, SMS_D_P8x2.ne4pg2_ne4pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx completes.
@ndk: I think these two issues may be unrelated but I am not 100% sure. I have created a branch: "fix-wet-dry-ncol-pcols", which fixes the qv_dry bug. I have tested it with the Intel compiler and the SMS_D.ne4pg2_ne4pg2.F2010-SCREAM-LR.compy_intel test passes on Compy. Would you please check to see if it fixes it for the GNU compiler?
I'm not @ndkeen :)
I looked at the modal_aero_wateruptake.F90 code. One thing I noticed, though almost certainly unrelated, is that in the allocation phase one sees:
!$OMP PARALLEL
allocate(maer(pcols,pver,nmodes),stat=istat)
if (istat .ne. 0) call endrun("Unable to allocate maer: "//errmsg(__FILE__,__LINE__) )
allocate(hygro(pcols,pver,nmodes), stat=istat)
if (istat .ne. 0) call endrun("Unable to allocate hygro: "//errmsg(__FILE__,__LINE__) )
The allocations look correct, as the module arrays are declared as threadprivate. But istat needs to be declared as private(istat) in the !$OMP PARALLEL line to be fully correct. That said, istat is almost certainly always 0.
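For reference, a self-contained sketch of the pattern with the private(istat) correction applied (module and routine names are simplified and hypothetical; endrun is replaced with error stop so the sketch stands alone):

```fortran
! Sketch (hypothetical, simplified from modal_aero_wateruptake.F90):
! the module arrays are threadprivate, so per-thread allocation is
! correct, but without private(istat) all threads would race on the
! single shared istat.
module wateruptake_sketch
  implicit none
  real, allocatable :: maer(:,:,:), hygro(:,:,:)
  !$OMP THREADPRIVATE(maer, hygro)
contains
  subroutine alloc_arrays(pcols, pver, nmodes)
    integer, intent(in) :: pcols, pver, nmodes
    integer :: istat
    !$OMP PARALLEL private(istat)
    allocate(maer(pcols,pver,nmodes), stat=istat)
    if (istat /= 0) error stop "Unable to allocate maer"
    allocate(hygro(pcols,pver,nmodes), stat=istat)
    if (istat /= 0) error stop "Unable to allocate hygro"
    !$OMP END PARALLEL
  end subroutine alloc_arrays
end module wateruptake_sketch
```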
I'm a bit surprised we're calling modal aerosol code. I thought we were using SPA instead. I guess we're using MAM because this is a test, is v0, and is LR? Can we run with SPA instead? If the problem really is with MAM, the SPA run should complete without problems...
Wasn't there a recent change to make all "standard tests" use SPA?
This compset (F2010-SCREAM-LR) is using spa (-chem spa). The code seems to be calling routines for calculating particle size and the change in size due to water uptake. These routines were part of the design for the "prescribed" aerosols (for computing aerosol optics). I am not sure if they are needed for spa. @hassanbeydoun might know more about it.
Thanks for checking Balwinder.
In the ne4 DEBUG case that is failing, I do see:
CAM_CONFIG_OPTS: -mach perlmutter -phys default -phys default -shoc_sgs -microphys p3 -chem spa -nlev 72 -rad rrtmgp -bc_dep_to_snow_updates -cppdefs '-DSCREAM' -rad rrtmgp -rrtmgpxx
I also tried a ne30 DEBUG case on PM to see if there was a similar problem.
I tried: SMS_D_P512x2.ne30pg2_ne30pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx
which also has:
CAM_CONFIG_OPTS: -mach perlmutter -phys default -phys default -shoc_sgs -microphys p3 -chem spa -nlev 72 -rad rrtmgp -bc_dep_to_snow_updates -cppdefs '-DSCREAM' -rad rrtmgp -rrtmgpxx
and failed with a different error:
/pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s08-jan27/SMS_D_P512x2.ne30pg2_ne30pg2.F2010-SCREAM-LR.perlmutter_gnu.eam-rrtmgpxx.20220202_194435_nat2my
256: ERROR: WARNING: radiation_tend aer_ssa_bnd_sw: 391 values above threshold ; max = 1742899.5496155308
256: ERROR: WARNING: radiation_tend aer_ssa_bnd_sw: 359 values above threshold ; max = 1040325.2202597444
256: #0 0x38b230e in __shr_abort_mod_MOD_shr_abort_backtrace
256: at /pscratch/sd/n/ndk/wacmy/s08-jan27/share/util/shr_abort_mod.F90:104
256: #1 0x38b24c0 in __shr_abort_mod_MOD_shr_abort_abort
256: at /pscratch/sd/n/ndk/wacmy/s08-jan27/share/util/shr_abort_mod.F90:61
256: #2 0x6fbfa3 in __cam_abortutils_MOD_endrun
256: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/utils/cam_abortutils.F90:59
256: #3 0xa04b83 in __radiation_utils_MOD_handle_error
256: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/physics/rrtmgp/radiation_utils.F90:563
256: #4 0x9ef6eb in __radiation_MOD_radiation_tend
256: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/physics/rrtmgp/radiation.F90:1396
256: #5 0xd1e9eb in tphysbc
256: at /pscratch/sd/n/ndk/wacmy/s08-jan27/components/eam/src/physics/cam/physpkg.F90:2734
256: #6 0xd3b7b1 in __physpkg_MOD_phys_run1._omp_fn.0
Andrew: I also tried just commenting out the OMP PARALLEL pragmas around that section (why even do allocations in parallel?) and got the same error.
@brhillman just hit an error on pm-cpu trying to run scream v0 at ne1024. The error message looked familiar and I found it here. I was going to try again on pm-cpu, but it's having some issues.
I just checked out scream on chrysalis and tried with GNU (v9.3). Same error.
SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.chrysalis_gnu
0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
0:
0: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
0:
0: Backtrace for this error:
0: #0 0x15554ed4087f in ???
0: #1 0x971631 in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
0: at /lcrc/group/e3sm/ac.ndkeen/wacmy/s05-may18/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:280
0: #2 0x8b6659 in __aero_model_MOD_aero_model_wetdep
0: at /lcrc/group/e3sm/ac.ndkeen/wacmy/s05-may18/components/eam/src/chemistry/bulk_aero/aero_model.F90:632
0: #3 0xd6d808 in tphysbc
0: at /lcrc/group/e3sm/ac.ndkeen/wacmy/s05-may18/components/eam/src/physics/cam/physpkg.F90:2721
0: #4 0xd8c078 in __physpkg_MOD_phys_run1._omp_fn.0
0: at /lcrc/group/e3sm/ac.ndkeen/wacmy/s05-may18/components/eam/src/physics/cam/physpkg.F90:1096
0: #5 0x15554f546715 in gomp_thread_start
0: at /tmp/svcbuilder/spack-stage-gcc-9.2.0-ugetvbp5jl5kgy7jwjloyf73vnhhw7db/spack-src/libgomp/team.c:123
Tests do work with Intel
Why are we running modal_aero anything anyways? Doesn’t spa bypass that stuff?
Ben asked the same question. So maybe the issue is that SPA is somehow not getting used here?
Why are we running modal_aero anything anyways? Doesn’t spa bypass that stuff?
Unfortunately SPA doesn't bypass all MAM calculations in v0. At some point we decided it wasn't worth doing all the bypassing for v0 with the v1 on the horizon.
I ran into this error on pm-cpu with an RRM case, and verified the fail above is still repeatable with the July 5th scream repo.
I don't think Hommexx supports RRM grids. Currently, for each 2d element, the C++ impl of the halo exchange assumes that there is at most 1 element that shares only a corner with it. That's clearly not true for RRM grids.
Adding support for RRM grids is on our todo list.
@bartgol this is v0 (EAM), which does support RRM.
Whoops, sorry.
I see the same error with scream master of Sep15th.
SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.chrysalis_gnu
Same issue with Oct28th scream repo on pm-cpu.
SMS_D_P16x8.ne4pg2_ne4pg2.F2010-SCREAM-LR.pm-cpu_gnu
I think this is an example where we have problems (that might not show up in the same places) when using more total threads than elements.
Using Perlmutter, and only using the CPUs, I have been hitting some runtime errors. I can work around this by only using 1 thread. Note that PM only has GNU v9 and higher. Using scream repo from Nov 16th.
Here are the 2 types of errors:
These are the tests that have failed so far with 1 of the 2 errors above:
These tests do not fail and complete 5 days.
The SMS tests do fail with MEMLEAK, which may or may not be an issue (https://github.com/E3SM-Project/scream/issues/1318)