ndkeen opened this issue 6 years ago
Thanks @ndkeen. I will look into it.
No surprise: I'm also seeing this with the high-res runs using the 1950 compset.
SMS_Ld2_D_PT.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.cori-knl_gnu.cam-cosplite
169: At line 938 of file /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/clubb/advance_clubb_core_module.F90
169: Fortran runtime error: Index '1' of dimension 1 of array 'skw_zt' outside of expected range (0:0)
169: Program aborted. Backtrace:
169:
169:
169: Error termination. Backtrace:
169:
169: Program aborted. Backtrace:
169:
169: Program aborted. Backtrace:
169: #0 0x33f595b in recursion_check
169: at ../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libgfortran/runtime/error.c:326
169: #1 0xe9a7e0 in __advance_clubb_core_module_MOD_advance_clubb_core
169: at /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/clubb/advance_clubb_core_module.F90:938
169: #0 0xe9a7e0 in __advance_clubb_core_module_MOD_advance_clubb_core
169: at /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/clubb/advance_clubb_core_module.F90:938
169: #0 0x33f595b in recursion_check
169: at ../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libgfortran/runtime/error.c:326
169: #1 0xb0327b in __clubb_intr_MOD_clubb_tend_cam
169: at /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/cam/clubb_intr.F90:1882
169: #2 0xb0327b in __clubb_intr_MOD_clubb_tend_cam
169: at /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/cam/clubb_intr.F90:1882
169: #1 0xe9a7e0 in __advance_clubb_core_module_MOD_advance_clubb_core
169: at /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/clubb/advance_clubb_core_module.F90:938
169: #2 0xb0327b in __clubb_intr_MOD_clubb_tend_cam
169: at /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/cam/clubb_intr.F90:1882
169: #0 0x33f595b in recursion_check
169: at ../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libgfortran/runtime/error.c:326
169: #3 0x616b9b in tphysbc
169: at /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/cam/physpkg.F90:2462
169: #4 0x61a8f6 in __physpkg_MOD_phys_run1._omp_fn.1
169: at /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/cam/physpkg.F90:1029
169: #1 0xe9a7e0 in __advance_clubb_core_module_MOD_advance_clubb_core
169: at /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/clubb/advance_clubb_core_module.F90:938
169: #3 0x616b9b in tphysbc
169: at /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/cam/physpkg.F90:2462
Same error without cosplite modifier. And same error when I used the "small" size (PS).
!---------------------------------------------------------------------------
! Interpolate wp3 to momentum levels, and wp2 to thermodynamic levels
! and then compute Skw for m & t grid
!---------------------------------------------------------------------------
wp2_zt = max( zm2zt( wp2 ), w_tol_sqd ) ! Positive definite quantity
wp3_zm = zt2zm( wp3 )
Skw_zt(1:gr%nz) = Skw_func( wp2_zt(1:gr%nz), wp3(1:gr%nz) ) ! <-- line 938
Skw_zm(1:gr%nz) = Skw_func( wp2(1:gr%nz), wp3_zm(1:gr%nz) )
I first tried re-ordering those 2 lines to verify that I get the same error with Skw_zm -- and I do. Then I noticed these are the only 2 arrays that are assigned with explicit (1:gr%nz) bounds, so now I'm trying to leave that off. However, wouldn't that mean this code is exercised with many of these arrays at zero size?
!Skw_zm(1:gr%nz) = Skw_func( wp2(1:gr%nz), wp3_zm(1:gr%nz) ) ! ndk switch these 2 lines
!Skw_zt(1:gr%nz) = Skw_func( wp2_zt(1:gr%nz), wp3(1:gr%nz) )
Skw_zt = Skw_func( wp2_zt(1:gr%nz), wp3(1:gr%nz) ) ! ndk change, trying
Skw_zm = Skw_func( wp2(1:gr%nz), wp3_zm(1:gr%nz) )
The above change does allow flow to move beyond those 2 lines. Now I hit a different error in same routine:
347: At line 1077 of file /global/cscratch1/sd/ndk/wacmy/m67-mar1/components/cam/src/physics/clubb/advance_clubb_core_module.F90
347: Fortran runtime error: Index '1' of dimension 1 of array 'wprtp2' above upper bound of 0
@ndkeen (and @rljacob and @singhbalwinder), note the similarity to the old issue from PR #1033. That was a one-trip-loop bug in an explicit loop, and this is an implicit loop, but probably still the same issue underneath. Not sure how that helps.
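As an aside on the zero-trip case, here is a standalone sketch (not the CLUBB code): both the explicit-loop form and the array-section form are legal no-ops when the extent is zero, which hints that an index-out-of-range report with bounds (0:0) points at the array itself (e.g. an unallocated or zero-sized descriptor) rather than at the 1:gr%nz syntax.
program zero_trip_check
  implicit none
  integer, parameter :: nz = 0      ! mimic a zero-extent range
  real :: a(1:nz), b(1:nz)          ! zero-sized arrays are legal in Fortran
  integer :: k
  b = 0.0                           ! zero-sized assignment: does nothing
  ! Explicit loop form: executes zero trips when nz == 0.
  do k = 1, nz
     a(k) = 2.0 * b(k)
  end do
  ! Implicit (array-section) form: also a zero-sized, legal no-op.
  a(1:nz) = 2.0 * b(1:nz)
  write(*,*) 'done; size(a) =', size(a)
end program zero_trip_check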
I then added some prints, like so:
write(*,*) " ndk size(wp4_zt), size(wprtp2), size(wp2rtp)=", size(wp4_zt), size(wprtp2), size(wp2rtp)
Most of them are 73,73,73, but several print 73,1,1. That would explain why we would see the next error.
The wprtp2 array is allocated elsewhere, while many arrays are dimensioned in the routine (such as wp4_zt).
call pdf_closure &
( hydromet_dim, p_in_Pa(k), exner(k), thv_ds_zt(k), wm_zt(k), & ! intent(in)
wp2_zt(k), wp3(k), sigma_sqd_w_zt(k), & ! intent(in)
Skw_zt(k), rtm(k), rtp2_zt(k), & ! intent(in)
zm2zt( wprtp, k ), thlm(k), thlp2_zt(k), & ! intent(in)
zm2zt( wpthlp, k ), rtpthlp_zt(k), tmp_sclrm, & ! intent(in)
tmp_wpsclrp_zt, tmp_sclrp2_zt, tmp_sclrprtp_zt, & ! intent(in)
tmp_sclrpthlp_zt, k, & ! intent(in)
#ifdef GFDL
RH_crit(k, : , :), do_liquid_only_in_clubb, & ! intent(in)
#endif
tmp_wphydrometp_zt, tmp_wp2hmp, & ! intent(in)
tmp_rtphmp_zt, tmp_thlphmp_zt, & ! intent(in)
wp4_zt(k), wprtp2(k), wp2rtp(k), & ! intent(out) !<-- error here
wpthlp2(k), wp2thlp(k), wprtpthlp(k), & ! intent(out)
cloud_frac(k), ice_supersat_frac(k), & ! intent(out)
I also ran with Intel debug and it looks like the value is always 73. I will see if the value is 73 for GNU when NOT using DEBUG, etc.
Note I'm running with SMS_D_Ln5.ne16_ne16.FC5AV1C-L --compiler=gnu
I meant to say thanks @worleyph for the link. That does look very relevant. I just haven't had time to look at this closer, but this does seem like something we should be able to figure out. Maybe I can try finding an even smaller test that still fails in this way.
ne16 fails in same way. SMS_D_Ln5.ne16_ne16.FC5AV1C-L.cori-knl_gnu
Forcing 1 thread passes. SMS_D_Ln5_PMx1.ne16_ne16.FC5AV1C-L.cori-knl_gnu
ne4 passes.
I tried the high res case (that I'm most interested in) with DEBUG=TRUE and with only 1 thread and I got a different error.
659: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
659:
659: Backtrace for this error:
659: #0 0x33fd63f in ???
659: at /home/abuild/rpmbuild/BUILD/glibc-2.22/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
659: #1 0x18caf93 in compute_barycenter_coordinates
659: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS_Ld2_D_PMx1_PT.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.cori-knl_gnu.cam-cosplite.20180315_183330_1dilfs/bld/ice/source/core_cice/shared/mpas_cice_advection_incremental_remap.f90:4809
659: #2 0x18dcb72 in construct_linear_tracer_fields
659: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS_Ld2_D_PMx1_PT.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.cori-knl_gnu.cam-cosplite.20180315_183330_1dilfs/bld/ice/source/core_cice/shared/mpas_cice_advection_incremental_remap.f90:3897
659: #3 0x18fb6bb in incremental_remap_block
659: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS_Ld2_D_PMx1_PT.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.cori-knl_gnu.cam-cosplite.20180315_183330_1dilfs/bld/ice/source/core_cice/shared/mpas_cice_advection_incremental_remap.f90:3081
659: #4 0x1908576 in __cice_advection_incremental_remap_MOD_cice_run_advection_incremental_remap
659: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS_Ld2_D_PMx1_PT.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.cori-knl_gnu.cam-cosplite.20180315_183330_1dilfs/bld/ice/source/core_cice/shared/mpas_cice_advection_incremental_remap.f90:2586
659: #5 0x17537d0 in __cice_advection_MOD_cice_run_advection
659: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS_Ld2_D_PMx1_PT.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.cori-knl_gnu.cam-cosplite.20180315_183330_1dilfs/bld/ice/source/core_cice/shared/mpas_cice_advection.f90:143
659: #6 0x175366a in __cice_time_integration_MOD_cice_timestep
659: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS_Ld2_D_PMx1_PT.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.cori-knl_gnu.cam-cosplite.20180315_183330_1dilfs/bld/ice/source/core_cice/shared/mpas_cice_time_integration.f90:165
659: #7 0x1050145 in __ice_comp_mct_MOD_ice_run_mct
659: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS_Ld2_D_PMx1_PT.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.cori-knl_gnu.cam-cosplite.20180315_183330_1dilfs/bld/ice/source/cice_acme_driver/ice_comp_mct.f90:830
659: #8 0x43a843 in __component_mod_MOD_component_ru
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS_Ld2_D_PMx1_PT.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.cori-knl_gnu.cam-cosplite.20180315_183330_1dilfs
Looking back at the error with ne16/ne30 and the issue Pat mentioned, I'm now trying the following:
if (gr%nz > 0) then
Skw_zt(1:gr%nz) = Skw_func( wp2_zt(1:gr%nz), wp3(1:gr%nz) ) ! ndk Index '1' of dimension 1 of array 'skw_zt' outside of expected range (0:0)
Skw_zm(1:gr%nz) = Skw_func( wp2(1:gr%nz), wp3_zm(1:gr%nz) )
endif
Any reason why that's not a good idea?
Makes sense.
Yeah, I tried that and it failed in the same location with the same error. I'm now trying to compile this file differently in hopes I can let the run continue.
Try changing the declaration of that function's result in Skw_module.F90:
- Skw ! Result Skw [-]
+ Skw(1:size(wp2)) ! Result Skw [-]
Skw(1:size(wp2)) ! Result Skw [-]
1
Error: 'array' argument of 'size' intrinsic at (1) must be an array
Change that function from the scalar elemental kind to an array-based pure kind:
- elemental function Skw_func( wp2, wp3 ) &
+ pure function Skw_func( wp2, wp3 ) &
and
- real( kind = core_rknd ), intent(in) :: &
- wp2, & ! w'^2 [m^2/s^2]
+ real( kind = core_rknd ),dimension(:), intent(in) :: &
+ wp2, & ! w'^2 [m^2/s^2]
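Putting those fragments together, a self-contained sketch of what the array-based version might look like (the formula body and the w_tol_sqd value here are placeholders, not the actual CLUBB expression; only the interface change is the point):
module skw_sketch
  implicit none
  integer, parameter :: core_rknd = selected_real_kind(12)
contains
  ! pure (not elemental) function with array dummies and an automatic-size result
  pure function Skw_func( wp2, wp3 ) result( Skw )
    real( kind = core_rknd ), dimension(:), intent(in) :: &
      wp2, & ! w'^2 [m^2/s^2]
      wp3    ! w'^3 [m^3/s^3]
    real( kind = core_rknd ), dimension(size(wp2)) :: Skw ! Result Skw [-]
    real( kind = core_rknd ), parameter :: w_tol_sqd = 4.0e-04_core_rknd ! assumed tolerance
    ! Placeholder skewness formula; zero-sized wp2/wp3 simply yield a zero-sized Skw.
    Skw = wp3 / max( wp2, w_tol_sqd )**1.5_core_rknd
  end function Skw_func
end module skw_sketch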
Hmm, just got:
337:
337: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
And running without bounds checking is also giving me invalid memory ref.
Okay. Other things to try are to continue removing the 1:gr%nz bounds like before, until no such errors are reported, or changing the GNU compiler version to a more recent one, which might have fixed this pickiness about zero-sized arrays.
I'll try a few more things. This issue might indeed be a pedantic compiler, but if it thinks the array is 0:0, then either that is wrong (and why?) or the index of 1 is wrong (and why?).
Note I did try a more recent GNU version early on. Currently we use version 6.3, and all of the version 7's have trouble building some code in the land model: https://github.com/ACME-Climate/ACME/issues/2145
OK, printing out more variables, I see that the Skw_zt variable is not only of size 1, but it's NOT even allocated on some ranks. It's starting to look like some ranks have done the allocation and some have not -- perhaps a race condition. I do see this is in a (large) threaded section.
An update on this issue, as I've been working on this for a couple of days now. I'm placing lines in the code to query whether the array Skw_zt is allocated or not. And what I see is that in certain sections of the code, where there are, say, 4 active threads, some threads have this array allocated and others do not. Obviously that can't be good. Going all the way back to where the threads are launched, the array has already been allocated and the master thread verifies this. But just inside an OpenMP region, the threads disagree on who has the array allocated and who does not. It could be a compiler bug. I'm now trying to compile only the ATM with GNU 7.2 (as the LND will not build with 7.2, for example). And obviously, this would not be an issue (and is not for me) with 1 thread. Note that it's still only certain ranks of the 402 MPI's that exhibit this behavior, but I have so far been seeing repeatable behavior (for example, if I look at the same rank, I see the same thread numbers do the same things).
Would someone else be able to try this test on another machine? Maybe first using default settings, but then force it to have more threads (say 4). I did try on cori-haswell and was unable to repeat my issue there -- but I need to try a few more things to convince myself it's just a difference in machines -- i.e., the MPI count may be giving a section of code a particular workload.
I've tested these and they all meet a similar fate.
SMS_D_Ln5.ne16_ne16.FC5AV1C-L --compiler=gnu
SMS_PMx4_D_Ln5.ne16_ne16.FC5AV1C-L --compiler=gnu
SMS_P402x4_D_Ln5.ne16_ne16.FC5AV1C-L --compiler=gnu <-- will use 13 nodes
SMS_P402x2_D_Ln5.ne16_ne16.FC5AV1C-L --compiler=gnu
Otherwise, they should use 6 nodes (the default number of tasks per node is 128).
Now trying an older repo.
OK, looking at some old results I have saved from running acme_developer with GNU, I see that on cori-knl, this test passed with a master repo from ~Jan 11th and passed with a next repo from Jan 19th. The next time I tried the test with GNU was with a repo from March 1st, which fails.
On edison, the ne16 test passes with GNU as we are only using 1 thread. Looking at the ne30 tests, which are failing in the same way, I see that with a master repo (or branch) from around Jan 18th (5f72e852d) I ran a successful test on edison,
/global/cscratch1/sd/ndk/acme_scratch/edison/mfu2018s2/SMS_D_Ln5.ne30_ne30.FC5AV1C-L.edison_gnu.stage2
and it failed with a repo from next Jan 19th (which is when I originally posted this issue).
/global/cscratch1/sd/ndk/acme_scratch/edison/n30-mjan19/SMS_D_Ln5.ne30_ne30.FC5AV1C-L.edison_gnu.20180120_094625_yw1ibe
Looking at the differences between those 2 cases, we are changing the PE layouts, including thread counts, as well as PIO_STRIDE, and a libsci version. A few things to test, but this could be a bug that only shows up with certain PE layouts.
In both cases noted above (i.e. the pass/fails on cori/edison), the PE layouts are adjusted. So it does sound like something only triggered with specific workloads and not the repo itself (there were also no commits that should have affected this section of code).
I also went back to a repo from Nov 6th and created this test, but forced the 402 MPI's with 4 threads. Same failure. (on cori with gnu/debug)
Removing my last comment, as I need to better understand threadprivate and how it works for allocatable arrays across files/modules.
I now see that threadprivate should be doing what we want, though the way it's used makes me nervous. Declaring vars/arrays as threadprivate before going into OMP regions should tell the compiler to make a copy of each for every thread. So allocating in a parallel region, then ending that region and beginning another, should still leave those same arrays (on the same threads) allocated and with correct data.
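That expectation matches the OpenMP rules as I understand them: threadprivate data persists across consecutive parallel regions provided dynamic thread adjustment is off and the regions use the same number of threads. A minimal standalone reproducer of the pattern (not E3SM code) would be:
module tp_demo_mod
  implicit none
  real, allocatable :: a(:)
  !$omp threadprivate(a)
end module tp_demo_mod

program tp_demo
  use tp_demo_mod
  use omp_lib
  implicit none
  ! First parallel region: every thread allocates its own copy of 'a'.
  !$omp parallel
  allocate( a(10) )
  a = real( omp_get_thread_num() )
  !$omp end parallel
  ! Second parallel region: with OMP_DYNAMIC false and the same thread count,
  ! each thread should still find its own copy allocated and intact.
  !$omp parallel
  write(*,*) omp_get_thread_num(), ' allocated=', allocated(a)
  !$omp end parallel
end program tp_demo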
Just updating with more info:
I've narrowed down when/where the "mixup" is happening. Many arrays (not just Skw_zt) are declared threadprivate and allocated by all threads. And then OMP parallel regions are opened/closed. I test whether the array is allocated on all threads at various places. I see that in cam_comp.F90, all threads are happy with these arrays during cam_init. And, going into cam_run1, all threads are OK, but something happens after stepon_run1() and 1 of the threads (thread 1 out of 0,1,2,3) claims the array is no longer allocated. I see that stepon_run1 (assuming the SE path) does indeed have its own OMP parallel do loops.
subroutine cam_run1(cam_in, cam_out)
...
#ifdef _OPENMP
use omp_lib !ndk add to get thread num
#endif
use variables_diagnostic_module, only: Skw_zt, Skw_zm !ndk add to get access to these threadprivate arrays
....
! arrays are allocated on all threads here
!$OMP PARALLEL
#ifdef _OPENMP
write(*,*) omp_get_thread_num()," ndk in cam_comp.F90 in cam_run1 before stepon_run1 ptest allocated=", allocated(Skw_zt), allocated(Skw_zm)
#else
write(*,*) " ndk in cam_comp.F90 in cam_run1 before stepon_run1 ptest allocated=", allocated(Skw_zt), allocated(Skw_zm)
#endif
!$OMP END PARALLEL
!----------------------------------------------------------
! First phase of dynamics (at least couple from dynamics to physics)
! Return time-step for physics from dynamics.
!----------------------------------------------------------
call t_barrierf ('sync_stepon_run1', mpicom)
call t_startf ('stepon_run1')
call stepon_run1( dtime, phys_state, phys_tend, pbuf2d, dyn_in, dyn_out )
call t_stopf ('stepon_run1')
! here, not all threads have these arrays allocated. What happened?
!$OMP PARALLEL
#ifdef _OPENMP
write(*,*) omp_get_thread_num()," ndk in cam_comp.F90 after stepon_run1 ptest allocated=", allocated(Skw_zt), allocated(Skw_zm)
#else
write(*,*) " ndk in cam_comp.F90 after stepon_run1 ptest allocated=", allocated(Skw_zt), allocated(Skw_zm)
#endif
!$OMP END PARALLEL
341: 0 ndk in cam_comp.F90 in cam_run1 before stepon_run1 ptest allocated= T T
341: 2 ndk in cam_comp.F90 in cam_run1 before stepon_run1 ptest allocated= T T
341: 3 ndk in cam_comp.F90 in cam_run1 before stepon_run1 ptest allocated= T T
341: 1 ndk in cam_comp.F90 in cam_run1 before stepon_run1 ptest allocated= T T
341: 3 ndk in cam_comp.F90 after stepon_run1 ptest allocated= T T
341: 0 ndk in cam_comp.F90 after stepon_run1 ptest allocated= T T
341: 1 ndk in cam_comp.F90 after stepon_run1 ptest allocated= F F
341: 2 ndk in cam_comp.F90 after stepon_run1 ptest allocated= T T
Try with setenv OMP_DYNAMIC FALSE on the compute nodes. Maybe gcc defaults it to TRUE.
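For what it's worth, the dynamic-adjustment setting can also be checked and pinned from inside the code, independent of the environment variable (a small standalone check, not a proposed model change):
program dyn_check
  use omp_lib
  implicit none
  ! Report the current setting, then force dynamic thread adjustment off so
  ! every parallel region uses the same number of threads.
  write(*,*) 'omp_get_dynamic() on entry:', omp_get_dynamic()
  call omp_set_dynamic( .false. )
  !$omp parallel
  !$omp single
  write(*,*) 'threads in region:', omp_get_num_threads()
  !$omp end single
  !$omp end parallel
end program dyn_check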
I checked that OMP_DYNAMIC defaults to false, but also tried with OMP_DYNAMIC explicitly set to false, and nothing changed. I then tried adjusting a few other OMP-related env variables and was unable to change the behavior. Using Intel MPI stops with a different failure (it also fails with the Intel compiler). I also get a failure with optimization flags that is likely the same error as with DEBUG -- we don't get a descriptive message, but it's otherwise the same.
And then, as I couldn't go to sleep, I wondered why the ne4 case worked. I've only been testing it without threads. For the ne4 problem, there are 96 elements. Using 96 MPI's means it doesn't make sense to use threads, but it should still work -- using 2 or 4 threads with 96 MPI's does fail with GNU, with a familiar error: libgomp: Thread creation failed: Resource temporarily unavailable
I've seen this before and I thought it was a situation where I could resubmit and not see it again. I would have to go back and look. Resubmitting in these cases did not work, nor did changing the OMP stack size.
So then I tried using fewer MPI's for the ne4 case so that I could add threads. I got the same error (as the original github issue) when trying 30x4 for the ne4 case. And then, after some experimentation, I now see a pattern: if the total number of tasks (MPI's x threads) is more than the number of elements, the run fails with the original error of this github issue. If I run with fewer, it works. My original attempt was using 402 MPI's and 4 threads (1608 total), which is larger than the 1536 elements in the ne16 case. Using 384x4 works. So does 300x4.
It would be nice to know why, as it might be something we are still doing wrong, but it's also an easy fix to make sure the layouts are not asking for more total tasks than elements. I suspect it's OK to ask for more MPI's than we have elements (will verify), as long as we don't use threads.
Different PE layouts based on compiler?
I'm going through the layouts and slightly backing down on ATM MPI tasks or threads wherever there are cases that try to use more total tasks than elements. It could be that only cori-knl has these offending layouts, only after updates I made a few months ago, and only a small number of them. I'm testing that they work and still have about the same performance. Then we should be able to run with Intel/GNU.
It doesn't solve the issue of why we get the error; it's just an easy way to avoid it.
Perhaps keep this as an open issue, but then do we "enforce" that the total number of tasks (MPI's times threads for ATM) is less than the number of elements?
@ndk and @mt5555 , looking at the code:
subroutine setup_diagnostic_variables
in variables_diagnostic_module.F90, which is called from setup_clubb_core in advance_clubb_core_module.F90, the allocations are not done in a threaded region. The threadprivate declaration does guarantee that there will be a separate copy for each thread. Sounds like the threadprivate implementation is broken in GNU, not that this is bad logic?
So the array is allocated outside the parallel region, but then marked threadprivate. So when the parallel region starts, each thread should get a copy of the array. When the parallel region stops, I'm not sure which copy survives (probably the master thread's?). When the parallel region starts again, does each thread then get another copy of the array from the master thread?
I think the parallel region is started higher up. I'm putting print statements around the code (seen below) and I see these arrays are allocated within a threaded region. But I'd say: if they were NOT allocated in a threaded region, then that's not right -- if they are threadprivate, I think you DO want them to be allocated in a threaded region (right?). Also, regardless of this, what seems dangerous to me is starting/stopping parallel regions, letting other OMP constructs run in between, and then expecting the arrays to still be the same on the same threads you expect.
subroutine setup_diagnostic_variables( nz )
! Description:
! Allocates and initializes prognostic scalar and array variables
! for the CLUBB model code
!-----------------------------------------------------------------------
use constants_clubb, only: &
em_min, & ! Constant(s)
zero
use parameters_model, only: &
hydromet_dim, & ! Variables
sclr_dim, &
edsclr_dim
use clubb_precision, only: &
core_rknd ! Variable(s)
#ifdef _OPENMP
use omp_lib
#endif
implicit none
! Input Variables
integer, intent(in) :: nz ! Nunber of grid levels [-]
! Local Variables
integer :: i
! --- Allocation ---
! Diagnostic variables
allocate( sigma_sqd_w_zt(1:nz) ) ! PDF width parameter interp. to t-levs.
allocate( Skw_zm(1:nz) ) ! Skewness of w on momentum levels
allocate( Skw_zt(1:nz) ) ! Skewness of w on thermodynamic levels
#ifdef _OPENMP
write(*,*) omp_get_thread_num()," ndk in setup_diagnostic_variables allocated=", allocated(Skw_zt), allocated(Skw_zm), " nz=", nz
#else
write(*,*) " ndk in setup_diagnostic_variables allocated=", allocated(Skw_zt), allocated(Skw_zm), " nz=", nz
#endif
In subroutine clubb_ini_cam, here are some OMP threaded regions (pretty high up):
! Read in parameters for CLUBB. Just read in default values
!$OMP PARALLEL
call read_parameters( -99, "", clubb_params )
!$OMP END PARALLEL
! Fill in dummy arrays for height. Note that these are overwrote
! at every CLUBB step to physical values.
do k=1,pverp
zt_g(k) = ((k-1)*1000._r8)-500._r8 ! this is dummy garbage
zi_g(k) = (k-1)*1000._r8 ! this is dummy garbage
enddo
! Set up CLUBB core. Note that some of these inputs are overwrote
! when clubb_tend_cam is called. The reason is that heights can change
! at each time step, which is why dummy arrays are read in here for heights
! as they are immediately overwrote.
!$OMP PARALLEL
call setup_clubb_core &
( pverp, theta0, ts_nudge, & ! In
hydromet_dim, sclr_dim, & ! In
sclr_tol, edsclr_dim, clubb_params, & ! In
l_host_applies_sfc_fluxes, & ! In
l_uv_nudge, saturation_equation, & ! In
l_implemented, grid_type, zi_g(2), zi_g(1), zi_g(pverp), & ! In
zi_g(1:pverp), zt_g(1:pverp), zi_g(1), & ! In
err_code )
!$OMP END PARALLEL
@ambrad explained this to me: for pointers which are declared threadprivate, all threads must allocate them (since each thread has its own version of the pointer).
Most of the CLUBB variables are done this way -- a bunch of module pointer variables, marked threadprivate. The allocation is done inside a parallel region, as it should be. Now every thread has its own copy of the variable. This data should never be used outside the parallel region, as it will be undefined (but probably you just get the master thread's version of the array).
FYI - see
http://tx.technion.ac.il/doc/intel13/ssadiag_docs/GUID-1B784663-59AE-4241-993C-B8BD9D14C407.htm
where it describes the behavior that Noel is worried about (and rightfully so, based on this discussion).
Nice find Pat. I did a fair amount of searching online for discussions about using threadprivate. I see the use of this when you have a lot of control (i.e., within a single subroutine), but it just seems dangerous to me to "rely on" the threads of a parallel region being assigned the same way. And I did look at using COPYIN, but need to better understand it.
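For reference on COPYIN, a minimal standalone sketch (not a proposed change to the model): the clause copies the initial thread's threadprivate copy, allocation status included, to every thread at entry to the parallel region, which sidesteps relying on each thread having done its own allocation earlier.
module copyin_demo_mod
  implicit none
  real, allocatable :: a(:)
  !$omp threadprivate(a)
end module copyin_demo_mod

program copyin_demo
  use copyin_demo_mod
  use omp_lib
  implicit none
  ! Allocate and fill only the initial thread's copy...
  allocate( a(5) )
  a = 1.0
  ! ...then COPYIN copies it (allocation status and values) to each thread's
  ! threadprivate copy on entry to the region.
  !$omp parallel copyin(a)
  write(*,*) omp_get_thread_num(), ' allocated=', allocated(a), ' a(1)=', a(1)
  !$omp end parallel
end program copyin_demo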
Just looking at Skw_zt, it is set every time advance_clubb_core is called:
Skw_zt(1:gr%nz) = Skw_func( wp2_zt(1:gr%nz), wp3(1:gr%nz) )
and all subsequent uses are 'intent(in)':
call stat_update_var( iSkw_zt, Skw_zt, & ! In
stats_zt ) ! In/Out
...
call pdf_closure &
( hydromet_dim, p_in_Pa(k), exner(k), thv_ds_zt(k), wm_zt(k), & ! intent(in)
wp2_zt(k), wp3(k), sigma_sqd_w_zt(k), & ! intent(in)
Skw_zt(k), rtm(k), rtp2_zt(k), & ! intent(in)
...
call pdf_closure &
( hydromet_dim, p_in_Pa(k), exner(k), thv_ds_zt(k), wm_zt(k), & ! intent(in)
wp2_zt(k), wp3(k), sigma_sqd_w_zt(k), & ! intent(in)
Skw_zt(k), rtm_frz(k), rtp2_zt(k), & ! intent(in)
...
call advance_wp2_wp3 &
( dt, sfc_elevation, sigma_sqd_w, wm_zm, wm_zt, & ! intent(in)
a3_coef, a3_coef_zt, wp3_on_wp2, & ! intent(in)
wpthvp, wp2thvp, um, vm, upwp, vpwp, & ! intent(in)
up2, vp2, Kh_zm, Kh_zt, tau_zm, tau_zt, tau_C1_zm, & ! intent(in)
Skw_zm, Skw_zt, rho_ds_zm, rho_ds_zt, & ! intent(in)
so there is no attempt to rely on values between calls (for this array). Looks like the only reason to use threadprivate is to allocate this array only once rather than making it a local array in advance_clubb_core?
Since the problem you are seeing is that the array does not even exist for some threads, still sounds like a threadprivate bug in GNU. There are LOTS of threadprivate variables in CLUBB, and perhaps others are not used safely, but I don't see any problem with Skw_zt. Am I missing something?
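For comparison, the local-array alternative mentioned above would look roughly like this (a sketch with placeholder names and formula, not the real advance_clubb_core interface):
subroutine advance_core_sketch( nz, wp2_zt, wp3 )
  ! Skw_zt as a local automatic array sized by the grid passed in, so each
  ! call (and hence each thread) gets its own storage with no threadprivate state.
  implicit none
  integer, parameter :: rk = selected_real_kind(12)
  integer, intent(in) :: nz
  real(rk), dimension(nz), intent(in) :: wp2_zt, wp3
  real(rk), dimension(nz) :: Skw_zt
  Skw_zt = wp3 / max( wp2_zt, 1.0e-8_rk )**1.5_rk   ! placeholder for Skw_func
  write(*,*) 'computed Skw_zt with size', size(Skw_zt)
end subroutine advance_core_sketch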
Note I can repeat the issue with ne4 as well. I just need (a) to use threading and (b) a total number of tasks larger than the number of elements. SMS_D_P30x4_Ln5.ne4_ne4.FC5AV1C.cori-knl_gnu
@ndkeen , one thing you might try (sorry if you have already done this) is to print out the number of threads in the parallel region when allocating Skw_zt, and then again when using Skw_zt (for the case when there are more computational threads than elements).
In dyn_init2, in dyn_comp.F90, is code
if(par%dynproc) then
....
#ifdef HORIZ_OPENMP
if (iam==0) write (iulog,*) "dyn_init2: hthreads=",hthreads,&
"max_threads=",omp_get_max_threads()
!$OMP PARALLEL NUM_THREADS(hthreads), DEFAULT(SHARED), PRIVATE(ie,ithr,nets,nete,hybrid)
#endif
#ifdef COLUMN_OPENMP
call omp_set_num_threads(vthreads)
#endif
...
#ifdef HORIZ_OPENMP
!$OMP END PARALLEL
#endif
end if
The above looks to be correct, but perhaps GNU is "remembering" NUM_THREADS(hthreads) for some reason? phys_init is called after the above (which calls clubb_ini_cam, which calls setup_clubb_core).
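A tiny standalone check of that "remembering" hypothesis might look like the following (assumed test code, not from the repo); per the OpenMP spec a NUM_THREADS clause applies only to the region it appears on, so the second region should fall back to the usual nthreads setting:
program num_threads_check
  use omp_lib
  implicit none
  integer, parameter :: hthreads = 2
  ! Region with an explicit NUM_THREADS clause...
  !$omp parallel num_threads(hthreads)
  !$omp single
  write(*,*) 'with clause:   ', omp_get_num_threads()
  !$omp end single
  !$omp end parallel
  ! ...followed by a plain region, which should NOT inherit hthreads.
  !$omp parallel
  !$omp single
  write(*,*) 'without clause:', omp_get_num_threads()
  !$omp end single
  !$omp end parallel
end program num_threads_check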
I might already have something that could answer the question. In this repo: /global/cscratch1/sd/ndk/wacmy/m72-mar14
I have added several print statements that include the thread number and whether these arrays are allocated or not. I can make a branch if it helps, but you can grep for "ndk" in that repo under cam to see where I've added prints.
modified: components/cam/src/physics/cam/clubb_intr.F90
modified: components/cam/src/physics/cam/physpkg.F90
modified: components/cam/src/physics/clubb/advance_clubb_core_module.F90
modified: components/cam/src/physics/clubb/grid_class.F90
modified: components/cam/src/physics/clubb/variables_diagnostic_module.F90
Here is a case where I've tried 30 MPI's with 4 threads each, which is 120 total tasks and more than the number of elements (96).
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS_D_P30x4_Ln5.ne4_ne4.FC5AV1C.cori-knl_gnu.m72-gh2044
If you just want to know how many threads are allocating and how many threads are using it, that's easy: it's 4 as expected. What doesn't make sense is why 1 of those threads, in certain parallel regions, claims that the arrays are no longer allocated.
If you just want to know how many threads are allocating and how many threads are using it, that's easy: it's 4 as expected.
What I was looking for. Thanks.
I've been thinking about this issue. It is exercising the ATM threading in a way we might not have seen before. We have cases where each MPI has, say, 4 threads and then gets 4 or more columns, doing them 4 at a time. But in this case we are asking for, say, 4 threads each, yet for some MPI's there are not enough columns to fill them, and they will get fewer than 4 units of work (columns). It could be that Intel/GNU handle this differently. Should the code be allocating memory for all 4 threads but then only operating on 3 columns? Or should it only allocate space for 3 and only launch 3 threads in each parallel region?
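For concreteness, a toy illustration of the situation described above (more threads than work units; the numbers are assumed, not the model's actual decomposition):
program more_threads_than_work
  use omp_lib
  implicit none
  integer, parameter :: nelem = 3   ! e.g. elements/columns owned by this rank
  integer :: ie
  ! With 4 threads and 3 elements, one thread joins the parallel region but
  ! receives zero iterations of the worksharing loop.
  !$omp parallel num_threads(4)
  !$omp do
  do ie = 1, nelem
     write(*,*) 'thread', omp_get_thread_num(), 'works on element', ie
  end do
  !$omp end do
  !$omp end parallel
end program more_threads_than_work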
Running SMS_D_P30x4_Ln5.ne4_ne4.FC5AV1C.cori-knl_gnu
with a master of July 7th, I now get:
6: At line 726 of file /global/cscratch1/sd/ndk/wacmy/m37-jul7/components/cam/src/physics/clubb/grid_class.F90
6: Fortran runtime error: Index '-1' of dimension 1 of array 'gr%dzm' below lower bound of 0
6:
6: Error termination. Backtrace:
6: #0 0x1530c82 in __grid_class_MOD_setup_grid_heights
6: at /global/cscratch1/sd/ndk/wacmy/m37-jul7/components/cam/src/physics/clubb/grid_class.F90:726
6: #1 0x1515261 in __clubb_api_module_MOD_setup_grid_heights_api
6: at /global/cscratch1/sd/ndk/wacmy/m37-jul7/components/cam/src/physics/clubb/clubb_api_module.F90:1284
6: #2 0x1093bbc in __clubb_intr_MOD_clubb_tend_cam
6: at /global/cscratch1/sd/ndk/wacmy/m37-jul7/components/cam/src/physics/cam/clubb_intr.F90:1874
6: #3 0x1301beb in tphysbc
6: at /global/cscratch1/sd/ndk/wacmy/m37-jul7/components/cam/src/physics/cam/physpkg.F90:2542
6: #4 0x13261a5 in __physpkg_MOD_phys_run1._omp_fn.2
6: at /global/cscratch1/sd/ndk/wacmy/m37-jul7/components/cam/src/physics/cam/physpkg.F90:1054
6: #5 0x7a883ed in gomp_thread_start
6: at ../../../cray-gcc-8.3.0-201903122028.16ea96cb84a9a/libgomp/team.c:120
6: #6 0x443f9c8 in start_thread
6: at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/pthread_create.c:465
6: #7 0x7b9b54e in ???
6: at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Still happening with master of Nov 19th.
Using April 22nd master, I tried SMS_D_P30x4_Ln5.ne4_ne4.FC5AV1C-L.cori-knl_gnu
and I get what may be the same error:
14: At line 726 of file /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_cori-add-PERL5LIB/components/eam/src/physics/clubb/grid_class.F90
14: Fortran runtime error: Index '-1' of dimension 1 of array 'gr%dzm' below lower bound of 0
14:
14: Error termination. Backtrace:
14: #0 0x1559520 in __grid_class_MOD_setup_grid_heights
14: at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_cori-add-PERL5LIB/components/eam/src/physics/clubb/grid_class.F90:726
14: #1 0x153daff in __clubb_api_module_MOD_setup_grid_heights_api
14: at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_cori-add-PERL5LIB/components/eam/src/physics/clubb/clubb_api_module.F90:1284
14: #2 0x10bbedb in __clubb_intr_MOD_clubb_tend_cam
14: at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_cori-add-PERL5LIB/components/eam/src/physics/cam/clubb_intr.F90:1961
14: #3 0x133913f in tphysbc
14: at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_cori-add-PERL5LIB/components/eam/src/physics/cam/physpkg.F90:2469
14: #4 0x135bd32 in __physpkg_MOD_phys_run1._omp_fn.2
14: at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_cori-add-PERL5LIB/components/eam/src/physics/cam/physpkg.F90:1047
14: #5 0x7cd855d in gomp_thread_start
We still have an issue. I think it will happen when the total number of threads is larger than the total number of elements.
Example of failure:
SMS_D_P30x4_Ln5.ne4_ne4.F2010-CICE.cori-knl_gnu
Example of pass:
SMS_D_P48x2_Ln5.ne4_ne4.F2010-CICE.cori-knl_gnu
The failure message is different than the one above:
9: At line 726 of file /global/cfs/cdirs/e3sm/ndk/repos/me05-oct12/components/eam/src/physics/clubb/grid_class.F90
9: Fortran runtime error: Index '-1' of dimension 1 of array 'gr%dzm' below lower bound of 0
which I think is the same as https://github.com/E3SM-Project/E3SM/issues/3785
With master of Jan 2024, I tried on pm-cpu and see the same error.
SMS_D_P30x4_Ln5.ne4_ne4.F2010-CICE.pm-cpu_gnu
Update to a more recent test name:
SMS_D_P30x4_Ln5.ne4_ne4.F2010-CICE.cori-knl_gnu
On edison (and later on cori-knl), the DEBUG version of the ne30 F case has the following issue. SMS_D_Ln5.ne30_ne30.FC5AV1C-L.edison_gnu
Intel DEBUG did not have a problem.
/global/cscratch1/sd/ndk/acme_scratch/edison/n30-mjan19/SMS_D_Ln5.ne30_ne30.FC5AV1C-L.edison_gnu.20180120_094625_yw1ibe