E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

A_WCYCL2000 ne30_oEC error on Cori when using OpenMP with nested OpenMP support #783

Closed. worleyph closed this issue 8 years ago.

worleyph commented 8 years ago

This may be premature, but I would like help verifying, or not, the following issue. I have been running extensive experiments with different PE layouts, including different levels of threading, on Cori, Edison, and Titan. These have been working fine, with the version of master right after adding DRV_THREADING support.

I recently moved to the most recent version, with nested OpenMP support, and this is working fine with MPI only. However, a first test with MPI+OpenMP resulted in a failure of the following form. (Note, this did not specify vertical threads.)

 2604: forrtl: severe (174): SIGSEGV, segmentation fault occurred
 2604: Image              PC                Routine            Line        Source
 2604: cesm.exe           000000000491C2C5  Unknown               Unknown  Unknown
 ...
 2604: cesm.exe           0000000000F1D14F  rrtmg_lw_rtrnmc_m         325  rrtmg_lw_rtrnmc.f90
 2604: cesm.exe           0000000000F123D1  rrtmg_lw_rad_mp_r         499  rrtmg_lw_rad.f90
 2604: cesm.exe           000000000063D351  radlw_mp_rad_rrtm         221  radlw.F90
 2604: cesm.exe           000000000062C099  radiation_mp_radi        1166  radiation.F90

one time, and just a failure without an error message in another:

 srun: error: nid00314: task 613: Exited with exit code 174

I'll continue looking at this, but it would be helpful if someone else could reproduce the problem as I also have some other modifications in the code (which should be irrelevant, but who knows).

Again, the compset, resolution, and machine are

 -compset A_WCYCL2000 -res ne30_oEC -mach corip1

and the env_mach_pes.xml file

 env_mach_pes.xml_Corip1_ATM2700x2HT_A_WCYCL2000_ne30_oEC

from https://acme-climate.atlassian.net/wiki/x/NoD4Ag . The user_nl_cam was also modified, as per instructions on this page. My case directory (with these files) is

 ~worleyph/ACME/douglasjacobsen/ocean_coupling/ACME/cime/scripts/A_WCYCL2000.ne30_oEC.corip1_master_DJb

Also note that I have not tried this particular experiment anywhere else yet, so do not know yet whether this is a Cori-specific problem.

worleyph commented 8 years ago

Checked out a fresh version of master. It also dies, in/after nstep 14, in the atmosphere or ice (they are overlapped on processes 0-2047). Since only ATM has threading, it is probably ATM.

 srun: error: nid01056: task 1208: Exited with exit code 174
 srun: error: nid01084: task 1722: Exited with exit code 174
worleyph commented 8 years ago

And I then modified env_mach_specific to make this MPI-only (but same layout and same number of processes per node), and this worked fine. So it is an OpenMP issue, and a recent one?

amametjanov commented 8 years ago

Thanks Pat. Trying to reproduce on a smaller case with -compset FC5AV1C-01 -res ne30_oEC and the same user_nl_cam. Couldn't open your run-dir. Based on the stack trace, there are no OpenMP constructs in those routines, which might mean that the default stack size of 64M in env_mach_specific is not enough.

worleyph commented 8 years ago

Please also check out

 ~worleyph/ACME/master/ACME/cime/scripts/A_WCYCL2000.ne30_oEC.corip1_master_DJb2
 /global/cscratch1/sd/worleyph/acme_scratch/A_WCYCL2000.ne30_oEC.corip1_master_DJb2

(failed, with OpenMP)

 ~worleyph/ACME/master/ACME/cime/scripts/A_WCYCL2000.ne30_oEC.corip1_master_DJb3
 /global/cscratch1/sd/worleyph/acme_scratch/A_WCYCL2000.ne30_oEC.corip1_master_DJb3

(worked, without OpenMP). I have changed permissions.

worleyph commented 8 years ago

Note that the only differences between the working and non-working source are in the following:

 cime/driver_cpl/driver/cesm_comp_mod.F90
 cime/machines-acme/config_compilers.xml
 components/cam/bld/build-namelist
 components/cam/bld/configure
 components/cam/bld/namelist_files/namelist_defaults_cam.xml
 components/cam/bld/namelist_files/namelist_definition.xml
 components/cam/src/dynamics/se/dyn_comp.F90
 components/clm/src/biogeochem/CNAllocationMod.F90
 components/clm/src/biogeochem/CNNitrogenFluxType.F90
 components/clm/src/biogeochem/PhosphorusFluxType.F90
 components/clm/src/cpl/lnd_comp_esmf.F90
 components/clm/src/cpl/lnd_comp_mct.F90
 components/clm/src/main/clm_driver.F90
 components/clm/src/main/clm_initializeMod.F90
 components/homme/src/share/namelist_mod.F90
 components/homme/src/share/thread_mod.F90

Nothing obviously would require additional stack size, though perhaps just the option of using nested OpenMP could increase the thread stack requirements?
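
For reference, I assume nested support is switched on somewhere with calls along the lines of the sketch below (I have not checked what thread_mod.F90 actually does); any additional inner threads would each need their own OMP_STACKSIZE-sized stack:

 ! sketch only, not the actual thread_mod.F90 code
 subroutine enable_nested_omp()
   use omp_lib
   implicit none
   call omp_set_nested(.true.)          ! allow a second (vertical) level of threading
   call omp_set_max_active_levels(2)    ! outer horizontal team + inner vertical team
 end subroutine enable_nested_omp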

mt5555 commented 8 years ago

@amametjanov : can nested be disabled by removing "-DCOLUMN_OPENMP" from cam/bld/configure?

mt5555 commented 8 years ago

I ran the FC5AV1C compset with 4 threads on Edison, using master, and it ran fine for 1 day: ./create_test -c SMS_Ld1_P96x4.ne30_ne30.FC5AV1C

I'm trying to run SMS.ne30_oEC.A_WCYCL2000 on Edison, with threads, next...

amametjanov commented 8 years ago

Yes, removing -DCOLUMN_OPENMP from components/cam/bld/configure will disable nested threading.
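
The second-level directives are guarded by that CPP macro, roughly as in the sketch below (schematic only, not the exact E3SM loops):

 ! schematic sketch of how the guard works; not the actual E3SM source
 subroutine column_loop(nlev)
   implicit none
   integer, intent(in) :: nlev
   integer :: k
 #ifdef COLUMN_OPENMP
 !$OMP PARALLEL DO PRIVATE(k)
 #endif
   do k = 1, nlev
      ! vertical (column) work goes here
   end do
 end subroutine column_loop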

worleyph commented 8 years ago

Reran the Cori experiment with twice the thread stack size:

 setenv OMP_STACKSIZE 128M

and it died in the same general way (after 12 timesteps?). It could be that the simulation is going off the rails, but I also got the same error when running with the new ocean coupling and ocean timestep, so that is not the source of the problem.

worleyph commented 8 years ago

I'm continuing to run my reproducer: have a job queued on Edison (to see whether this also fails), and am building on Cori without "-DCOLUMN_OPENMP". I'll be offline for much of the rest of today, but will report back results as I have the opportunity.

worleyph commented 8 years ago

My Cori job built without "-DCOLUMN_OPENMP" (and with the original thread stack size) worked fine.

worleyph commented 8 years ago

Also, comparing atm.log for the builds with and without "-DCOLUMN_OPENMP", the nstep output already differs at nstep=1. Since vthreads == 1, the expectation is that this would be bit-for-bit, isn't it?

(with "-DCOLUMN_OPENMP")

  nstep, te        1   0.33434733703741007E+10   0.33433343412495546E+10  -0.76886859819777350E-02   0.98519958019889353E+05

vs. (without "-DCOLUMN_OPENMP")

  nstep, te        1   0.33432892434965711E+10   0.33433343412495546E+10   0.24940275090983132E-02   0.98519958019822065E+05
rljacob commented 8 years ago

Is this a Cori-only problem? It might be v16 of the Intel compiler. There is also a Cori-only problem with ocean restart (#774).

worleyph commented 8 years ago

Is this a Cori-only problem?

Edison experiment is in the queue. I'll try Titan later today as well. It doesn't "feel" like a Cori-only problem, but we'll know more by the end of the day.

worleyph commented 8 years ago

!!! Worked on Edison.

So Cori is just the problem child. I'll still check whether Titan passes this test. I'll also run without "-DCOLUMN_OPENMP" on Edison, to look at whether the numerics are the same.

worleyph commented 8 years ago

On Edison, atm.log is bit-for-bit whether compiled with or without "-DCOLUMN_OPENMP" at nstep=1. At nstep=2, it does differ though:

  nstep, te        2   0.33433588836394591E+10   0.33433358640319705E+10  -0.12730496217083795E-02   0.98519712566572896E+05

vs.

  nstep, te        2   0.33433588817648101E+10   0.33433358632926588E+10  -0.12729868340587436E-02   0.98519712587624846E+05

So, is this expected?

mt5555 commented 8 years ago

Not expected, and I believe this is a problem.

I merged in the nested openmp pull request after it passed all the tests on 'next', under the assumption that we tested threaded configurations, but I now suspect this is a large gap in our testing and that -DCOLUMN_OPENMP has not been properly tested.

amametjanov commented 8 years ago

Our default PE configuration for the WCYCL compset was set to pure-MPI mode to get the water-cycle prototype running out-of-the-box. This is probably why the acme_integration test suite on skybridge didn't flag the nested OpenMP PR. I'll submit a PR to modify the default WCYCL PEs for Cori and Edison based on Pat's recent small PE baselines. The skybridge nightly tests should then flag it.

worleyph commented 8 years ago

Job ran on Titan; I am in the process of building/rerunning without "-DCOLUMN_OPENMP" to look at numerics. Note that the finished job was slower in CAM than in my previous performance studies - given the potential performance variability, this may not be "real". However, performance of the other components (ice, ocean, land) was not significantly different.

worleyph commented 8 years ago

So, I am running so many experiments that I am concerned I am not keeping track of them correctly, however ...

on Titan, running with and without "-DCOLUMN_OPENMP" for A_WCYCL2000.ne30_oEC for a threaded run, I am getting bit-for-bit in the atm.log through the end of 5 days. So, the issue is with the Intel compiler and/or compiler flags.

That said, the run built with "-DCOLUMN_OPENMP" is slower than the one without by around 18% for this PE layout (comparing my recent runs), so there are two data points indicating this.

amametjanov commented 8 years ago

Pat, can you try a run with vthreads set to NTHRDS_ATM from env_mach_pes.xml?

worleyph commented 8 years ago

This particular run (on Titan) uses 2700 processes and 4 threads per process, so twice as many as can be used in the dynamics without nesting. Do you want me to set vthreads to "2" or "4", or run a different case, e.g. 2700x2?

amametjanov commented 8 years ago

If the baseline nthrds_atm is 4, please try with vthreads=4 in user_nl_cam and nthrds_atm unchanged. Physics will use 4 threads; in the dynamics, the horizontal loops will use 1 thread and the vertical loops 4.
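
In other words, leave NTHRDS_ATM=4 in env_mach_pes.xml as-is and just add the line below to user_nl_cam (assuming the variable is spelled vthreads, as above):

 vthreads = 4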

worleyph commented 8 years ago

Perhaps we need to create a new issue (or issues). On Edison, running threaded with "-DCOLUMN_OPENMP", threaded without "-DCOLUMN_OPENMP", and MPI-only (same number of nodes and same process placement as the threaded runs) all diverge from each other at nstep 2, at around the same number of digits.

I have an MPI-only job in the queue on Titan, to see whether this is also b4b for this experiment. The Cori failure appears to also be a numerical issue? But one that is tickled by the column OpenMP restructuring? Perhaps the code is not broken, but rather we were lucky before? I'll try an MPI-only run on Cori next.

mt5555 commented 8 years ago

It could very well be that threading is not BFB (independent of the new COLUMN_OPENMP code), and we never noticed because it's not tested. It could even be one of the other WCYCL components is not BFB w.r.t. changing threads, and the COLUMN_OPENMP code is fine.
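
For example, any threaded sum along the lines of the sketch below (placeholder names, not a specific ACME routine) combines per-thread partial sums in a thread-count-dependent order, so the last digits of a quantity like te can change with the thread layout:

 ! illustrative sketch of a non-BFB threaded sum; all names are placeholders
 function total_energy(nelem, element_energy) result(te)
   implicit none
   integer,      intent(in) :: nelem
   real(kind=8), intent(in) :: element_energy(nelem)
   real(kind=8) :: te
   integer :: ie
   te = 0.0_8
   ! the reduction adds per-thread partial sums in an unspecified order,
   ! so floating-point round-off differs between thread layouts
 !$OMP PARALLEL DO REDUCTION(+:te)
   do ie = 1, nelem
      te = te + element_energy(ie)
   end do
 end function total_energy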

But it seems like we need to

  1. disable COLUMN_OPENMP,
  2. get some threading tests into ACME,
  3. get ACME to pass those thread tests,
  4. then re-enable COLUMN_OPENMP?
worleyph commented 8 years ago

@amametjanov , running with vthreads=4 is a little slower than the default (vthreads=1), and slower still than when not using -DCOLUMN_OPENMP. Note that (this is very hazy) it is my understanding that you have to do something special to enable nested OpenMP on Titan when using PGI? You should ask help@nccs.gov about this. I have an e-mail from September 15 referring to a different code trying to use nested OpenMP. I'll forward it to you. You should verify what it says before acting on it.

worleyph commented 8 years ago

@mt5555 and @mrnorman , is @amametjanov 's nested OpenMP relevant to Titan, or does the openacc work make this irrelevant? If the latter, perhaps it is a waste of time to track down how to get this to work on Titan?

mt5555 commented 8 years ago

I'm pretty sure nested openMP will NOT be used on titan.

However, we did intend to remove the ifdefs around all the nested/2nd level omp directives for code simplicity. But if the presence of these calls hurts performance, we might need to keep all those ifdefs?

worleyph commented 8 years ago

Probably the wrong place, but since the current discussion is on nested OpenMP (-DCOLUMN_OPENMP), just a note that we need to add a check against setting vthreads incorrectly. For example, on Titan with a 675x2 decomposition in the atmosphere I accidentally set vthreads = 4. This failed with the error message:

 0: Subscript out of range for array dom_mt (xxx/cam/src/dynamics/se/dyn_comp.F90: 283)
      subscript=1, lower bound=0, upper bound=-1, dimension=1

and

 0: Subscript out of range for array elem (xxx/cam/src/dynamics/se/dyn_comp.F90: 327)
     subscript=675, lower bound=1, upper bound=8, dimension=1

@amametjanov , if you agree that this needs to be fixed, please create a new github issue. We can also look at removing -DCOLUMN_OPENMP for PGI and PGI_ACC, unless you believe that the performance issues are Titan-specific.
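
Something along the lines of the following sketch is what I have in mind (hypothetical names; the actual variables in dyn_comp.F90 are different):

 ! hypothetical check with placeholder names, not the actual dyn_comp.F90 code
 subroutine check_vthreads(vthreads, nthreads_atm)
   use abortutils, only: endrun   ! CAM's abort routine, per the stack trace above
   implicit none
   integer, intent(in) :: vthreads, nthreads_atm
   if (vthreads < 1 .or. mod(nthreads_atm, vthreads) /= 0) then
      call endrun('dyn_comp: vthreads must divide evenly into NTHRDS_ATM')
   end if
 end subroutine check_vthreads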

worleyph commented 8 years ago

I had some experiments complete on Edison (have had a job in the queue on Cori without any progress all day).

a) in ATM, '-fp-model source' and '-fp-model precise' generate identical results (looking at atm.log through 240 steps). Since this is a coupled run, I assume that the same is true for the other components, though I have not tried to verify this.

b) There was no evidence of performance degradation from building with nested OpenMP support (-DCOLUMN_OPENMP). Ran with default vthreads (vthreads=1), and performance seemed pretty much unchanged.

c) Threading was pretty minimal in this case (2700x2 ATM). I did try running with vthreads = 2, and performance was worse in this case, all in the dynamics as far as I can tell. @amametjanov , what should I expect? Shall I try 5400x1 (so that there is no horizontal threading in the dynamics unless we use the vertical)? Is there a way to tell if vertical threading is working? (Any timers inside of vertically threaded regions?) I can add one to test. Also, the numerics with vthreads=1 and vthreads=2 are identical in these runs.

mt5555 commented 8 years ago

did you turn off hyperthreading?

The most interesting run for me would be: 5400x1, but using 675 nodes. So you have 8 elements per node, and 8 MPI ranks per node. Then use vthreads=4 (or vthreads=8 with hyperthreading)

worleyph commented 8 years ago

a) hyperthreading on

b) do you mean 5400x3 (no hyperthreading) or 5400x6 (with hyperthreading), and then vthreads=3 or 6? Edison has 24 cores (48 hyperthreads) per node.

mt5555 commented 8 years ago

Sorry, I was thinking Cori.

On Edison, I think pure MPI will perform the best, and vthreads is the only way to improve on this. Thus an extreme case would be 6 elements per node, each with 4 vthreads. Using Az's notation (MPI x hthread x vthread):

 5400x1x4 on 900 nodes
 2700x2x2 on 900 nodes
 5400x1x8 on 900 nodes, with hyperthreading

worleyph commented 8 years ago

I built ACME with "-fp-model precise", and it died even more quickly on Cori (in initialization, rather than after timestep 0); I then built with "-fp-model strict", and it died in the same general location as the run with "-fp-model source". The numerics diverge between strict and source already by step 1.

However, this latest run actually generated an error message:

 2286:  Error: Forcing height is below canopy height for pft index
 2286:  calling getglobalwrite with decomp_index=       185290  and clmlevel= pft
 2286:  local  pft      index =       185290
 2286:  ERROR: get_proc_bounds ERROR: Calling from inside  a threaded region
 2286: Image              PC                Routine            Line        Source
 2286: cesm.exe           00000000048C6E53  Unknown               Unknown  Unknown
 2286: cesm.exe           0000000003579606  shr_sys_mod_mp_sh         230  shr_sys_mod.F90
 2286: cesm.exe           0000000002740EF0  decompmod_mp_get_         226  decompMod.F90
 2286: cesm.exe           0000000002B206B6  getglobalvaluesmo          43  GetGlobalValuesMod.F90
 2286: cesm.exe           0000000002B201BA  getglobalvaluesmo         130  GetGlobalValuesMod.F90
 2286: cesm.exe           0000000002713180  abortutils_mp_end          69  abortutils.F90
 2286: cesm.exe           0000000002A92359  canopyfluxesmod_m         706  CanopyFluxesMod.F90
 2286: cesm.exe           00000000027159AF  clm_driver_mp_clm         516  clm_driver.F90

Since this is in the land, I am uncertain why -DCOLUMN_OPENMP would have any effect there. Perhaps @amametjanov or @mt5555 might have some insight. Also @bishtgautam , any insight into the canopy height error message - expected? innocuous? serious? (Only showed up in this build with strict.)

worleyph commented 8 years ago

Since this has turned into a general discussion of nested OpenMP, some new results on Edison:

I added a timer inside of a few of the COLUMN_OPENMP loops, just to verify that threading was occurring. This was with 2700x2 and vthreads=2 (so 2700x1x2 by @amametjanov's conventions). Threading did appear to be happening. However, for this case, the loops being parallelized are very fast already, and not prime candidates for threading?

 Thd                                             On  Called Recurse   Wallclock          max          min     UTR Overhead
 000 col_omp_test3                               -   312336    -       0.099435     0.000048     0.000000         0.019957
 001 col_omp_test3                               -   312336    -       0.061617     0.000064     0.000000         0.019957
 SUM col_omp_test3                               -   624672    -       0.161052     0.000064     0.000000         0.039914

The estimated timer overhead is an underestimate, so the true cost is even smaller. (This is in advance_hypervis_dp, and the cost for the whole routine call is 5.4 seconds.) I looked at all three of the threaded loops in this routine, and the others were even faster.

I then ran with vthreads=1, to see how the timings changed, and the timers died, taking down the code as well. It appears that the thread id is 'indeterminate' within the nested OpenMP loop when there is already threading external to the loop, even when vthreads = 1.

At a minimum, we should not put timers inside of inner nested OpenMP loops (probably a good idea anyway), but I do not know if this is an indicator of something else going on that needs to be fixed. Note that I have not drilled down to definitively diagnose this, but there are GPTL error messages that seem to support my current hypothesis.
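
For the record, the placement that does appear to be safe is to bracket the whole inner region with the timer, along the lines of this sketch (the loop body is just a placeholder; t_startf/t_stopf are the usual perf_mod wrappers):

 ! sketch only; the inner (COLUMN_OPENMP) region is timed from outside
 subroutine timed_column_loop(nlev)
   use perf_mod, only: t_startf, t_stopf
   implicit none
   integer, intent(in) :: nlev
   integer :: k
   call t_startf('col_omp_test3x')   ! timer outside the nested parallel region
 !$OMP PARALLEL DO PRIVATE(k)
   do k = 1, nlev
      ! vertical (column) work goes here
   end do
   call t_stopf('col_omp_test3x')
 end subroutine timed_column_loop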

worleyph commented 8 years ago

Just checked, A_WCYCL2000 ne30_oEC continues to run for 5 days on Cori when compiled without -DCOLUMN_OPENMP, including with '-fp-model strict'.

worleyph commented 8 years ago

One last Edison test today ... I put timers outside of the inner OpenMP loops and compared the time when running with two threads in the outer loop and one in the inner:

 col_omp_test3x     -     4338    -       0.060451     0.000063     0.000008         0.000250

 000 col_omp_test3x       -     4338    -       0.060451     0.000063     0.000008         0.000250
 001 col_omp_test3x       -     4338    -       0.063689     0.000151     0.000007         0.000250
 SUM col_omp_test3x       -     8676    -       0.124140     0.000151     0.000007         0.000501

and when running 1 thread in the outer and two in the inner:

 col_omp_test3x     -     8676    -       0.109782     0.002048     0.000008         0.000745

So the inner-only parallel efficiency is less than the outer-only loop efficiency (I think @amametjanov wanted me to look at that in one of the many e-mails). Perhaps there is a thread start-up cost that hurts threading this particular loop?

@mt5555 is interested in inner-loop threading when all of the outer parallelism has been assigned to MPI (in the dynamics). That will require a different experiment, but these particular loops are pretty lightweight, so they are unlikely to speed up much more. I'll construct this experiment, but it may sit in the Edison queue for a while.

Nested OpenMP is clearly a critical technology, but it is not doing much on Edison, is counterproductive on Titan, and breaks on Cori (probably a Cori issue, however). Not sure what the next move is, but I will let @amametjanov and @mt5555 take it from here.