Closed worleyph closed 8 years ago
Checked out a fresh version of master. It also dies, in/after nstep 14, in the atmosphere or ice (these overlap on processes 0-2047). Since only ATM has threading, it is probably ATM.
srun: error: nid01056: task 1208: Exited with exit code 174
srun: error: nid01084: task 1722: Exited with exit code 174
I then modified env_mach_specific to make this MPI-only (but with the same layout and the same number of processes per node), and this worked fine. So it is an OpenMP issue, and a recent one?
Thanks Pat. Trying to reproduce on a smaller case with -compset FC5AV1C-01 -res ne30_oEC and the same user_nl_cam.
Couldn't open your run dir. Based on the stack trace, there are no OpenMP constructs in those routines, which might mean that the default thread stack size of 64M in env_mach_specific is not enough.
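One cheap thing to rule out first is the per-thread stack. A sketch in sh syntax (env_mach_specific itself uses csh `setenv`; the 128M value is just a doubling of the current default):

```shell
# Sketch: raise the per-thread OpenMP stack before srun, in case the 64M
# default from env_mach_specific is too small for these routines.
export OMP_STACKSIZE=128M    # stack for each OpenMP worker thread
export OMP_NUM_THREADS=4
echo "OMP_STACKSIZE=$OMP_STACKSIZE OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

The master thread's stack is governed separately by the shell/batch `stacksize` limit, so that may need raising as well.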
Please also check out
~worleyph/ACME/master/ACME/cime/scripts/A_WCYCL2000.ne30_oEC.corip1_master_DJb2
/global/cscratch1/sd/worleyph/acme_scratch/A_WCYCL2000.ne30_oEC.corip1_master_DJb2
(failed, with OpenMP)
~worleyph/ACME/master/ACME/cime/scripts/A_WCYCL2000.ne30_oEC.corip1_master_DJb3
/global/cscratch1/sd/worleyph/acme_scratch/A_WCYCL2000.ne30_oEC.corip1_master_DJb3
(worked, without OpenMP). I have changed permissions.
Note that the only differences between the working and non-working source are in the following:
cime/driver_cpl/driver/cesm_comp_mod.F90
cime/machines-acme/config_compilers.xml
components/cam/bld/build-namelist
components/cam/bld/configure
components/cam/bld/namelist_files/namelist_defaults_cam.xml
components/cam/bld/namelist_files/namelist_definition.xml
components/cam/src/dynamics/se/dyn_comp.F90
components/clm/src/biogeochem/CNAllocationMod.F90
components/clm/src/biogeochem/CNNitrogenFluxType.F90
components/clm/src/biogeochem/PhosphorusFluxType.F90
components/clm/src/cpl/lnd_comp_esmf.F90
components/clm/src/cpl/lnd_comp_mct.F90
components/clm/src/main/clm_driver.F90
components/clm/src/main/clm_initializeMod.F90
components/homme/src/share/namelist_mod.F90
components/homme/src/share/thread_mod.F90
Nothing there obviously requires additional stack space, though perhaps just the option of using nested OpenMP could increase the thread stack requirements?
@amametjanov : can nested be disabled by removing "-DCOLUMN_OPENMP" from cam/bld/configure?
I ran the FC5AV1C compset with 4 threads on Edison, using master, it ran fine for 1 day: ./create_test -c SMS_Ld1_P96x4.ne30_ne30.FC5AV1C
I'm trying to run SMS.ne30_oEC.A_WCYCL2000 on Edison, with threads, next...
Yes, removing -DCOLUMN_OPENMP from components/cam/bld/configure will disable nested threading.
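For anyone following along, removing the define could look something like this. This is a hedged sketch: the real flag lives in components/cam/bld/configure, a temp stand-in file is used here so the commands are self-contained, and the actual line in configure may not match this pattern (GNU sed assumed for -i):

```shell
# Sketch: strip -DCOLUMN_OPENMP from a configure-style cppdefs line.
# The stand-in file and its contents are illustrative only.
cfg=$(mktemp)
printf 'my $cppdefs = "-DSPMD -DCOLUMN_OPENMP";\n' > "$cfg"
sed -i 's/ -DCOLUMN_OPENMP//' "$cfg"
cat "$cfg"    # -> my $cppdefs = "-DSPMD";
```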
Reran Cori experiment with twice the threadstack size:
setenv OMP_STACKSIZE 128M
and it died in the same general way (after 12 timesteps?). It could be that the simulation is going off the rails, but I also got the same error when running with the new ocean coupling and ocean timestep, so that is not the source of the problem.
I'm continuing to run my reproducer: have a job queued on Edison (to see whether this also fails), and am building on Cori without "-DCOLUMN_OPENMP". I'll be offline for much of the rest of today, but will report back results as I have the opportunity.
My Cori job built without "-DCOLUMN_OPENMP" (and with the original thread stack size) worked fine.
Also, comparing atm.log for the builds with and without "-DCOLUMN_OPENMP", the nstep output already differs at nstep=1. Since vthreads == 1, the expectation is that this would be bit-for-bit, isn't it?
(with "-DCOLUMN_OPENMP")
nstep, te 1 0.33434733703741007E+10 0.33433343412495546E+10 -0.76886859819777350E-02 0.98519958019889353E+05
vs. (without "-DCOLUMN_OPENMP")
nstep, te 1 0.33432892434965711E+10 0.33433343412495546E+10 0.24940275090983132E-02 0.98519958019822065E+05
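For scale, a quick sketch of the relative difference of the first column in the two nstep=1 lines above (my arithmetic, values copied from the logs; awk because the magnitudes exceed shell integer arithmetic):

```shell
# Sketch: relative difference of the nstep=1 total-energy values above.
awk 'BEGIN {
  w  = 0.33434733703741007e10;   # with -DCOLUMN_OPENMP
  wo = 0.33432892434965711e10;   # without -DCOLUMN_OPENMP
  printf "relative difference at nstep=1: %.3e\n", (w - wo) / wo
}'
```

That comes out around 5.5e-05, i.e. the runs agree to only about 5 significant digits at the very first step, which is far above roundoff level.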
Is this a Cori-only problem? It might be v16 of the Intel compiler. There is also a Cori-only problem with ocean restart ( #774 ).
Is this a Cori-only problem?
Edison experiment is in the queue. I'll try Titan later today as well. It doesn't "feel" like a Cori-only problem, but we'll know more by the end of the day.
!!! Worked on Edison.
So Cori is just the problem child. I'll still check whether Titan passes this test. I'll also run without "-DCOLUMN_OPENMP" on Edison, to look at whether the numerics are the same.
On Edison, atm.log is bit-for-bit whether compiled with or without "-DCOLUMN_OPENMP" for nstep=1. For nstep=2, it does differ though:
nstep, te 2 0.33433588836394591E+10 0.33433358640319705E+10 -0.12730496217083795E-02 0.98519712566572896E+05
vs.
nstep, te 2 0.33433588817648101E+10 0.33433358632926588E+10 -0.12729868340587436E-02 0.98519712587624846E+05
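A sketch of the same relative-difference arithmetic for the two nstep=2 values above (my numbers, copied from the log lines):

```shell
# Sketch: relative difference of the nstep=2 total-energy values above.
awk 'BEGIN {
  a = 0.33433588836394591e10;   # with -DCOLUMN_OPENMP
  b = 0.33433588817648101e10;   # without -DCOLUMN_OPENMP
  printf "relative difference at nstep=2: %.3e\n", (a - b) / b
}'
```

That is around 5.6e-10: the two Edison runs still agree to about 9 significant digits, several orders of magnitude closer than the Cori nstep=1 divergence.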
So, is this expected?
Not expected, and I believe this is a problem.
I merged in the nested openmp pull request after it passed all the tests on 'next', under the assumption that we tested threaded configurations, but I now suspect this is a large gap in our testing and that -DCOLUMN_OPENMP has not been properly tested.
Our default PE configuration for the WCYCL compset was set to pure-MPI mode to get the watercycle prototype running out-of-the-box. This is probably why the acme_integration test suite on skybridge didn't flag the nested openmp PR. I'll submit a PR to modify the default WCYCL PEs for Cori and Edison based on Pat's recent small PE baselines. The skybridge nightly test should then flag it.
Job ran on Titan; I am in the process of building/rerunning without "-DCOLUMN_OPENMP" to look at numerics. Note that the finished job was slower in CAM than in my previous performance studies - given the potential performance variability, this may not be "real". However, performance of the other components (ice, ocean, land) was not significantly different.
So, I am running so many experiments that I am concerned I am not keeping track correctly, however ...
on Titan, running with and without "-DCOLUMN_OPENMP" for A_WCYCL2000.ne30_oEC for a threaded run, I am getting bit-for-bit in the atm.log through the end of 5 days. So, the issue is with the Intel compiler and/or compiler flags.
That said, the run built with "-DCOLUMN_OPENMP" is slower than without by around 18% for this PE layout (comparing my recent runs), so there are two data points indicating this.
Pat, can you try a run with vthreads set to NTHRDS_ATM from env_mach_pes.xml?
This particular run (on Titan) uses 2700 processes and 4 threads per process, so twice as many as can be used in the dynamics without nesting. Do you want me to set vthreads to "2" or "4", or run a different case, e.g. 2700x2?
If baseline nthrds_atm is 4, please try with vthreads=4 in user_nl_cam and nthrds_atm unchanged. Physics will use 4 threads; horizontal threads in dynamics will use 1, vertical 4.
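In other words, the expected split can be sketched like this (variable names here are illustrative, not actual model settings):

```shell
# Sketch of how the per-process threads are expected to split under nested
# OpenMP: vthreads go to the vertical, the remainder to the horizontal.
NTHRDS_ATM=4                          # per-process thread count (env_mach_pes.xml)
VTHREADS=4                            # vthreads from user_nl_cam
HORIZ=$(( NTHRDS_ATM / VTHREADS ))    # outer (horizontal) team in dynamics
echo "physics: ${NTHRDS_ATM} threads; dynamics: ${HORIZ} horizontal x ${VTHREADS} vertical"
```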
Perhaps we need to create a new issue (or issues). On Edison, the threaded run with "-DCOLUMN_OPENMP", the run without "-DCOLUMN_OPENMP", and the MPI-only run (same number of nodes and process placement as the threaded run) all diverge from one another at nstep 2, at around the same digit.
I have an MPI-only job in the queue on Titan, to see whether this is also b4b for this experiment. The Cori failure appears to also be a numerical issue? But one that is tickled by the column openmp restructuring? Perhaps the code is not broken, but rather we were lucky before? I'll try an MPI-only on Cori next.
It could very well be that threading is not BFB (independent of the new COLUMN_OPENMP code), and we never noticed because it's not tested. It could even be one of the other WCYCL components is not BFB w.r.t. changing threads, and the COLUMN_OPENMP code is fine.
But it seems like we need to start testing threaded configurations either way.
@amametjanov , running with vthreads=4 is a little slower than the default (vthreads=1), and slower still than when not using -DCOLUMN_OPENMP. Note (though this is very hazy) that it is my understanding that you have to do something special to enable nested OpenMP on Titan when using PGI. You should ask help@nccs.gov about this. I have an e-mail from September 15 referring to a different code trying to use nested OpenMP. I'll forward it to you. You should verify what it says before acting on it.
@mt5555 and @mrnorman , is @amametjanov 's nested OpenMP relevant to Titan, or does the openacc work make this irrelevant? If the latter, perhaps it is a waste of time to track down how to get this to work on Titan?
I'm pretty sure nested openMP will NOT be used on titan.
However, we did intend to remove the ifdefs around all the nested/2nd level omp directives for code simplicity. But if the presence of these calls hurts performance, we might need to keep all those ifdefs?
Probably the wrong place for this, but since the current discussion is on nested OpenMP (-DCOLUMN_OPENMP), just a note that we need to add a check against setting vthreads incorrectly. For example, on Titan with a 675x2 decomposition in the atmosphere, I accidentally set vthreads = 4. This failed with the error message:
0: Subscript out of range for array dom_mt (xxx/cam/src/dynamics/se/dyn_comp.F90: 283)
subscript=1, lower bound=0, upper bound=-1, dimension=1
and
0: Subscript out of range for array elem (xxx/cam/src/dynamics/se/dyn_comp.F90: 327)
subscript=675, lower bound=1, upper bound=8, dimension=1
@amametjanov , if you agree that this needs to be fixed, please create a new github issue. We can also look at removing -DCOLUMN_OPENMP for PGI and PGI_ACC, unless you believe that the performance issues are Titan-specific.
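A sketch of the kind of sanity check being requested (the real fix would be in Fortran in dyn_comp.F90; everything here, including the function name, is illustrative):

```shell
# Sketch: vthreads must be at least 1 and evenly divide the per-process
# thread count, otherwise reject the configuration up front.
check_vthreads () {
  nthrds=$1; vthreads=$2
  if [ "$vthreads" -lt 1 ] || [ $(( nthrds % vthreads )) -ne 0 ]; then
    echo "ERROR: vthreads=$vthreads must evenly divide NTHRDS_ATM=$nthrds"
  else
    echo "ok: $(( nthrds / vthreads )) horizontal x $vthreads vertical"
  fi
}
check_vthreads 2 4    # the 675x2, vthreads=4 case above: rejected cleanly
check_vthreads 4 2    # a valid split
```

A check like this would turn the out-of-bounds subscript crash above into an immediate, readable error.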
I had some experiments complete on Edison (have had a job in the queue on Cori without any progress all day).
a) in ATM, '-fp-model source' and '-fp-model precise' generate identical results (looking at atm.log through 240 steps). Since this is a coupled run, I assume the same is true for the other components, though I have not tried to verify this.
b) There was no evidence of performance degradation from building with nested OpenMP support (-DCOLUMN_OPENMP). Ran with default vthreads (vthreads=1), and performance seemed pretty much unchanged.
c) Threading was pretty minimal in this case (2700x2 ATM). I did try running with vthreads = 2, and performance was worse in this case, all in the dynamics as far as I can tell. @amametjanov , what should I expect? Shall I try 5400x1 (so that there is no horizontal threading in the dynamics unless we use the vertical)? Is there a way to tell if vertical threading is working? (Any timers inside of vertically threaded regions?) I can add one to test. Also, the numerics with vthreads=1 and vthreads=2 are identical in these runs.
did you turn off hyperthreading?
The most interesting run for me would be: 5400x1, but using 675 nodes. So you have 8 elements per node, and 8 MPI ranks per node. Then use vthreads=4 (or vthreads=8 with hyperthreading)
a) hyperthreading on
b) do you mean 5400x3 (no hyperthreading) or 5400x6 (with hyperthreading), and then vthreads=3 or 6? Edison has 24 cores (48 hyperthreads) per node.
Sorry, I was thinking Cori.
On Edison, I think pure MPI will perform the best, and vthreads is the only way to improve on this. Thus an extreme case would be 6 elements per node, each with 4 vthreads. Using Az's notation (MPI x hthread x vthread):
5400x1x4 on 900 nodes
2700x2x2 on 900 nodes
5400x1x8 on 900 nodes, with hyperthreading
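Spelling out the per-node arithmetic those layouts imply (a sketch; plain division, with actual placement left to the launcher):

```shell
# Sketch: ranks and hardware threads per node for the layouts above,
# in the MPI x hthread x vthread notation, on 900 nodes.
nodes=900
for layout in 5400x1x4 2700x2x2 5400x1x8; do
  ranks=${layout%%x*}; rest=${layout#*x}
  h=${rest%%x*}; v=${rest#*x}
  per_node=$(( ranks / nodes ))
  echo "$layout: $per_node ranks/node, $(( per_node * h * v )) hardware threads/node"
done
```

The 5400x1x4 case lands on 24 hardware threads per node (Edison's core count), and 5400x1x8 on 48 (the hyperthread count).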
I built ACME with "-fp-model precise", and it died even more quickly on Cori (in initialization, rather than after timestep 0); I then built with "-fp-model strict", and it died in the same general location as the run with "-fp-model source". The numerics diverge between strict and source already by step 1.
However, this latest run actually generated an error message:
2286: Error: Forcing height is below canopy height for pft index
2286: calling getglobalwrite with decomp_index= 185290 and clmlevel= pft
2286: local pft index = 185290
2286: ERROR: get_proc_bounds ERROR: Calling from inside a threaded region
2286: Image PC Routine Line Source
2286: cesm.exe 00000000048C6E53 Unknown Unknown Unknown
2286: cesm.exe 0000000003579606 shr_sys_mod_mp_sh 230 shr_sys_mod.F90
2286: cesm.exe 0000000002740EF0 decompmod_mp_get_ 226 decompMod.F90
2286: cesm.exe 0000000002B206B6 getglobalvaluesmo 43 GetGlobalValuesMod.F90
2286: cesm.exe 0000000002B201BA getglobalvaluesmo 130 GetGlobalValuesMod.F90
2286: cesm.exe 0000000002713180 abortutils_mp_end 69 abortutils.F90
2286: cesm.exe 0000000002A92359 canopyfluxesmod_m 706 CanopyFluxesMod.F90
2286: cesm.exe 00000000027159AF clm_driver_mp_clm 516 clm_driver.F90
Since this is in the land, I am uncertain why -DCOLUMN_OPENMP would have any effect there. Perhaps @amametjanov or @mt5555 might have some insight. Also @bishtgautam , any insight into the canopy height error message - expected? Innocuous? Serious? (It only showed up in this build with strict.)
Since this has turned into a general discussion of nested OpenMP, some new results on Edison:
I added a timer inside of a few of the COLUMN_OPENMP loops, just to verify that threading was occurring. This was with 2700x2 and vthreads=2 (so 2700x1x2 by @amametjanov's conventions). Threading did appear to be happening. However, for this case, the loops being parallelized are very fast already, and not prime candidates for threading?
Thd On Called Recurse Wallclock max min UTR Overhead
000 col_omp_test3 - 312336 - 0.099435 0.000048 0.000000 0.019957
001 col_omp_test3 - 312336 - 0.061617 0.000064 0.000000 0.019957
SUM col_omp_test3 - 624672 - 0.161052 0.000064 0.000000 0.039914
The estimated timer overhead is an underestimate, so the true cost is even smaller. (This is in advance_hypervis_dp, and the cost for the whole routine call is 5.4 seconds.) I looked at all three of the threaded loops in this routine, and the others were even faster.
I then ran with vthreads=1, to see how the timings changed, and the timers died, taking down the code as well. It appears that the thread id is 'indeterminate' within the nested OpenMP loop when there is already threading external to the loop, even when vthreads = 1.
At the minimum, we should not put timers inside of inner nested OpenMP loops (probably a good idea anyway), but I do not know if this is an indicator of something else that needs to be fixed. Note that I have not drilled down to definitively diagnose this, but there are GPTL error messages that seem to support my current hypothesis.
Just checked, A_WCYCL2000 ne30_oEC continues to run for 5 days on Cori when compiled without -DCOLUMN_OPENMP, including with '-fp-model strict'.
One last Edison test today ... put timers outside of inner OpenMP loops, and compared time when running with two threads in the outer loop and 1 in the inner:
col_omp_test3x - 4338 - 0.060451 0.000063 0.000008 0.000250
000 col_omp_test3x - 4338 - 0.060451 0.000063 0.000008 0.000250
001 col_omp_test3x - 4338 - 0.063689 0.000151 0.000007 0.000250
SUM col_omp_test3x - 8676 - 0.124140 0.000151 0.000007 0.000501
and when running 1 thread in the outer and two in the inner:
col_omp_test3x - 8676 - 0.109782 0.002048 0.000008 0.000745
So the inner-only parallel efficiency is lower than the outer-only efficiency (I think @amametjanov wanted me to look at that, in one of the many e-mails). Perhaps there is a thread start-up cost that hurts threading this particular loop?
@mt5555 is interested in inner loop threading when all of the outer parallelism has been assigned to MPI (in the dynamics). That will require a different experiment, but these particular loops are pretty lightweight, so are unlikely to speed up much more. I'll construct this experiment, but it may sit in the Edison queue for awhile.
Nested OpenMP is clearly a critical technology, but it is not doing much on Edison, is counterproductive on Titan, and breaks on Cori (probably a Cori issue, however). Not sure what the next move is, but I will let @amametjanov and @mt5555 take it from here.
This may be premature, but I would like help verifying, or not, the following issue. I have been running extensive experiments with different PE layouts, including different levels of threading, on Cori, Edison, and Titan. These have been working fine, with the version of master right after adding DRV_THREADING support.
I recently moved to the most recent version, with nested OpenMP support, and this is working fine with MPI only. However, a first test with MPI+OpenMP resulted in a failure of the following form one time, and just a failure without an error message another time. (Note, this did not specify vertical threads.)
I'll continue looking at this, but it would be helpful if someone else could reproduce the problem as I also have some other modifications in the code (which should be irrelevant, but who knows).
Again, the compset, resolution, and machine are
and the env_mach_pes.xml file
from https://acme-climate.atlassian.net/wiki/x/NoD4Ag . The user_nl_cam was also modified, as per instructions on this page. My case directory (with these files) is
Also note that I have not tried this particular experiment anywhere else yet, so do not know yet whether this is a Cori-specific problem.