Edison slow for ne30 A_WCYCL*

PeterCaldwell commented 7 years ago

@golaz and I have noticed that A_WCYCL* simulations have gotten ~1/3 slower on Edison than we had a few months ago using both the 173 (M) and 375 (L) default PE layouts. This is harming our ability to get the coupled tuning done.

I spent some time on Friday plotting the timings for beta0 through beta1.04 simulations on 375 nodes: http://portal.nersc.gov/project/acme/coupled/beta/getting_worse.png . This plot convinced me that machine issues rather than worsening code is responsible for this slowdown. I came to this conclusion because A). most of the simulation time is spent in atm, yet atm hasn't changed over subsequent beta simulations and B). all of the components which take appreciable time are getting slower by roughly the same rate, so it's unlikely to be any single code change that is causing problems.

However, the fact that atm takes ~2x more time than ocn, yet ocn gets its own dedicated nodes which sit idle for idle_time = ocn_time - (atm_time + lnd_time + cpl_time + etc) makes me think that these layouts aren't very optimal. Could someone check into improving this situation? We'd like faster configurations which take ~175 nodes (the max size that fits into our 'special_acme' queue) and 375 nodes or so.

In case you want to look at detailed timing info, the simulations I used for the plot were beta1, beta1_02, beta1_03, beta1_04 under /scratch2/scratchdirs/golaz/ACME_simulations. I'm personally struggling with /global/cscratch1/sd/petercal/ACME_simulations/20170331.r5a_gamma0.3.ne30_oECv3_ICG.edison/

jonbob commented 7 years ago

@PeterCaldwell - if no one from the performance group is available, I can work on this (as soon as edison is back up)...

PeterCaldwell commented 7 years ago

I'm wondering in particular if we could request more than 5400 cores for physics by only running dynamics on a subset of the processors requested for atm. My understanding is that NCAR is already doing this ( @gold2718 - do you have a comment on this?). I've been looking at overdecomposing atmos in this way recently with Aaron Donahue ( @lazarusM3B ) as part of our parallel phys/dyn effort and it seems easy to get >20% speedup this way. If we're idling tons of our cores anyways, why not use them for atmos?

worleyph commented 7 years ago

@PeterCaldwell , you get the behavior you are describing by using more OpenMP threads , e.g. 5400x2 or 5400x4 (5400 will use all of the MPI parallelism in the dynamics, and the threads are useful in the physics.)

Running dynamics on a subset of processes has been supported for FV and EUL for a long time, but was broken at some point for SE. Perhaps NCAR fixed this?

gold2718 commented 7 years ago

I think this works in CESM but it is broken in ACME and the decision seems to have been to not fix it now (https://github.com/ACME-Climate/ACME/issues/1083). I will probably have to fix this at some point to get elevation classes working but that is months down the road.

PeterCaldwell commented 7 years ago

Is the OpenMP threading you're talking about the "vertical threading" I keep hearing about? It seems like the performance improvements from that were disappointing (though maybe I'm remembering this incorrectly)?

I was actually talking about running dynamics on a subset of processes. It was broken in SE but Aaron Donahue managed to fix it... Apparently there are just a few lines of code which need to be changed. We would be happy to issue a PR with this change if the Performance team was interested in testing it.

worleyph commented 7 years ago

Is the OpenMP threading you're talking about the "vertical threading" I keep hearing about?

No. Just use a PE layout with 5400x2. The extra thread will be idle in the dynamics and used in the physics. This is how things have worked for a long time, and is why running the dynamics on a subset of processes has not been important since OpenMP support (in the code and on the computer systems) became reasonable.

worleyph commented 7 years ago

We would be happy to issue a PR with this change if the Performance team was interested in testing it.

I'd be interested in testing it. Always nice to have another tuning knob. Dynamics will likely have lower MPI overhead if packed into a fewer number of nodes. But d_p/p_d_coupling is more expensive. Hard to predict which will be better.

PeterCaldwell commented 7 years ago

Ok, we will make that change.

worleyph commented 7 years ago

@PeterCaldwell , note that if you do enable vertical threading, then 5400x2 will have the extra thread do something useful in the dynamics as well, but this is not required to get both threads working in the physics.

mt5555 commented 7 years ago

I think using threads is the better way to go. But for the record, you can use more than 5400 MPI tasks - you just have to set a namelist variable (dyn_npes = 5400) that specifies how many of the MPI tasks should be used for dynamics. #1083 fixed a segfault if the user didn't specify dyn_npes.

gold2718 commented 7 years ago

I thought the 'fix' of @1083 was to abort. In prim_init1 (prim_driver_mod.F90), we have:

       ! we want to exit elegantly when we are using too many processors.
       if (nelem < par%nprocs) then
          call abortmp('Error: too many MPI tasks. set dyn_npes <= nelem')
       end if

mt5555 commented 7 years ago

That was an early version - I think this was the code that finally went in: https://github.com/ACME-Climate/ACME/commit/a751aa755ac5a296e7a1fdebf409f348395e5793

! we want to exit elegantly when we are using too many processors.
if (nelem < par%nprocs) then
call abortmp('Error: too many MPI tasks. set dyn_npes <= nelem')
end if

gold2718 commented 7 years ago

These two code snippets are the same and mine if from the current master.

mt5555 commented 7 years ago

oh right, sorry about that :-)

so if you dont set dyn_npes, you now get that helpfull message and the code aborts. If you do set dyn_npes correctly, then the code runs and the physics might actually run a little faster.

amametjanov commented 7 years ago

According to @helenhe40, our scripts may need to export FORT_BUFFERED=yes -- probably in env_mach_specific.xml. This variable used to be set and was recently removed, because Intel v17 was failing when it was set. When it is not set, the IO may slow down with any Intel compiler version. This may explain the slower throughput.

More details: https://www.nersc.gov/users/computational-systems/edison/updates-and-status/open-issues/fortran-buffered-io-with-intel-compilers-is-no-longer-enabled-by-default-on-edison/

ndkeen commented 7 years ago

Good catch @amametjanov. Can we track the onset of slow performance to around March 16th?

AaronDonahue commented 7 years ago

When I first started experimenting with running on larger core counts than the number of elements I ran into that error message and subsequently changed the dyn_npes as suggested. This led to a segmentation fault which I tracked down to the boundary exchange in homme;

homme/src/share/bndry_mod_base.F90

I plan on issuing a PR that will show what I had to do to get the code to run on a greater # of cores than elements. I'm not sure if my case is unique or if the same issue would occur when running ne30 on more than 5400 cores, I am running on ne4 for faster turnaround.

mt5555 commented 7 years ago

@lazarusM3B , that would be great. I thought @erichlf has this working a few months ago, but I could be wrong. As we dont test this, it could have broken again as well.

PeterCaldwell commented 7 years ago

Back on the topic of the need for faster layouts on Edison: I just remembered that there isn't a standard layout for F compsets on Edison at all. I tried to make my hyperthreaded 114 node version the default, but it failed some tests and never made it onto master...

worleyph commented 7 years ago

However, the fact that atm takes ~2x more time than ocn, yet ocn gets its own dedicated nodes which sit idle for idle_time = ocn_time - (atm_time + lnd_time + cpl_time + etc) makes me think that these layouts aren't very optimal.

@PeterCaldwell , if you look at the https://acme-climate.atlassian.net/wiki/x/fwDTBg, this inefficiency has always been part of these layouts. I don't remember why at the moment, but perhaps the original goal was to get something reasonable, at which point we stopped (because too much code was in flux to spend any more time on this)? Or because the model had changed between the time I tried these PE layouts originally and the time you started using them and posted these data?

PeterCaldwell commented 7 years ago

Interesting, Pat. I was wondering whether the ocean had just gotten a lot faster since the layouts were created, or whether this was always the case. I don't see why we didn't just reduce the total core count by reducing ocn nodes by a factor of 1/3 when we put those layouts together...

It seems like there has been a disconnect for a long time between the machine people are actually using (edison) and the machines the performance team has focused on. I hope that can stop...

worleyph commented 7 years ago

The disconnect probably started when the the performance variability on Edison went so high that there was no way that anything could be optimized. We still can't do much about system issues (as evidenced by the complete lack of success in trying to work with NERSC to remedy these issues), though the message from Helen that Az resent of I/O may help here?

In any case, I had LOTS of PE layouts that I looked at way back when, and they did not have the ocean load imbalance. I can't find any of my benchmarks runs that use the PE layouts that you posted, but I am still looking.

AaronDonahue commented 7 years ago

I have created a pull request with my fix #1393.

PeterCaldwell commented 7 years ago

I mentioned this yesterday on another thread, but creating a new PE layout for coupled Edison runs should only be done after the ocean timestep has been increased from 15 to 30 min (which I think just got onto master this second).

worleyph commented 7 years ago

@PeterCaldwell , fyi, found some performance data for the 375 node PE layout, 1 month, no restart writes:

May 18, 2016 (-compset A_WCYCL2000 -res ne30_oEC):

TOT Run Time:     705.910 seconds       22.771 seconds/mday        10.40 myears/wday
LND Run Time:      15.107 seconds        0.487 seconds/mday       485.74 myears/wday
ROF Run Time:       2.848 seconds        0.092 seconds/mday      2576.57 myears/wday
ICE Run Time:     135.677 seconds        4.377 seconds/mday        54.08 myears/wday
ATM Run Time:     528.269 seconds       17.041 seconds/mday        13.89 myears/wday
OCN Run Time:     475.673 seconds       15.344 seconds/mday        15.43 myears/wday
GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
CPL Run Time:      36.860 seconds        1.189 seconds/mday       199.08 myears/wday
CPL COMM Time:    287.009 seconds        9.258 seconds/mday        25.57 myears/wday

Feb. 20, 2017 (-compset A_WCYCL2000S -res ne30_oECv3):

TOT Run Time:     738.844 seconds       23.834 seconds/mday         9.93 myears/wday
LND Run Time:      14.774 seconds        0.477 seconds/mday       496.69 myears/wday
ROF Run Time:       2.869 seconds        0.093 seconds/mday      2557.71 myears/wday
ICE Run Time:     186.304 seconds        6.010 seconds/mday        39.39 myears/wday
ATM Run Time:     502.757 seconds       16.218 seconds/mday        14.60 myears/wday
OCN Run Time:     253.576 seconds        8.180 seconds/mday        28.94 myears/wday
GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
CPL Run Time:      56.548 seconds        1.824 seconds/mday       129.77 myears/wday
CPL COMM Time:    337.610 seconds       10.891 seconds/mday        21.74 myears/wday

So, ocean cost was halved (based on these results) during this 6 months interval, and ICE went up a little? However, I have two performance results for Feb. 20, and the first one was 8.52 SYPD, with the extra cost all in CPL Run (which tells me nothing).

Compared to one of the latest production runs:

TOT Run Time:   99111.623 seconds       33.942 seconds/mday         6.97 myears/wday
LND Run Time:    1385.063 seconds        0.474 seconds/mday       499.04 myears/wday
ROF Run Time:     240.508 seconds        0.082 seconds/mday      2873.92 myears/wday
ICE Run Time:   26895.565 seconds        9.211 seconds/mday        25.70 myears/wday
ATM Run Time:   61321.149 seconds       21.000 seconds/mday        11.27 myears/wday
OCN Run Time:   29411.619 seconds       10.072 seconds/mday        23.50 myears/wday
GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
CPL Run Time:    9872.990 seconds        3.381 seconds/mday        70.01 myears/wday
CPL COMM Time:  48149.583 seconds       16.490 seconds/mday        14.36 myears/wday

ICE, ATM, OCN, and CPL Run are all slower. OCN is masked by the others, so is not important here.

Just looking at ATM for process 0 (140160 calls)

 a:CAM_run2    4529
 a:CAM_run3   12107
 a:CAM_run4   12533
 a:CAM_run1    25328

and a:CAM_run1 has a max/min over processes of 32231 / 12220, with an average of 20957. phyiscs load balancing would likely help here.

Looking at ATM from February for process 0 (1489 calls)

 a:CAM_run2     46
 a:CAM_run3    116
 a:CAM_run4     26
 a:CAM_run1    248

so run2, run3, and run1 appear to be very similar (based on number of calls). The big difference in is run4, i.e. I/O (whether real I/O or MPI communication associated with it). Since MPI overhead appears to be the same for the other components, perhaps this is real I/O cost.

Summary: @amametjanov 's suggestion may solve the new performance problems.

philipwjones commented 7 years ago

@PeterCaldwell , WTF? "It seems like there has been a disconnect for a long time between the machine people are actually using (edison) and the machines the performance team has focused on. I hope that can stop..."

Did we not respond to your request? And which "people" are you talking about? ACME experiments are being run across 5 different machines, of which Edison is only one. Over the past few months, we have been:

fixing ACME on Titan where ACME was completely broken and which "people" were sitting dead in the water
trying to fix ACME issues on Mira so that "people" can shift high-res work there
trying to get ACME running on Cori-KNL where we had free time and "people" (including you I think?) were running and on which the fully coupled system still doesn't run effectively
getting reasonable configurations on Anvil where "people" are running We are tackling the highest priority issues - particularly issues that are preventing the model from running - that come up and dealing with others as we can squeeze them in.

PeterCaldwell commented 7 years ago

I didn't mean that the performance team was doing a bad job, just that scientists (at least on the coupled team) and performance people don't seem to be communicating effectively. Here we are, using Edison for all of our coupled low-res runs (which Dave says are the highest priority for the project) and there isn't even a default PE layout at all on Edison and the A_WCYL cases end up idling their ocean cores 50% of the time. Clearly there's something wrong.

Is it the performance team's fault - no, because nobody created a github issue until now. And I wasn't blaming the performance team. You're totally misinterpreting me. I just tried to point out the problem so we can fix it. I'm definitely to blame for not being more forceful in asking for help earlier and I was implicitly including my own behavior in the things "I hope can stop". However, I had assumed that the performance team knew that all low-res coupled runs were happening on Edison and would be looking in on it occasionally to make sure things were working well. Lack of communication. Stuff to fix.

amametjanov commented 7 years ago

A 5-day run of ne30 A_WCYCL-compset on 375 nodes of Edison went faster with FORT_BUFFERED=yes:

before (2017-03-29): 9.60 sypd, 24.655 secs/day
after (2017-04-11): 10.20 sypd, 23.207 secs/day

While the PEs are being updated to also address the issue, this can be added to $casedir/env_mach_specific.xml:

    <env name="OMP_STACKSIZE">64M</env>
    <env name="FORT_BUFFERED" compiler="intel">yes</env>
  </environment_variables>

PeterCaldwell commented 7 years ago

Excellent, thanks Az! Interestingly, the benchmark for 375 nodes on https://acme-climate.atlassian.net/wiki/display/PERF/Benchmark+Results+and+Optimal+Layouts is 10.7 SYPD, so FORT_BUFFERED may not be the only source of slowdown...

amametjanov commented 7 years ago

10.7 sypd appears to be a typo, because the timing summary shows 10.24 -- updated the confluence page.

10.24 sypd is 23.109 secs/day -- faster than 23.207 secs/day by 0.098 secs -- small difference. I think it's best to try this in production runs with IO turned on. Benchmarks turn IO off.

Also, I previously did a 300-node test identical to the 375-node layout except OCN tasks were halved to 1800 tasks on 75 nodes -- to reduce the core-hours. That test ran at 7.62 sypd, 31.054 secs/day and 45,354 cpu-hours/year. So I think the current PE layout on 375 nodes is good.

PeterCaldwell commented 7 years ago

10.7 sypd appears to be a typo, because the timing summary shows 10.24 -- updated the confluence page.

Ok, I was wondering about that.

Also, I previously did a 300-node test identical to the 375-node layout except OCN tasks were halved to 1800 tasks on 75 nodes -- to reduce the core-hours. That test ran at 7.62 sypd, 31.054 secs/day and 45,354 cpu-hours/year. So I think the current PE layout on 375 nodes is good.

This result seems inconsistent with my finding at the top of the page that atm is taking much more time than ocn and with Pat's comment about https://acme-climate.atlassian.net/wiki/x/fwDTBg ... which itself seems inconsistent with the timings he copy/pasted later on this page which seem to show that the ocean has sped up significantly over the past 6 mo. In short, I'm confused. I also think it may be worth revisiting old runs?

worleyph commented 7 years ago

@PeterCaldwell , based on your previous comments and my looking at your data, I would guess that you are doing much more I/O than what was used in the performance benchmarks. As @amametjanov suggested, we need to "benchmark" your production case. It is not similar enough to the current benchmark configurations for the benchmarks to be used for diagnosis.

golaz commented 7 years ago

@worleyph : I totally agree that it would be much more useful if the performance tuning was done with our current I/O load. Here is the run script of one of our recent simulation 20170313.beta1_04.A_WCYCL1850S.ne30_oECv3_ICG.edison

PeterCaldwell commented 7 years ago

@worleyph and @amametjanov - the need to use production cases for benchmarking sounds like a good opportunity to try something I've been wanting to do for a while... next time @golaz or I do a production run, give us a couple PE layouts to try. Then we will use a different PE layout for each job submission and you will get timing information on the case we're actually running (without wasting tons of time and core hours). Unfortunately, I'm just finishing a huge pulse of runs and I'm not sure when I'll have more long runs to try this on.

golaz commented 7 years ago

@worleyph, @amametjanov, @philipwjones : can we get your input on @PeterCaldwell suggestion above?

philipwjones commented 7 years ago

@golaz - I believe I've stated in previous discussions that @PeterCaldwell 's idea would help us to improve coverage of benchmarks. And the use of actual production configurations would result in more useful results for you all. So yes, it's a good idea.

amametjanov commented 7 years ago

There are 3 different PE layouts and 2 proposed PRs to try in production runs:

173 nodes: https://acme-climate.atlassian.net/wiki/display/PERF/A_WCYCL1850-ne30_oEC-Edison-173
375 nodes: https://acme-climate.atlassian.net/wiki/display/PERF/A_WCYCL1850-ne30_oEC-Edison-375
683 nodes: https://acme-climate.atlassian.net/wiki/display/PERF/A_WCYCL1850-ne30_oEC-Edison-683
turn on buffering: PR #1398
turn on load-balancing: PR #1397

PeterCaldwell commented 7 years ago

Hi Az - we have tons of data on the first 2 PE layouts because those are the ones we've been using operationally for ages. See https://acme-climate.atlassian.net/wiki/display/SIM/20170313.beta1_04.A_WCYCL1850S.ne30_oECv3_ICG.edison for examples of 375 node cases and https://acme-climate.atlassian.net/wiki/display/SIM/20170331.r5a_gamma0.3.ne30_oECv3_ICG.edison for examples of 173 node cases. We will try the other options.

golaz commented 7 years ago

Here is some performance data from the last three segments of my beta1_04 simulation. beta0 used to get 10+ SYPD, now I'm happy to see 7.5. But as you can see, there are long periods when the model slows down further.

Incidentally (and perhaps not coincidentally), /scratch2 on Edison is currently 90% full. @ndkeen : out of curiosity, do you know if anyone at NERSC keeps an eye on scratch usage?

timing

jonbob commented 7 years ago

I tested a smaller configuration with fewer ocn processors -- which will not speed up the model, but perhaps make it more efficient. Going from the normal 1440 pes in the 173-node configuration to 720 pes cut the ocn throughput from ~22 to ~11.9 sypd, but did not impact the total model performance. I did, however, find an issue with the mpas-seaice analysis member settings (thanks to @akturner), and making that output more reasonable increased seaice throughput from 29.6 to 36.1sypd. I had a small error in my pe-layout, so I won't be able to say anything about overall performance and cost until my next test finishes.

PeterCaldwell commented 7 years ago

Great! Thanks Jon - this is exactly what I was asking for/about when I started this thread. When you do have a new PE layout, please pass it our way - it sounds like it will be an immediate improvement upon our current 375 node config.

jonbob commented 7 years ago

Well, this one is just for the 173-node layout -- I'll try something similar on the 375-node configuration next. It' all a little slow because the machines are extremely busy. And I'll also get the new defaults for seaice analysis members merged as soon as possible.

davidcbaderatllnl commented 7 years ago

I'll take the blame for the friction here. I told Chris and Peter that the highest priority for ACME was the final low-res coupled tuning, while at the same time, Mark and I have been putting pressure on the Performance group to get the high-res model running better, because our machine options for those simulations are more limited. Can you have two "top priorities?" Sorry - everyone is working hard, so let's not circle the wagons and turn the guns inward.

PeterCaldwell commented 7 years ago

@jonbob - a 173 node configuration would be great. Dave just told me to continue my r5 run out to 100 yrs so I have plenty of opportunity to try different configurations...

jonbob commented 7 years ago

@PeterCaldwell - OK, I'll give you at least a more efficient and slightly faster layout today. I'll ask Rob about getting the seaice analysis settings onto next and master asap -- it doesn't make a huge difference, but the overall model is about 5-6% faster. And mostly the model cost should go down significantly. I tried a couple of tests to see if the atm could be sped up with the processors the ocn no longer needs, but I think that's a trickier proposition.

golaz commented 7 years ago

Pinging @mt5555, @philipwjones and @mccoy20 to make sure we don't get blamed later for not raising the issue :)

The last segment (2 years) to complete 50 years of beta1_04 broke a new "speed record": 2.84 SYPD on Edison using 375 nodes. This is a PE layout that should get more than 10+ SYPD. To state the obvious: we will never be able to deliver anything close to what we promised if the situation does not improve.

Note that I was using /scratch2 on Edison, which is currently 90% full. This is reminds me of the performance issues we experienced last spring. They were never fully understood but were also correlated with high scratch usage.

philipwjones commented 7 years ago

Thanks @golaz (well...not really, didn't want to see this particular issue come up again). Also, I forgot to respond to an earlier question above. @bmayerornl does track filesystem use as part of his monitoring of allocations.

golaz commented 7 years ago

Here is the figure that goes along. Timing plots for the last 4 segment of my beta1_04 simulation.

timing

amametjanov commented 7 years ago

Another thing to try is to request a specific file system with #SBATCH --license=$file-sys, where $file-sys is one of scratch1, scratch2, scratch3, cscratch1, project, projecta, projectb or dna (full list is at https://www.nersc.gov/users/computational-systems/cori/running-jobs/specifying-required-file-systems/).

philipwjones commented 7 years ago

@ndkeen - once more into the breach?

E3SM-Project / E3SM

Edison slow for ne30 A_WCYCL* #1387