E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

(very) slow performance on Edison #820

Closed worleyph closed 7 years ago

worleyph commented 8 years ago

This, hopefully, is a transitory issue. However, the current discussion is buried across a few different Confluence pages and in e-mails with NERSC help (though we have yet to receive a response from NERSC).

Others have seen different, but similarly bad, performance issues. The initial indication was that I/O was very slow (which has happened on Edison before), but I am seeing everything run very slowly.

Examples:

-compset FC5AV1C-01 -res ne30_oEC

a) 1 day for a 58 node PE layout (April 2):

  CPL:INIT                          40.992325
  CPL:RUN_LOOP                      53.270420
    CPL:ATM_RUN  -      48    -     44.915367     2.218216     0.782414

then today (April 5)

  CPL:INIT                          985.659424
  CPL:RUN_LOOP                     1360.697510
    CPL:ATM_RUN  -      48    -     988.505676    37.936245    12.435209

b) 2 days for a 684 node PE layout (April 3)

  CPL:INIT                         149.158356
  CPL:RUN_LOOP                      27.896242
    CPL:ATM_RUN  -      96    -     22.574593     2.544883     0.158755

then today

  CPL:INIT                           786.385925
  CPL:RUN_LOOP                      2449.291016
    CPL:ATM_RUN  -      96    -     1506.443604    39.947269     2.608297

This slowdown is very persistent (and, if anything, seems to be getting worse over the last 2 days). I can provide more details, and will look through the data to see if anything of interest pops up. In any case, Edison is currently so slow that it is practically unusable.

worleyph commented 8 years ago

Note that the above April 5 runs were using the thread affinity suggestions from Sam Williams. These did not help. Runs without them were similarly slow (but I did not capture timing data from these because they were so slow that I had not set the timing output frequency appropriately).

ndkeen commented 8 years ago

What would be the argument to create_test?

worleyph commented 8 years ago

I rarely use create_test. This is purely a performance issue, not a functionality issue (I hope).

 ./create_newcase -case FC5AV1C-L.ne30_oEC.edison_intel_58 -compset FC5AV1C-L -res ne30_oEC -mach edison -compiler intel -project acme

and then grab the 58 node PE layout from https://acme-climate.atlassian.net/wiki/x/gAEgAw . Or grab the 684 node PE layout if you prefer (in which case I would modify the case name to)

 ./create_newcase -case FC5AV1C-L.ne30_oEC.edison_intel_684 -compset FC5AV1C-L -res ne30_oEC -mach edison -compiler intel -project acme

You'll need to request at least 59 minutes to get any timing output for the 58 node run (based on today's performance).

ndkeen commented 8 years ago

I'd still like to know the create_test arg if anyone knows it. Has this been run on Cori?

rljacob commented 8 years ago

Tagging @helenhe40

jayeshkrishna commented 8 years ago

@ndkeen : You can create a case using create_newcase as mentioned above, setup+build+submit it. The create_test script is only for creating tests.

ndkeen commented 8 years ago

Was this particular configuration running fine on Edison at one point and then suddenly started running slowly? And this is not a 'test'?

jayeshkrishna commented 8 years ago

Yes, the setup above is not a "test" (it is a case). Check the timings above from Pat, which show the change in timings (not just I/O) over the past 3 days.

worleyph commented 8 years ago

Update: a new run just finished (it ran out of time) using a fresh checkout, the 684 node PE layout, and default thread affinity. It was "faster", but perhaps that is just randomness?

It made it through 4 days before running out of time; comparing against the 2-day run performance data as before:

 CPL:INIT                           450.894501
 CPL:RUN_LOOP                      1262.003052
 CPL:ATM_RUN  -      96    -      755.404846    18.712259     0.160244

so only 34 times slower in the atmosphere, instead of 67 times as in the previous run.

worleyph commented 8 years ago

Note that one of the timers with a significant discrepancy is cice_run_initmd:

cice_run_initmd 1536 1536 2.949120e+05 2.888784e+05 214.260 ( 0 0) 172.892 ( 879 0)

as compared to

cice_run_initmd 1536 1536 2.949120e+05 3.813717e+02 0.293 ( 4 0) 0.212 ( 923 0)

and this looks to be primarily calls to global_sum, so collective communication. This could be load imbalance, due to I/O for example, but, if so, then the load imbalance moves around a lot since the minimum and the maximum are not that different.

I used to have an allreduce kernel to look at these sorts of issues, but I don't have time to pull this out now. Perhaps someone at NERSC can check whether the MPI collectives have changed recently.
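(A minimal stand-in for such an allreduce kernel, for anyone who wants to measure collective cost directly, could look like the sketch below. It is illustrative only, not the original kernel; it uses mpi4py, and the message size, repetition count, and script name are arbitrary.)

    # Illustrative MPI_Allreduce timing kernel (not the original kernel).
    # Run with e.g.: srun -n 1536 python allreduce_kernel.py  (script name is hypothetical)
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    nreps = 100
    sendbuf = np.ones(1, dtype="d")    # a single double, like a global sum
    recvbuf = np.empty(1, dtype="d")

    comm.Barrier()                     # remove incoming load imbalance so the loop measures collective cost
    t0 = MPI.Wtime()
    for _ in range(nreps):
        comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
    t1 = MPI.Wtime()

    tmax = comm.allreduce(t1 - t0, op=MPI.MAX)
    if comm.rank == 0:
        print(f"max time over ranks for {nreps} allreduces: {tmax:.6f} s")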

ndkeen commented 8 years ago

From Pat:

 JobId=311413 was slow
 JobId=310901 was slow
 JobId=310747 was slow
 JobId=303364 was slow
 JobId=302649 was fast

Original NERSC incident number: INC0082528

rljacob commented 8 years ago

Reminder: When Cori mysteriously started to be slow, it was ultimately linked to an Infiniband cable connection: https://github.com/ACME-Climate/ACME/issues/593#issuecomment-173759296

ndkeen commented 8 years ago

I emailed NERSC again just to see if they had heard of anything that might explain this (or complaints from other users). Woo-Sun (a NERSC consultant) answered and I just spoke with him. Guess who else recently complained about this (very slow/variable timing)? A CESM user. Woo-Sun is trying to gather information to present to the Edison folks.

ndkeen commented 8 years ago

I ran what Pat suggested and it seems to run quickly. jobid=312493

Init Time   :     361.193 seconds 
Run Time    :     164.853 seconds       32.971 seconds/day 
Final Time  :       0.476 seconds 

golaz commented 8 years ago

Here is some timing info from coupled low-res (ne30_oEC) simulations from @tangq and myself on Edison. The top panel shows the timing for each simulated day; the bottom panel shows the integrated throughput (faster runs have smaller slopes).

There is a lot of variability between one simulation and another (even with identical core numbers), and also within a given simulation. In the most extreme case, it took over an hour to complete a single simulated day.

[figure: per-day timing (top) and integrated throughput (bottom) for the coupled ne30_oEC runs on Edison]

ndkeen commented 8 years ago

Any tips on extracting the number of seconds elapsed per timestep from the timing data?

golaz commented 8 years ago

@ndkeen, I found the info in the cpl.log.* file. It looks like:

 tStamp_write: model date = 10723 0 wall clock = 2016-04-05 15:21:25 avg dt = 376.15 dt = 510.71
 tStamp_write: model date = 10724 0 wall clock = 2016-04-05 15:29:34 avg dt = 376.70 dt = 489.11

The relevant numbers are the model simulated date (10723) and the last dt, which is the wall-clock time in seconds for that simulated day.

Here is my python script to generate the plot above

timing.py.txt
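(For reference, a minimal parsing sketch along the same lines, assuming the cpl.log format shown above; the attached timing.py is the authoritative version.)

    # Collect wall-clock seconds per simulated day ("dt") from tStamp_write lines in a coupler log.
    # Minimal sketch; field positions are assumed from the cpl.log excerpt above.
    import re

    def parse_cpl_log(path):
        days, dts = [], []
        pattern = re.compile(r"model date =\s*(\d+).*\bdt =\s*([\d.]+)\s*$")
        with open(path) as f:
            for line in f:
                if "tStamp_write" in line:
                    m = pattern.search(line)
                    if m:
                        days.append(int(m.group(1)))
                        dts.append(float(m.group(2)))
        return days, dts

    # Example usage (log file name is hypothetical; point it at your run's cpl.log):
    days, dts = parse_cpl_log("cpl.log")
    print(f"{len(dts)} days, mean dt = {sum(dts)/len(dts):.1f} s, max dt = {max(dts):.1f} s")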

ndkeen commented 8 years ago

Thanks much!

edison10% grep tStamp cpl.log.160405-140314
 tStamp_write: model date =    10102       0 wall clock = 2016-04-05 14:10:39 avg dt =    37.72 dt =    37.72
 tStamp_write: model date =    10103       0 wall clock = 2016-04-05 14:11:05 avg dt =    31.65 dt =    25.58
 tStamp_write: model date =    10104       0 wall clock = 2016-04-05 14:11:30 avg dt =    29.61 dt =    25.54
 tStamp_write: model date =    10105       0 wall clock = 2016-04-05 14:11:56 avg dt =    28.66 dt =    25.80
 tStamp_write: model date =    10106       0 wall clock = 2016-04-05 14:12:46 avg dt =    32.94 dt =    50.06

I guess I need to run longer to get more data.

worleyph commented 8 years ago

> Reminder: When Cori mysteriously started to be slow, it was ultimately linked to an Infiniband cable connection: #593 (comment)

I keep forgetting that this was Cori and not Edison. In any case, I think that they later decided that this was not the diagnosis after all.

worleyph commented 8 years ago

> I ran what Pat suggested and it seems to run quickly. jobid=312493

Lucky you. My latest run, with MPICH_COLL_OPT_OFF == 1, was "between" my two other slow runs, so still slow.

 CPL:INIT                           888.324280
 CPL:RUN_LOOP                      1600.302612
 CPL:ATM_RUN  -      96    -      935.837524    24.422390     1.014124

ndkeen commented 8 years ago

Do we know whether each line in the nice plot above is from a single batch job, or from a simulation that was completed across a few batch jobs (through restarts) that used different sets of compute nodes? Also, if you have job IDs, that tends to be a frequent question NERSC will ask.

ndkeen commented 8 years ago

Can I reduce the size of the problem so that I can use fewer nodes and get in/out of the queue faster?

worleyph commented 8 years ago

I saw the slow performance using the 58-node PE layout available from https://acme-climate.atlassian.net/wiki/x/gAEgAw. I assume that this is not that sensitive to the number of nodes.

golaz commented 8 years ago

@ndkeen, to clarify, each line in the plot above is a separate batch submission. The bluish lines are the same model configuration (153 cores). The first one (light blue) died after 6 months without producing restart files. It was rerun (dark blue) and ran for 9 months (but at a much slower pace). It was then restarted from the existing restart files (blue-purple). That third submission ran even slower.

tangq commented 8 years ago

I think @golaz means "nodes" when using "cores" on the plot.

golaz commented 8 years ago

Thanks @tangq , my mistake. I meant "nodes" in the plot and comments above. Sorry for the confusion.

worleyph commented 8 years ago

Looking at the data more closely, this almost looks like a thread placement issue? Many things are slower.

For example, the reproducible sum has the following logic:

a) compute the maximum and minimum exponents across all summands (an allreduce, so essentially a barrier)
b) local summation algorithm
c) final allreduce to compute the global sum

(a) captures all load imbalance coming in, and the summation is pretty inexpensive, so (c) is a good measure of true allreduce cost.

In a fast run from a few days ago (58 nodes) for one simulated day:

                                     max over processes           min over processes
 (a) repro_sum_allr_minmax           0.377                        0.059 
 (b) repro_sum_loopb               < 0.001                      < 0.001
 (c) repro_sum_allr_i8               0.003                        0.002

And in a recent slow run for one simulated day:

 (a) repro_sum_allr_minmax          59.897                      34.912
 (b) repro_sum_loopb                 1.001                       0.189
 (c) repro_sum_allr_i8              26.009                      21.564
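(To make the three phases concrete, here is an illustrative mpi4py sketch of the same communication pattern; it is not the actual repro_sum implementation, just the shape of what the three timers above measure.)

    # Illustrative sketch of the three repro_sum phases (not the actual implementation).
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    local_vals = np.random.rand(1000)                       # this rank's summands

    # (a) allreduce of max/min exponents: effectively a barrier, so it absorbs incoming load imbalance
    exps = np.frexp(local_vals)[1]
    max_exp = comm.allreduce(int(exps.max()), op=MPI.MAX)
    min_exp = comm.allreduce(int(exps.min()), op=MPI.MIN)   # would size the integer representation; shown only for the pattern

    # (b) local summation in a fixed power-of-two scaling shared by all ranks (cheap, order-independent)
    scale = 2.0 ** (52 - max_exp)
    local_sum = sum(int(round(float(v) * scale)) for v in local_vals)

    # (c) final allreduce of the integer partial sums: a good measure of "true" allreduce cost
    global_sum = comm.allreduce(local_sum, op=MPI.SUM) / scale
    if comm.rank == 0:
        print("reproducible global sum:", global_sum)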

douglasjacobsen commented 8 years ago

@worleyph Do you mean OpenMP thread affinity to MPI tasks?

If so, do you know if you see the same performance strangeness when OpenMP is disabled?

worleyph commented 8 years ago

> @worleyph Do you mean OpenMP thread affinity to MPI tasks?

I don't know what I mean - grasping at straws here. It does kind of remind me of the "old" days, when the Intel compiler and runtime assigned all threads to a single core if you did not do the magic incantation. Worth trying an MPI-only run, just to be sure. Some of my experiments included even more specific thread affinity and binding instructions than the default, but this changed nothing.

douglasjacobsen commented 8 years ago

Is OpenMP enabled in all of these runs on edison?

If so, you could look at timings from the MPAS components, since they don't have OpenMP enabled. That would tell you if your theory is something to explore more at least.

Meaning, if you see the ocean (for example) having drastically different timings, then it's not related to thread affinity; whereas if the ocean remains relatively consistent but the atmosphere is all over the place, then thread affinity could be an explanation.

worleyph commented 8 years ago

My experiments are all FC5AV1C-L or FC5AV1C-01. Maybe @golaz's runs qualify. I'll go ahead and build MPI-only for FC5AV1C-L (and without hyperthreading) and see what happens.

ndkeen commented 8 years ago

I ran a few different experiments, including some with basic profiling via CrayPat, and I see steps on the order of 28 seconds. Is that "normal" or still slow? I do see that the last step took 2x as long as the first 4 -- however, that was a step in which a restart file was written, so if that time includes writing largish files, it could make sense.

 tStamp_write: model date =    10102       0 wall clock = 2016-04-06 05:55:24 avg dt =    28.44 dt =    28.44
 tStamp_write: model date =    10103       0 wall clock = 2016-04-06 05:55:51 avg dt =    27.48 dt =    26.52
 tStamp_write: model date =    10104       0 wall clock = 2016-04-06 05:56:18 avg dt =    27.25 dt =    26.77
 tStamp_write: model date =    10105       0 wall clock = 2016-04-06 05:56:43 avg dt =    26.77 dt =    25.35
 tStamp_write: model date =    10106       0 wall clock = 2016-04-06 05:57:43 avg dt =    33.41 dt =    59.98

 create_newcase -case f-craypat -compset FC5AV1C-L -res ne30_oEC -mach edison -compiler intel -project acme

Everything else is default.

worleyph commented 8 years ago

@ndkeen, what PE layout (and compset, just to be sure)?

worleyph commented 8 years ago

@ndkeen, I don't remember what the default PE layout is, but 25-second timesteps means 20 minutes per simulated day, which is around 0.2 SYPD, so that sounds really slow to me.

rljacob commented 8 years ago

@ndkeen is saying "steps" but what he's showing is the time per simulated day. So 28 seconds for a day.

worleyph commented 8 years ago

Thank you. Should have noticed that. So around 9 SYPD. That is not a "slow" run compared to the other results. My per timestep costs were between 12 and 38 seconds in one of my runs.
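(For reference, the back-of-the-envelope conversion from the coupler's per-day "dt" to SYPD used here:)

    # Convert wall-clock seconds per simulated day (the cpl.log "dt") to simulated years per wall-clock day.
    seconds_per_sim_day = 28.0                  # from the tStamp_write lines above
    sim_days_per_wall_day = 86400.0 / seconds_per_sim_day
    sypd = sim_days_per_wall_day / 365.0
    print(f"{sypd:.1f} SYPD")                   # ~8.5, i.e. "around 9 SYPD"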

ndkeen commented 8 years ago

OK, I see you are looking at a rate. I ran another run with STOP_N set to 500, and the dt values in the tStamp lines of the cpl log are all about the same; the average over all 500 lines is about 27. Assuming this means I'm not running slow, I need to be able to reproduce the slowdown.

I can keep trying and/or see whether I'm doing anything differently from others.

I am running with a recent version of next, but that shouldn't matter right?

worleyph commented 8 years ago

Or perhaps the problem has disappeared and everyone will be "fast" now. Similar things have happened in the past. @ndkeen , perhaps someone else should run again, to see if they still see poor performance. My MPI-only job is still in the queue, so I can't help. And if it is fast, we will still need to run a hybrid MPI/OpenMP job.

worleyph commented 8 years ago

@ndkeen, my slow 58 node hybrid MPI/OpenMP job is still slow. The MPI-only job is still waiting to run.

golaz commented 8 years ago

I had a coupled run that was very slow. I have another one that just started and also seems very slow.

worleyph commented 8 years ago

My MPI-only run ran. It was fast enough to produce timing output, but still very slow. I do not have directly comparable fast run data, but

58 nodes: (fast), 5 days

TOT Run Time:     259.614 seconds       51.923 seconds/mday         4.56 myears/wday

225 nodes: MPI-only (slow), 5 days

TOT Run Time:    2533.762 seconds      506.752 seconds/mday         0.47 myears/wday

684 nodes: (fast), 1 month

TOT Run Time:     425.328 seconds       13.720 seconds/mday        17.25 myears/wday

The MPI-only run was 5400x1, and the same for all components (so sequential, not concurrent). Even in this case

    CPL:ocnpre1_atm2ocn                               -      240    -     290.571564

    i:cice_run_initmd                                 -      240    -     189.912659

  l:clm_run                                           -      241    -     163.140518

(but the timers do not indicate where in l:clm_run. It is NOT in I/O.)

              a:repro_sum_allr_minmax                 -     1495    -     102.437363     
              a:repro_sum_loopb                       -     1495    -       0.015330     
              a:repro_sum_allr_i8                     -     1495    -      29.266060     

So loopb cost was small (as it should be), but repro_sum_allr_i8 is still large.

Nothing really jumps out at me, except that I/O does not appear to be a significant factor, if it is one at all. Threading may make things worse, but MPI-only is still pretty bad.

golaz commented 8 years ago

[figure: updated per-day timing and throughput on Edison; all simulations use 40 nodes without threading]

Here is an updated performance figure on Edison. All simulations are using 40 nodes without threading.

ndkeen commented 8 years ago

Here is a plot from my run of 500 units. I am also trying to run with a profiler.

[figure: per-day timing from the 500-unit run]

worleyph commented 8 years ago

@ndkeen

> I am running with a recent version of next, but that shouldn't matter right?

Just curious why. Any problem using master?

golaz commented 8 years ago

@ndkeen, that looks very good to me. You don't seem to have encountered any slow days (yet). If this throughput can be maintained, you'd be close to 8 SYPD, which sounds good given that 40 nodes should give 2 SYPD. I'd be so thrilled to get that kind of throughput! Are you using one of the layouts from @worleyph or a different one?

helenhe40 commented 8 years ago

Sorry, I am on vacation this week in a place with very rare internet access. I wrote in Qi's original NERSC ticket to a group of people last Friday, before I left, that Noel will work on collecting more data on whether the slowness is I/O related or not. If it is not, I mentioned that there was an investigation by Cray into CESM performance variation, and they thought other network-intensive applications can cause victim applications to slow down via non-optimal adaptive routing. Cray suggested we use large pages by default, but we are not confident that memory fragmentation will not be a problem. The current slowness was reported to Cray by Woo-Sun on Monday, and Cray will be looking for aggressor applications run recently on Edison.

Sorry, I will not have wifi again for the next few days. I will be back in the office on Monday to follow up.

Helen

ndkeen commented 8 years ago

I am using the default layout. One difference is that I changed the max number of tasks per node to 24 (per https://github.com/ACME-Climate/ACME/pull/818). It's worth exploring any differences between the way I'm running and others to be sure, but my hunch is that I've just been lucky so far.

I'm running another 20-node run using the env_mach_pes.xml I attached. After 30 units of simulation time, I don't see any blips yet. The average dt is ~146 and all dt values are near this average.

env_mach_pes.xml.zip

I'm also running with CrayPat-built executables to profile the run, but if it does not exhibit slowdowns, I might not get much useful information for debugging this particular problem.

Certainly there are discussions at NERSC about this issue, as others have reported slowdowns (many, if not all, of them with a member of the CESM family as the application).

tangq commented 8 years ago

I'll try halving the max number of tasks per node on my run to see if the problem goes away.

ndkeen commented 8 years ago

Additional information from Po-Lun Ma working with Phil Rasch:

  1. We have 2 identical model simulations (labeled "v1ac15" and "va00") that were compiled and run on Edison and produced bit-for-bit (b4b) identical results (a 17-month simulation).
  2. One of the simulations ("va00") shows that some of the history files (model output) are corrupted (note that the model run itself is not affected).
  3. In the corrupted history file, a lot of (but not all) fields are corrupted. Some of the corrupted fields are 1-dimensional (n-column) and the others are 2-dimensional (n-lev, n-column).

My questions: Is this potentially related to the slowness? And is there a good way to easily determine if a file is corrupted so that I can verify if mine are?

golaz commented 8 years ago

@ndkeen, I tried a simulation where I changed the max number of tasks per node to 24. Unfortunately, it did not help in my case; it is still running very slowly.