Note that the above April 5 runs were using the thread affinity suggestions from Sam Williams. These did not help. Runs without them were similarly slow (but I did not capture timing data from those because they were so slow that I had not set the timing output frequency appropriately).
What would be the argument to create_test?
I rarely use create_test. This is purely a performance issue, not a functionality one (I hope).
./create_newcase -case FC5AV1C-L.ne30_oEC.edison_intel_58 -compset FC5AV1C-L -res ne30_oEC -mach edison -compiler intel -project acme
and then grab the 58 node PE layout from https://acme-climate.atlassian.net/wiki/x/gAEgAw . Or grab the 684 node PE layout if you prefer (in which case I would modify the case name to)
./create_newcase -case FC5AV1C-L.ne30_oEC.edison_intel_684 -compset FC5AV1C-L -res ne30_oEC -mach edison -compiler intel -project acme
You'll need to request at least 59 minutes to get any timing output for the 58 node run (based on today's performance).
I'd still like to know the create_test arg if anyone knows it. Has this been run on Cori?
Tagging @helenhe40
@ndkeen: You can create a case using create_newcase as mentioned above, then set it up, build it, and submit it. The create_test script is only for creating tests.
Was this particular configuration running fine on Edison at one point and then suddenly started running slowly? So this is not a 'test'?
Yes, the setup above is not a "test" but a case. Check the timings above from Pat, which show the change in timings (not just I/O) over the past 3 days.
Update: a new test just finished (ran out of time) using a fresh checkout, 684 node PE layout, and default thread affinity. It was "faster", but perhaps just randomness?
Made it through 4 days before running out of time, but comparing to 2 day run performance data as before:
CPL:INIT 450.894501
CPL:RUN_LOOP 1262.003052
CPL:ATM_RUN - 96 - 755.404846 18.712259 0.160244
so only 34 times slower in the atmosphere, instead of 67 times as in the previous run.
Note that one of the timers with a significant discrepancy is cice_run_initmd:
cice_run_initmd 1536 1536 2.949120e+05 2.888784e+05 214.260 ( 0 0) 172.892 ( 879 0)
as compared to
cice_run_initmd 1536 1536 2.949120e+05 3.813717e+02 0.293 ( 4 0) 0.212 ( 923 0)
and this looks to be primarily calls to global_sum, so collective communication. This could be load imbalance, due to I/O for example, but, if so, then the load imbalance moves around a lot since the minimum and the maximum are not that different.
I used to have an allreduce kernel to look at these sorts of issues, but I don't have time to pull this out now. Perhaps someone at NERSC can check whether the MPI collectives have changed recently.
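For anyone who wants a quick check, here is a minimal mpi4py sketch of the kind of allreduce timing kernel Pat describes. This is an assumption of what such a kernel looks like, not his original code; the script name and launch command are hypothetical.

```python
# Minimal MPI_Allreduce timing kernel (mpi4py). Hypothetical sketch, not the
# original kernel mentioned above; launch with e.g. `srun -n 1536 python allred.py`.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nreps = 100
buf = np.ones(8, dtype='d')        # small message, typical of global sums
out = np.empty_like(buf)

comm.Barrier()                     # start all tasks together
t0 = MPI.Wtime()
for _ in range(nreps):
    comm.Allreduce(buf, out, op=MPI.SUM)
t1 = MPI.Wtime()

# reduce the per-call time to see the spread across tasks
local = np.array([(t1 - t0) / nreps], dtype='d')
tmin = np.empty(1)
tmax = np.empty(1)
comm.Allreduce(local, tmin, op=MPI.MIN)
comm.Allreduce(local, tmax, op=MPI.MAX)
if rank == 0:
    print("allreduce time per call: min %.3e s  max %.3e s" % (tmin[0], tmax[0]))
```

If the collectives themselves had regressed, the min time here would also be large; if the min stays small while the max is large, the problem looks more like load imbalance or placement.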
From Pat:
JobId=311413 was slow
JobId=310901 was slow
JobId=310747 was slow
JobId=303364 was slow
JobId=302649 was fast
Original NERSC incident number: INC0082528
Reminder: When Cori mysteriously started to be slow, it was ultimately linked to an Infiniband cable connection: https://github.com/ACME-Climate/ACME/issues/593#issuecomment-173759296
I emailed NERSC again just to see if they had heard of anything that might explain this (or complaints from other users). Woo-Sun (a NERSC consultant) answered and I just spoke with him. Guess who else recently complained about this (very slow/variable timing)? A CESM user. Woo-Sun is trying to gather information to present to the Edison folks.
I ran what Pat suggested and it seems to run quickly. jobid=312493
Init Time : 361.193 seconds
Run Time : 164.853 seconds 32.971 seconds/day
Final Time : 0.476 seconds
Here is some timing info from coupled low-res (ne30_oEC) simulations from @tangq and myself on Edison. The top panel shows the timing for each simulated day. The bottom panel shows the integrated throughput; faster runs have smaller slopes.
There is a lot of variability between one simulation and another (even with identical core numbers), and also within a given simulation. In the most extreme case, it took over an hour to complete a single simulated day.
Any tips on extracting the number of seconds elapsed per timestep from the timing data?
@ndkeen, I found the info in the cpl.log.* file. It looks like:
tStamp_write: model date = 10723 0 wall clock = 2016-04-05 15:21:25 avg dt = 376.15 dt = 510.71
tStamp_write: model date = 10724 0 wall clock = 2016-04-05 15:29:34 avg dt = 376.70 dt = 489.11
The relevant numbers are the simulated model date (10723) and the last dt, which is the elapsed wallclock time in seconds for that simulated day.
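Concretely, a minimal sketch of extracting those two values (based on the tStamp_write format shown above; this is not the attached plotting script, and the log filename is just an example):

```python
# Extract (model date, dt) pairs from a coupler log. Sketch only, assuming
# the tStamp_write format shown above.
import re

pat = re.compile(r'tStamp_write: model date =\s*(\d+).*avg dt =\s*([\d.]+)\s*dt =\s*([\d.]+)')

dates, dts = [], []
with open('cpl.log.160405-140314') as f:
    for line in f:
        m = pat.search(line)
        if m:
            dates.append(int(m.group(1)))     # simulated model date, e.g. 10723
            dts.append(float(m.group(3)))     # wallclock seconds for that simulated day

if dts:
    print("days: %d  mean dt: %.1f s  max dt: %.1f s"
          % (len(dts), sum(dts) / len(dts), max(dts)))
```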
Here is my python script to generate the plot above
Thanks much!
edison10% grep tStamp cpl.log.160405-140314
tStamp_write: model date = 10102 0 wall clock = 2016-04-05 14:10:39 avg dt = 37.72 dt = 37.72
tStamp_write: model date = 10103 0 wall clock = 2016-04-05 14:11:05 avg dt = 31.65 dt = 25.58
tStamp_write: model date = 10104 0 wall clock = 2016-04-05 14:11:30 avg dt = 29.61 dt = 25.54
tStamp_write: model date = 10105 0 wall clock = 2016-04-05 14:11:56 avg dt = 28.66 dt = 25.80
tStamp_write: model date = 10106 0 wall clock = 2016-04-05 14:12:46 avg dt = 32.94 dt = 50.06
I guess I need to run longer to get more data.
Reminder: When Cori mysteriously started to be slow, it was ultimately linked to an Infiniband cable connection: #593 (comment)
I keep forgetting that this was Cori and not Edison. In any case, I think that they later decided that this was not the diagnosis after all.
I ran what Pat suggested and it seems to run quickly. jobid=312493
Lucky you. My latest run, with MPICH_COLL_OPT_OFF=1, was "between" my two other slow runs, so still slow.
CPL:INIT 888.324280
CPL:RUN_LOOP 1600.302612
CPL:ATM_RUN - 96 - 935.837524 24.422390 1.014124
Do we know whether each line in the nice plot above is from a single batch job, or from a simulation case that was completed across several batch jobs (through restarts) using different sets of compute nodes? Also, if you have job IDs, that tends to be a frequent question NERSC will ask.
Can I reduce the size of the problem so that I can use fewer nodes and get in/out of the queue faster?
I saw the slow performance using the 58 node PE layout available from https://acme-climate.atlassian.net/wiki/x/gAEgAw. I assume that the problem is not that sensitive to the number of nodes.
@ndkeen, to clarify, each line in the plot above is a separate batch submission. The bluish lines are the same model configuration (153 cores). The first one (light blue) died after 6 months without producing restart files. It was rerun (dark blue) and ran for 9 months (but at a much slower pace). It was then restarted from the existing restart files (blue-purple). That third submission ran even slower.
I think @golaz means "nodes" when using "cores" on the plot.
Thanks @tangq , my mistake. I meant "nodes" in the plot and comments above. Sorry for the confusion.
Looking at the data more closely, this almost looks like a thread placement issue? Many things are slower.
For example, the reproducible sum has the following logic:
a) compute maximum and minimum exponents across all summands (an allreduce, so essentially a barrier)
b) local summation algorithm
c) final allreduce to compute the global sum
(a) captures all load imbalance coming in, and the summation is pretty inexpensive, so (c) is a good measure of true allreduce cost.
In a fast run from a few days ago (58 nodes) for one simulated day:
max over processes min over processes
(a) repro_sum_allr_minmax 0.377 0.059
(b) repro_sum_loopb < 0.001 < 0.001
(c) repro_sum_allr_i8 0.003 0.002
And in a recent slow run for one simulated day:
(a) repro_sum_allr_minmax 59.897 34.912
(b) repro_sum_loopb 1.001 0.189
(c) repro_sum_allr_i8 26.009 21.564
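For context on what those three timers bracket, here is a highly simplified mpi4py sketch of the pattern. It is illustrative only, not the actual repro_sum implementation; the scaling constant and array sizes are arbitrary.

```python
# Schematic of the reproducible-sum communication pattern (illustrative only):
# (a) allreduce for max/min exponents, (b) cheap local fixed-point summation,
# (c) allreduce of the integer partial sums.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
x = np.random.rand(1000) + 1.0e-6          # local summands (nonzero for simplicity)

# (a) max/min exponent across all tasks -- essentially a barrier, so it
#     absorbs whatever load imbalance each task carries into the sum
local_exp = np.array([np.frexp(np.abs(x).max())[1],
                      -np.frexp(np.abs(x).min())[1]], dtype='i8')
global_exp = np.empty_like(local_exp)
comm.Allreduce(local_exp, global_exp, op=MPI.MAX)
max_exp = int(global_exp[0])               # (the real algorithm also uses the min)

# (b) local summation relative to the common exponent -- inexpensive
scaled = np.ldexp(x, -max_exp)             # all values now have magnitude < 1
local_sum = np.array([np.sum(np.int64(scaled * 2.0**40))], dtype='i8')

# (c) final allreduce of the integer partial sums -- since (a) already
#     synchronized everyone, this timer measures true allreduce cost
global_sum = np.empty_like(local_sum)
comm.Allreduce(local_sum, global_sum, op=MPI.SUM)

result = np.ldexp(float(global_sum[0]) / 2.0**40, max_exp)
if comm.Get_rank() == 0:
    print("reproducible sum:", result)
```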
@worleyph Do you mean OpenMP thread affinity to MPI tasks?
If so, do you know if you see the same performance strangeness when OpenMP is disabled?
@worleyph Do you mean OpenMP thread affinity to MPI tasks?
I don't know what I mean; I'm grasping at straws here. It does kind of remind me of the "old" days when the Intel compiler and runtime assigned all threads to a single core if you did not do the magic incantation. Worth trying an MPI-only run, just to be sure. Some of my experiments included even more specific thread affinity and binding instructions than the default, but this changed nothing.
Is OpenMP enabled in all of these runs on edison?
If so, you could look at timings from the MPAS components, since they don't have OpenMP enabled. That would tell you if your theory is something to explore more at least.
Meaning, if you see the ocean (for example) having drastically different timings then it's not related to thread affinity. While if the ocean remains relatively consistent, but the atmosphere is all over the place, then thread affinity could be an explanation.
My experiments are all FC5AV1C-L or FC5AV1C-01. Maybe @golaz 's runs qualify. I'll go ahead and build mpi-only for FC5AV1C-L (and without hyperthreading) and see what happens.
I ran a few different experiments, including some basic profiling with CrayPat, and I see steps on the order of 28 seconds. Is that "normal" or still slow? I do see that the last step was 2x the first 4; however, that was a step in which a restart file was written, so if that time includes writing largish files, it could make sense.
tStamp_write: model date = 10102 0 wall clock = 2016-04-06 05:55:24 avg dt = 28.44 dt = 28.44
tStamp_write: model date = 10103 0 wall clock = 2016-04-06 05:55:51 avg dt = 27.48 dt = 26.52
tStamp_write: model date = 10104 0 wall clock = 2016-04-06 05:56:18 avg dt = 27.25 dt = 26.77
tStamp_write: model date = 10105 0 wall clock = 2016-04-06 05:56:43 avg dt = 26.77 dt = 25.35
tStamp_write: model date = 10106 0 wall clock = 2016-04-06 05:57:43 avg dt = 33.41 dt = 59.98
create_newcase -case f-craypat -compset FC5AV1C-L -res ne30_oEC -mach edison -compiler intel -project acme
Everything else was left at the default.
@ndkeen, what PE layout (and compset, just to be sure)?
@ndkeen, I don't remember what the default PE layout is, but 25 second timesteps means 20 minutes per simulated day, which is around 0.2 SYPD, so sounds really slow to me.
@ndkeen is saying "steps" but what he's showing is the time per simulated day. So 28 seconds for a day.
Thank you. Should have noticed that. So around 9 SYPD. That is not a "slow" run compared to the other results. My per timestep costs were between 12 and 38 seconds in one of my runs.
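To make that conversion explicit (treating the coupler dt as wallclock seconds per simulated day; a sketch, not part of the model scripts):

```python
# Convert wallclock seconds per simulated day (the coupler "dt") into
# simulated years per wallclock day (SYPD).
def sypd(seconds_per_simulated_day):
    simulated_days_per_wall_day = 86400.0 / seconds_per_simulated_day
    return simulated_days_per_wall_day / 365.0

print("%.1f SYPD" % sypd(28.0))   # ~8.5 SYPD for the ~28 s/day shown above
print("%.1f SYPD" % sypd(52.0))   # ~4.6 SYPD, consistent with the 58-node run below
```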
OK, I see you are looking at a rate. I ran another run with STOP_N=500, and the dt values in the tStamp lines of the cpl log are all about the same. The average over all 500 lines is about 27. Assuming this means I'm not running slow, I need to be able to reproduce the slowdown.
I can keep trying and/or try to see how I'm doing things any different than others.
I am running with a recent version of next, but that shouldn't matter right?
Or perhaps the problem has disappeared and everyone will be "fast" now. Similar things have happened in the past. @ndkeen , perhaps someone else should run again, to see if they still see poor performance. My MPI-only job is still in the queue, so I can't help. And if it is fast, we will still need to run a hybrid MPI/OpenMP job.
@ndkeen, my slow 58 node hybrid MPI/OpenMP job is still slow. The MPI-only job is still waiting to run.
I had a coupled run that was very slow. I have another one that just started and also seems very slow.
My MPI-only run ran. It was fast enough to produce timing output, but still very slow. I do not have directly comparable fast run data, but
58 nodes: (fast), 5 days
TOT Run Time: 259.614 seconds 51.923 seconds/mday 4.56 myears/wday
225 nodes: MPI-only (slow), 5 days
TOT Run Time: 2533.762 seconds 506.752 seconds/mday 0.47 myears/wday
684 nodes: (fast), 1 month
TOT Run Time: 425.328 seconds 13.720 seconds/mday 17.25 myears/wday
The MPI-only run was 5400x1, and the same for all components (so sequential, not concurrent). Even in this case
CPL:ocnpre1_atm2ocn - 240 - 290.571564
i:cice_run_initmd - 240 - 189.912659
l:clm_run - 241 - 163.140518
(but the timers do not indicate where in l:clm_run the time is spent. It is NOT in I/O.)
a:repro_sum_allr_minmax - 1495 - 102.437363
a:repro_sum_loopb - 1495 - 0.015330
a:repro_sum_allr_i8 - 1495 - 29.266060
So loopb cost was small (as it should be), but repro_sum_allr_i8 is still large.
Nothing really jumps out at me, except that I/O does not appear to be a significant factor, if at all. Threading may make things worse? But MPI-only is still pretty bad.
Here is an updated performance figure on Edison. All simulations are using 40 nodes without threading.
Here is a plot from my run of 500 units. I am also trying to run with a profiler.
@ndkeen
I am running with a recent version of next, but that shouldn't matter right?
Just curious why. Any problem using master?
@ndkeen, that looks very good to me. You don't seem to have encountered any slow days (yet). If this throughput can be maintained, you'd be close to 8 SYPD, which sounds good given that 40 nodes should give 2 SYPD. I'd be so thrilled to get that kind of throughput! Are you using one of the layouts from @worleyph or a different one?
Sorry, I am on vacation this week in a place with very rare internet access. I wrote in Qi's original NERSC ticket to a group of people last Friday, before I left, that Noel would work on collecting more data on whether the slowness is I/O related or not. If it is not, I mentioned that there was an investigation by Cray into CESM performance variability; they thought other network-intensive applications can cause victim applications to slow down through non-optimal adaptive routing. Cray suggested we use large pages by default, but we are not confident that memory fragmentation would not be a problem. The current slowness was reported to Cray by Woo-Sun on Monday, and Cray will be looking for aggressor applications that ran recently on Edison.
Sorry, I will not have wifi again for the next few days. I will be back in the office on Monday to follow up.
Helen
I am using the default layout. One difference is that I changed the max number of tasks per node to 24 (per https://github.com/ACME-Climate/ACME/pull/818). It's worth exploring any differences between the way I'm running and others to be sure, but my hunch is that I've just been lucky so far.
I'm running another 20 node run using the pes.xml I attached. After 30 units of simulation time, I don't see any blips yet. The average dt is ~146 and all dt's are near this average.
I'm also running with craypat-built executables to profile the run, but if it does not exhibit slowdowns, I might not get much useful info in debugging this particular problem.
Certainly there are discussions at NERSC about this issue as there have been others reporting slowdowns (many, if not all of these were with a member of the CESM family as the application)
I'll try halving the max number of tasks per node on my run to see if the problem goes away.
Additional information from Po-Lun Ma working with Phil Rasch:
My questions: Is this potentially related to the slowness? And is there a good way to easily determine if a file is corrupted so that I can verify if mine are?
@ndkeen, I tried a simulation where I changed the max number of tasks per node to 24. Unfortunately, it did not work in my case. Still running very slowly.
This, hopefully, is a transitory issue. However, the current discussion is buried across a few different Confluence pages and in e-mails with NERSC help (though we have yet to receive a response from NERSC).
Others have seen different, but all bad, performance issues. The initial indication was that I/O was very slow (which has happened on Edison before), but I am seeing everything being very slow.
Examples:
-compset FC5AV1C-01 -res ne30_oEC
a) 1 day for a 58 node PE layout (April 2):
then today (April 5)
b) 2 days for a 684 node PE layout (April 3)
then today
This is very persistent (and seems to be getting worse over the last 2 days, if anything). I can provide more details, and will look through the data to see if anything pops up of interest. In any case, Edison is so slow now that it is practically unusable at the moment.