Closed: dongahn closed this issue 7 years ago.
Ok. I talked to @tpatki. She will propose an initial set of high-level performance factors as part of flux-sched's #166 and refine instrumentation points before finalizing her implementation.
The initial high-level performance factors may look like:
- load_resrc (Time)

She will propose ways to aggregate those times over configurable numbers of jobs (i.e., emitting a vector of these times every k number of jobs).
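To make the aggregation idea concrete, here is a minimal sketch (not flux-sched code; the factor names and class are hypothetical): accumulate per-job timings for each performance factor and emit a vector of accumulated times every k jobs.

```python
# Sketch: accumulate per-job factor timings and emit a vector every k jobs.
from collections import defaultdict

class FactorAggregator:
    def __init__(self, k=1000):
        self.k = k
        self.count = 0
        self.sums = defaultdict(float)   # factor name -> accumulated seconds
        self.emitted = []                # one vector per k completed jobs

    def record(self, factor, seconds):
        self.sums[factor] += seconds

    def job_done(self):
        self.count += 1
        if self.count % self.k == 0:     # emit and reset at each k-job boundary
            self.emitted.append(dict(self.sums))
            self.sums.clear()

agg = FactorAggregator(k=2)
for job in range(4):
    agg.record("load_resrc", 0.01)       # "load_resrc" as named above
    agg.job_done()
print(agg.emitted)   # two vectors, one per 2 jobs
```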
I am hitting the same content service issue as @steVwonder:
2016-08-12T07:59:57.282310Z sched.debug[0]: job (5721) assigned new rdl.
2016-08-12T07:59:57.282431Z sched.info[0]: hostname: hype348, digest: FBA70831431781E7B00B83EF6EDA4D14DEC4311F
2016-08-12T07:59:57.282446Z sched.info[0]: broker found, rank: 1
2016-08-12T07:59:57.301192Z content-sqlite.err[0]: store: executing stmt: disk I/O error(778)
2016-08-12T07:59:57.301372Z broker.err[0]: content store: Input/output error
2016-08-12T07:59:57.301386Z broker.debug[0]: content flush begin
2016-08-12T07:59:57.301794Z broker.debug[0]: content flush +1 (dirty=1 pending=1)
2016-08-12T07:59:57.312971Z content-sqlite.err[0]: store: executing stmt: database disk image is malformed(11)
2016-08-12T07:59:57.313135Z broker.err[0]: content store: Invalid argument
2016-08-12T07:59:57.313147Z broker.debug[0]: content flush begin
2016-08-12T07:59:57.313467Z broker.debug[0]: content flush +1 (dirty=1 pending=1)
I think in this case the problem occurred because I didn't set --persist-filesystem or one of its related options. Is there a default directory the content service uses, so I can go there and check the db size?
If the instance is still running you can ask flux getattr scratch-directory, and the sqlite file is under the content subdirectory in that directory. Usually it's in /tmp, where there may be limited space available.
Try setting --persist-filesystem to one of the NFS tmp file servers.
My dry run testing also reports:
2016-08-12T16:46:30.227530Z connector-local.err[0]: send hb event to client 79392: Broken pipe
My guess is the waitjob client finished and the connector is reporting that disconnect condition... If this is indeed the cause, perhaps waitjob can be modified to do a slightly friendlier disconnect. I will take a look.
@garlick: thanks. OK, I will set this to an NFS path.
This is a somewhat related topic: for LC's flux installation, which our users will kick the tires on, does it make sense to configure --persist-filesystem to a large enough NFS mount (/nfs/tmp2 or similar) by default? My testing shows that this condition occurs at around 4000 jobs.
Hmm. Like in a cluster-specific config file? Sad to say, we don't have one at the moment :-(
I think probably the best way to handle that for now is to provide some documentation for early LC adopters so they can get set up properly...
Yeah. A guide document makes sense to me. We could have an extended quick start for LC users.
Hmmm. It seems 3 hours is not enough to schedule 10,000 unit sleep 0 jobs at 2 nodes:
hype356{dahn}12: cat slurm-20445.out
ok 1 - sched at MOAB.SLURM.00000002.00002Nx01B.unit.usleep.00010000.10800: size matches config
ok 2 - sched at MOAB.SLURM.00000002.00002Nx01B.unit.usleep.00010000.10800: sched module loaded
ok 3 - sched at MOAB.SLURM.00000002.00002Nx01B.unit.usleep.00010000.10800: persist fs attr matches
ok 4 - sched at MOAB.SLURM.00000002.00002Nx01B.unit.usleep.00010000.10800: BPN matches config
slurmd[hype245]: *** STEP 20445.0 CANCELLED AT 2016-08-13T18:13:59 DUE TO TIME LIMIT ***
slurmd[hype245]: *** JOB 20445 CANCELLED AT 2016-08-13T18:13:59 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
Hype allows direct submission to the pbatch partition. Use sbatch to submit your jobs and Moab won't get involved.
@dongahn, adding flux wreck purge periodically to your run using flux cron may also help speed up the 10K jobs run. See this comment for an example, and the surrounding text for how it can impact runtime. (That is, unless you need all kvs job state.)
Thanks @lipari and @grondo.
@lipari: I do plan to support various native workload managers, including sbatch and Moab, to begin with. But I planned to stick with Moab, as I didn't think using sbatch would make much difference. Hope this makes sense.
@grondo: thank you for the ref. I will ultimately try that as I come to understand the perf/scal characteristics better. In fact, periodic purging can be one of the optimizations we set by default to ensure scheduler scalability. All should be a result of the scheduler scalability milestone.
Now, that "3 hours is not enough" problem might have been a red herring. I ran this again last night and the broker.out shows:
2016-08-14T01:28:35.877737Z job.info[0]: got request job.submit
2016-08-14T01:28:35.878075Z job.info[0]: Setting job 2 to reserved
2016-08-14T01:28:35.879966Z job.info[0]: Setting job 2 to submitted
2016-08-14T01:28:35.880032Z sched.debug[0]: attempting job 2 state change from null to null
2016-08-14T01:28:35.880075Z sched.debug[0]: attempting job 2 state change from null to reserved
2016-08-14T01:28:35.881596Z sched.debug[0]: attempting job 2 state change from reserved to submitted
<CUT>
2016-08-14T03:28:56.969979Z lwj.10000.info[0]: job complete. exiting...
2016-08-14T03:28:56.969991Z lwj.10000.info[0]: archiving lwj 10000
2016-08-14T03:28:56.970234Z sched.debug[0]: attempting job 10000 state change from running to complete
2016-08-14T03:28:56.970408Z sched.debug[0]: Released resources for job 10000
2016-08-14T03:28:57.015645Z connector-local.debug[0]: unsubscribe hb
2016-08-14T03:28:57.015827Z connector-local.debug[0]: unsubscribe wreck.10000.
2016-08-14T03:28:57.390494Z broker.debug[0]: content purge: 295 entries
10K jobs did complete in about 2 hours. And I believe there is a bug in how I used waitjob.
I will take a look at it next week.
In any case, I have been thinking about some useful performance metrics that would allow me to quickly characterize the merits of scheduler performance. Ones that I thought would be useful:
I believe that these high-level metrics, collected from our search space, will give me some good representative configurations to look into further with a fine-toothed comb.
Instead of "executed", I would substitute "scheduled". Also, how do you adjust for job submission time? I would expect different results if you started with a queue of 10,000 pre-submitted jobs vs. submitting jobs as the test progresses. IOW, will the job submission activity and also average pending queue size be contributing factors?
@lipari: I understand where you're coming from with your suggestion of 'scheduled', but I actually think it makes sense for these initial high-level metrics to measure the entire throughput, beyond schedule-only overhead. What I think we should gauge is whether scheduling is on the critical path in achieving reasonable results on these metrics. If scheduling is not on the critical path, then we should already be doing reasonably well. Because of this, unless you disagree strongly, I would like to use my suggested metrics first and see where we are.
As to submission time: since the purpose of this testing is to create a stress case, I think it makes sense to submit all of the 10K jobs up front. Once we understand this worst case, we can later see whether delays in submit times make our system perform better.
It seems my initial data points indicate I will need to look at more detailed performance profiles of flux-sched, and that we have some work to do. Looking at just two configurations to test the waters, the job throughput signature under flux-sched looks pretty interesting to me.
In general, it looks worse than without the scheduler, if I understand this comment suggested by @grondo. My initial theory is that there are some operations in sched whose performance is affected by the length of the queues. It is also somewhat counterintuitive that throughput gets worse at 4 nodes relative to 2 nodes. This could mean that resrc lookup/traverse time is on the critical path.
Job Group ID | JEPS |
---|---|
1 | 32.1 |
2 | 17.2 |
3 | 0.4 |
4 | 1.5 |
5 | 1.7 |
6 | 2.0 |
7 | 2.3 |
8 | 2.6 |
9 | 3.1 |
10 | 3.8 |
Job Group ID | JEPS |
---|---|
1 | 31.5 |
2 | 18.5 |
3 | 0.27 |
4 | 1.1 |
5 | 1.2 |
6 | 1.4 |
7 | 1.7 |
8 | 2.2 |
9 | 3.0 |
10 | 4.0 |
@dongahn Just looking at the 2 node/4 node results: the difference seems minor in throughput, which makes me wonder if it is flux-sched or something else. I'm confused how you are measuring the JEPS: are you timing each iteration of the test case (t1003-stress) and dividing (1000 jobs/total secs)?
Is it possible to get actual submit time from flux-submit as well? If there was a minor submission delay for some of the jobs when going from 2 to 4 nodes, the overall turnaround time might be different. Also, there might be a delay introduced by sharness calls when going from 2-4 nodes if you're using that as a launch interface.
Also, I'm unclear about how the Job Group IDs differ: why is "Group ID 1" in the range of 31-32, while several other groups are below 4? Any thoughts?
Re: throughput between 2 and 4 nodes: the difference is small, and the performance-dominant factor might be something other than sched. But job throughput should be higher as you scale up with this workload, I think (unit size = 1 process), since there are more nodes to schedule the jobs on. I think there is a choking point somewhere.
@tpatki: Thank you for the comments and sorry for the short response above (I was at a church meeting).
I'm confused how you are measuring the JEPS: are you timing each iteration of the test case (t1003-stress) and dividing (1000 jobs/total secs)?
Overall JEPS: # of jobs / elapsed time; here, 10000 / elapsed time. The time-wise JEPS is a poor man's time-series analysis: for each separate group of 1000 jobs, I measured JEPS as 1000 / the elapsed time for those 1000 jobs.
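The per-group calculation above can be sketched as follows (a hypothetical helper, not test-suite code; it assumes completion timestamps are available in job-id order):

```python
# Sketch: overall and per-group JEPS from completion timestamps,
# grouped every 1000 jobs as described above.
def jeps_by_group(completion_times, start_time, group_size=1000):
    groups = []
    prev = start_time
    for i in range(0, len(completion_times), group_size):
        chunk = completion_times[i:i + group_size]
        elapsed = chunk[-1] - prev          # time this group of jobs took
        groups.append(len(chunk) / elapsed)
        prev = chunk[-1]
    overall = len(completion_times) / (completion_times[-1] - start_time)
    return overall, groups

# synthetic example: 2000 jobs, one completing every 0.1 s
times = [0.1 * (i + 1) for i in range(2000)]
overall, groups = jeps_by_group(times, 0.0, group_size=1000)
print(round(overall, 1), [round(g, 1) for g in groups])   # 10.0 [10.0, 10.0]
```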
Is it possible to get actual submit time from flux-submit as well? If there was a minor submission delay for some of the jobs when going from 2 to 4 nodes, the overall turnaround time might be different. Also, there might be a delay introduced by sharness calls when going from 2-4 nodes if you're using that as a launch interface.
I use two time ranges as my elapsed time:
1. begin clock just before the submission of the first job, and end clock when waitjob sees the completion of the 10000th job;
2. the starting-time of the first job as my begin clock, and the complete-time of the 10000th job as my end clock.

The throughput metrics don't differ much under either elapsed time. Apparently, submission of jobs overlaps with execution of some of the jobs submitted earlier.
Also, I'm unclear about how the Job Group IDs differ: why is "Group ID 1" in the range of 31-32, while several other groups are below 4? Any thoughts?
Group ID 1 covers the jobs from jobid 1 to 999. This is why I somewhat suspect there may be issues with queue length. Perhaps the first few jobs are executed fast with a short job queue, and then once the pending job queue grows beyond a certain length, some scheduler operations degrade. In fact, group ID 2 (jobs 1000-1999) has the lowest job throughput, and this interval may be the one where the scheduler has to deal with the longest queues.
Having some performance profiles can help me reason about these.
Job Group ID | JEPS |
---|---|
1 | 30.4 |
2 | 11.5 |
3 | 0.5 |
4 | 0.2 |
5 | 0.3 |
6 | 0.3 |
7 | 0.4 |
8 | 0.6 |
9 | 1.0 |
10 | 3.0 |
OK. This code will definitely perform poorly with long queues. I think this is relatively easy to restructure and optimize.
This code can also have performance impacts for high-throughput job scheduling. It iterates through all of the pending jobs to find a job that can be scheduled/executed on each job or resource event. Unless we do conservative backfill, perhaps we should break out of the loop once there are no resources left...
But for any backfill scheduling with some resources left to use, we probably cannot avoid this operation fully. Perhaps we should add a configurable parameter that sets the number of pending jobs to consider for backfill.
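The idea can be sketched as follows (hypothetical pseudocode in Python, not the actual sched C code): walk the pending queue on each event, cap how deep backfill searches, and exit early once resources are exhausted.

```python
# Sketch: pending-queue walk with a configurable depth limit for backfill
# and an early exit once no resources remain (the names are illustrative).
def schedule_pending(pending_jobs, free_cores, queue_depth=30):
    scheduled = []
    for i, job in enumerate(pending_jobs):
        if i >= queue_depth:          # cap how far backfill searches
            break
        if free_cores == 0:           # nothing left; FCFS can stop here
            break
        if job["cores"] <= free_cores:
            free_cores -= job["cores"]
            scheduled.append(job["id"])
    return scheduled, free_cores

pending = [{"id": 1, "cores": 4}, {"id": 2, "cores": 8}, {"id": 3, "cores": 2}]
print(schedule_pending(pending, free_cores=6))   # ([1, 3], 0)
```

Job 2 is skipped (too large for the remaining cores) but job 3 backfills into the leftover capacity; with a strict FCFS policy the loop would instead stop at job 2.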
OK. Using wreckrun at 2CNs, the throughput is about 4x better. Once I get a wreckrun number at 32CNs, I will add some optimizations to sched to see if sched's FCFS can at least match wreckrun performance.
Job Group ID | JEPS |
---|---|
1 | 32.5 |
2 | 18.2 |
3 | 11.9 |
4 | 9.5 |
5 | 7.9 |
6 | 6.5 |
7 | 5.3 |
8 | 4.3 |
9 | 3.4 |
10 | 4.6 |
For both FCFS and backfill scheduling, adding a configurable iteration limit to the number of pending jobs to search is a fine idea and will help improve performance.
@lipari, agreed. thanks.
With wreckrun at 32CNs, the throughput is about 25x better. I think there can be optimizations at the flux-core level, but since this is a scheduler milestone, I will focus on scheduler optimization to get closer to what wreckrun can do.
Job Group ID | JEPS |
---|---|
1 | 32.7 |
2 | 18.4 |
3 | 12.8 |
4 | 9.8 |
5 | 7.9 |
6 | 6.4 |
7 | 5.2 |
8 | 4.1 |
9 | 3.2 |
10 | 4.9 |
To throw another potential bottleneck into the fray: I believe that the KVS could be a bottleneck. Submitting all of the jobs at once will put them all into a single KVS directory. This would certainly explain the slowdown as more jobs are added to the directory. It could also explain why performance seems to rebound after Job Group ID 3: jobs are completing and being moved to the KVS job archive, which speeds up future KVS operations on pending/running jobs.
@SteVwonder Re: KVS. I agree; that was my guess too. I think we need a clean way of testing the scaling characteristics of both core and sched in isolation, and then putting them together. I think Caliper data analysis will be helpful.
@SteVwonder: I believe it is one of the bottlenecks. And I need to tease apart the bottlenecks in the scheduler, the KVS, and other core services so that we can tackle the problem in a manageable way. I think building a performance model for our scheduler will go a long way towards teasing apart the performance-dominant factors. Stay tuned.
I think I now understand some of the steep throughput degradation after job 3000 (e.g., JEPS for Job Group ID 3 at 2CN is 0.4, as shown in this comment). One of the factors is essentially head-of-line blocking at the message queue for sched.
Because sched isn't processing the reserved and submit job events (which enqueue a job onto its own pending job queue) as quickly as possible, these job event messages increasingly queue up.
And when enough of these job-creation events are queued up, they block delivery of other important resource events (e.g., a job complete) that the scheduler needs in order to do actual scheduling. In the meantime, the resources sit idle.
This is aggravated by the fact that sched calls schedule_jobs for each job-creation event. I talked with both @garlick and @lipari, and it seems we can use a simple optimization scheme (e.g., delayed schedule_jobs for job submit events) to improve sched's job enqueuing rate.
I have a concept implementation which shows that this, along with the queue length limited to 30, can improve the overall job throughput by a factor of 2.3 at 2CN. IOW, the makespan of 10K jobs at this small scale goes from ~90 mins to ~40 mins.
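The "delayed schedule_jobs" idea can be sketched like this (an illustrative model, not the actual sched implementation; the class and batch size are hypothetical): coalesce submit events and run the scheduling loop once per batch, while resource events still trigger scheduling eagerly.

```python
# Sketch: batch job-submit events so schedule_jobs runs once per batch
# instead of once per submit, improving the enqueue rate.
class Sched:
    def __init__(self, batch=50):
        self.batch = batch
        self.pending_since_sched = 0
        self.schedule_calls = 0

    def on_submit(self, job):
        self.pending_since_sched += 1
        if self.pending_since_sched >= self.batch:
            self.schedule_jobs()      # deferred: only every `batch` submits

    def on_resource_event(self):
        self.schedule_jobs()          # completions still schedule eagerly

    def schedule_jobs(self):
        self.schedule_calls += 1
        self.pending_since_sched = 0

s = Sched(batch=50)
for j in range(200):
    s.on_submit(j)
print(s.schedule_calls)   # 4 (instead of 200 without batching)
```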
This is aggravated by the fact that sched calls schedule_jobs for each job creation event. I talked w/ both @garlick and @lipari, and it seems we can use a simple optimization scheme (e.g., delayed schedule_jobs for job submit event) to improve sched's job enqueuing rate.
I was about to add some pointers to kvs commit aggregation code which does something similar but maybe that's a digression for this issue. If you want to open another issue on this I'd be happy to pitch in there.
Yes, I was going to open up an issue for this item. Thanks.
OK. As we all suspected, KVS performance degradation is definitely one of the major factors.
I added one more sharness test where each of 10 subtests runs 1000 jobs and prints out the performance metrics for its 1000 jobs (using my flux build with all of the above optimization concept implementations).
This is different from the first test because in this mode the scheduler essentially starts fresh at each subtest boundary (e.g., no jobs in the pending queue). Nevertheless, I found that JEPS goes down progressively and consistently.
Similar to @grondo's suggestion, I unlinked all of the lwj entries at the end of each subtest, and this bumped JEPS up to around 30 for all subtests.
Since we will address the KVS issue ultimately, I will rewrite my tests to use wreck purge periodically. The trick is to get the jobs' timestamps before a purge, but this should be an easy enough task.
This may hide some other bottlenecks in sched that only manifest under a degraded condition, but at least it will allow me to vary more parameters and see their perf/scal characteristics.
@grondo: is there any way to tell wreck purge not to purge certain LWJs based on their job id? I wasn't able to find one reading its man page, so the answer is probably no, but I thought I should ask.
The reason I want something like that is that I don't need to collect the timestamp of every job, but I thought it would be good to collect the timestamp of every Nth job (e.g., every 1000 jobs) to compute some time-series data.
I had a poor man's kvs-unlink script that I registered with flux cron, but that stresses the kvs too much; the performance doesn't go up as much as I expected.
@dongahn, no, there is no such option now, but I'll check to see if that would be an easy thing to add.
@dongahn, check out the experimental change to flux-wreck purge in grondo/flux-core@ed7f0add77a7e3b59811c91ac0f1e5f802ea0e78. This adds a -k, --keep=CODE option to purge which will skip the purge of LWJ ids for which the Lua CODE returns true.
You could cherry-pick this commit into your working tree and see if it does what you need.
E.g., the following example keeps all job ids that are multiples of 5:
(note the escaped quotes in the flux cron invocation so the expression is not lost in the scriptlet registered with cron)
grondo@flux-core:~/workspace (purge-keep) $ src/cmd/flux start
grondo@flux-core:~/workspace (purge-keep) $ flux cron event -n 10 wreck.state.complete flux wreck purge -Rt 10 -k \'id % 5 == 0\'
event: cron-1 created
grondo@flux-core:~/workspace (purge-keep) $ for i in `seq 0 100`; do flux wreckrun /bin/true; done
grondo@flux-core:~/workspace (purge-keep) $ flux wreck ls
ID NTASKS STATE START RUNTIME RANKS COMMAND
5 1 complete 2016-08-25T14:39:17 0.057s 0 true
10 1 complete 2016-08-25T14:39:19 0.047s 0 true
15 1 complete 2016-08-25T14:39:21 0.054s 0 true
20 1 complete 2016-08-25T14:39:23 0.057s 0 true
25 1 complete 2016-08-25T14:39:25 0.068s 0 true
30 1 complete 2016-08-25T14:39:27 0.057s 0 true
35 1 complete 2016-08-25T14:39:29 0.051s 0 true
40 1 complete 2016-08-25T14:39:31 0.051s 0 true
45 1 complete 2016-08-25T14:39:33 0.066s 0 true
50 1 complete 2016-08-25T14:39:35 0.053s 0 true
55 1 complete 2016-08-25T14:39:37 0.043s 0 true
60 1 complete 2016-08-25T14:39:38 0.066s 0 true
65 1 complete 2016-08-25T14:39:40 0.050s 0 true
70 1 complete 2016-08-25T14:39:42 0.045s 0 true
75 1 complete 2016-08-25T14:39:44 0.048s 0 true
80 1 complete 2016-08-25T14:39:46 0.074s 0 true
85 1 complete 2016-08-25T14:39:48 0.050s 0 true
90 1 complete 2016-08-25T14:39:50 0.070s 0 true
95 1 complete 2016-08-25T14:39:52 0.044s 0 true
100 1 complete 2016-08-25T14:39:54 0.047s 0 true
Nice! This should work perfectly for my use case. (id - sess_start_id) % 1000 == 0. Huge thanks!
@garlick or @grondo: it seems there is a performance issue with the current flux-core master.
With 181d3ca, my scheduler's job throughput at 2CNs for 2k jobs is only 4.7, whereas with the old snapshot fd8723b, which I have been using, the throughput is 23.2. This is almost a 5x difference.
Do you have any canned performance suite to tease apart this issue?
OK. I cherry-picked @grondo's new wreck purge option into my old flux-core snapshot (fd8723b).
test_expect_success "sched at ${SCAL_TST_CONFIG}: ${MAIN_TST_MSG}" '
<CUT>
flux cron event -n 1000 wreck.state.complete flux wreck purge -v -Rt 128 -k \'\''id-${st_jobid} == 0 or \(id-${st_jobid}+1\)%1000 == 0\'\'' &&
With this, in combination with all of the sched optimization concept mods, JEPS is now at 24.15 for 10K jobs! The makespan for 10K jobs has gone down from the original 90 mins to ~7 mins for this config.
Do you have any canned performance suite to tease apart this issue?
No, but we should figure something out. If we could get a test that runs on the order of ~1m it would be easy to bisect this one between merge commits.
I'll try experiments with just back-to-back wreckruns.
I'll try experiments with just back-to-back wreckruns.
I'm using the following simplest of simple testcases:
time src/cmd/flux start -s 4 'for i in $(seq 1 100); do flux wreckrun -n 4 sleep 0; done'
And consistently seeing ~25s for current master, and consistently ~10s on your "good" sha1 (fd8723b).
I'm bisecting now to get a hint where the problem was introduced.
@dongahn, opened flux-framework/flux-core#784 to track the slowdown.
When the job size increases in proportion to the compute-node count, the job throughput appears to decrease roughly linearly. With the half sizing policy, where each of the 10K flux jobs in a flux instance of N compute nodes has 0.5 x N tasks, the following are the overall JEPS and TEPS (note that this is with flux cron event -n 1000 wreck.state.complete flux wreck purge -v -Rt 216):
CNs | Flux Job Task Size | JEPS | TEPS |
---|---|---|---|
2 | 1 | 23.2 | 23.2 |
4 | 2 | 4.8 | 9.6 |
8 | 4 | 2.4 | 9.6 |
16 | 8 | 1.7 | 13.8 |
32 | 16 | 1.2 | 18.85 |
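TEPS isn't spelled out in the thread, but the numbers are consistent with tasks executed per second, i.e., JEPS times the per-job task count (the small mismatches at 16 and 32 CNs look like rounding in the reported JEPS). A quick check under that assumed definition:

```python
# Hypothetical check: TEPS ~= JEPS * tasks-per-job for the table above.
rows = [  # (CNs, tasks per job, reported JEPS, reported TEPS)
    (2, 1, 23.2, 23.2), (4, 2, 4.8, 9.6), (8, 4, 2.4, 9.6),
    (16, 8, 1.7, 13.8), (32, 16, 1.2, 18.85),
]
for cns, tasks, jeps, teps in rows:
    # agreement is within ~0.4 everywhere, consistent with rounded JEPS
    assert abs(jeps * tasks - teps) < 0.5, (cns, jeps * tasks, teps)
```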
It is also interesting that at 4CNs the performance degradation over time came back:
Job Group ID | JEPS |
---|---|
1 | 32.6 |
2 | 25.2 |
3 | 21.7 |
4 | 17.8 |
5 | 2.2 |
6 | 3.2 |
7 | 3.4 |
8 | 3.5 |
9 | 3.8 |
10 | 3.5 |
I think kvs plays a role here, but I will look at the half sizing policy more closely.
I am closing this, as we discovered a slew of scalability and performance problems by using PerfExplore and added optimizations to address them. Not all of them, but many.
Goals

Understand the performance and scalability characteristics of flux-sched as we vary the number of nodes/cores managed by a flux instance (i.e., up to CTS-1's scales), job geometries (e.g., job sizes), and job submission rates. At the end, we will have figure-of-merit numbers as our baselines and the performance profiles for each test configuration for immediate or future performance improvements.

Order of Testing

Our campaign should be done in an "easy or more confident" to "hard or less confident" fashion so that we can address easier issues along the way. (Note that this is still a draft: I will need some discussion to decide what we need to test for different scheduling algorithms and exit criteria, and to refine the testing coverage.)
Test Types

- sleep 0 jobs in core-scheduling mode; compute the number of executed jobs per minute as the figure of merit and a performance profile per 1000 jobs
- MPI_sleep 0 jobs in core-scheduling mode; compute the number of executed jobs per minute as the figure of merit and a performance profile per 1000 jobs
- sleep 0 jobs in core-scheduling mode; compute the number of executed jobs per minute as the figure of merit and a performance profile per 1000 jobs
- MPI_sleep 0 jobs in core-scheduling mode; compute the number of executed jobs per minute as the figure of merit and a performance profile per 1000 jobs
- sleep 0 jobs whose sizes are powers of 2 up to 1/2 CNs processes, cycling through these sizes in core-scheduling mode; compute the number of executed jobs per minute as the figure of merit and a performance profile per 1000 jobs
- MPI_sleep 0 jobs whose sizes are powers of 2 up to 1/2 CNs processes, cycling through these sizes in core-scheduling mode; compute the number of executed jobs per minute as the figure of merit and a performance profile per 1000 jobs

One risk: wrexec's many "fork/exec" calls may hit some "resource limits", in which case we may need to introduce a mode in which the fork/exec are skipped.