flux-framework / distribution


CTS-1 flux-sched testing plan #14

Closed: dongahn closed this issue 7 years ago

dongahn commented 8 years ago
Goals

Understand the performance and scalability characteristics of flux-sched as we vary the number of nodes/cores managed by a Flux instance (i.e., up to CTS-1's scales), job geometries (e.g., job sizes), and job submission rates. At the end, we will have figure-of-merit numbers as our baselines, plus performance profiles for each test configuration to guide immediate or future performance improvements.

Order of Testing

Our campaign should proceed from "easy or more confident" to "hard or less confident" so that we can address easier issues along the way. (Note that this is still a draft: I will need some discussion to decide what we need to test for different scheduling algorithms, settle exit criteria, and refine the testing coverage.)

Test Types
dongahn commented 8 years ago

Ok. I talked to @tpatki. She will propose an initial set of high-level performance factors as part of flux-sched's #166 and refine instrumentation points before finalizing her implementation.

The initial high-level performance factors may look like:

She will propose ways to aggregate those times over configurable numbers of jobs (i.e., emitting a vector of these times every k jobs).

dongahn commented 8 years ago

I am hitting the same content service issue as @steVwonder:

2016-08-12T07:59:57.282310Z sched.debug[0]: job (5721) assigned new rdl.
2016-08-12T07:59:57.282431Z sched.info[0]: hostname: hype348, digest: FBA70831431781E7B00B83EF6EDA4D14DEC4311F
2016-08-12T07:59:57.282446Z sched.info[0]: broker found, rank: 1
2016-08-12T07:59:57.301192Z content-sqlite.err[0]: store: executing stmt: disk I/O error(778)
2016-08-12T07:59:57.301372Z broker.err[0]: content store: Input/output error
2016-08-12T07:59:57.301386Z broker.debug[0]: content flush begin
2016-08-12T07:59:57.301794Z broker.debug[0]: content flush +1 (dirty=1 pending=1)
2016-08-12T07:59:57.312971Z content-sqlite.err[0]: store: executing stmt: database disk image is malformed(11)
2016-08-12T07:59:57.313135Z broker.err[0]: content store: Invalid argument
2016-08-12T07:59:57.313147Z broker.debug[0]: content flush begin
2016-08-12T07:59:57.313467Z broker.debug[0]: content flush +1 (dirty=1 pending=1)

I think the problem occurred in this case because I didn't set --persist-filesystem or one of its related options. Is there a default directory the content service uses, so I can go there and check the db size?

garlick commented 8 years ago

If the instance is still running, you can ask flux getattr scratch-directory, and the sqlite file is under the content subdirectory in that directory. Usually it's in /tmp, where there may be limited space available.

Try setting --persist-filesystem to one of the NFS tmp file servers.
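For reference, a minimal sketch of both steps, assuming the --persist-filesystem spelling used in this thread and an illustrative /nfs/tmp2 path:

# Inspect the content store of a running instance
SCRATCH=$(flux getattr scratch-directory)
ls -lh ${SCRATCH}/content        # the sqlite file lives under this subdirectory

# Relaunch with the persistent store on an NFS tmp filesystem instead of /tmp
# (option spelling as used in this thread; path is illustrative)
flux start --persist-filesystem=/nfs/tmp2/${USER}/flux ...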

dongahn commented 8 years ago

My dry run testing also reports:

2016-08-12T16:46:30.227530Z connector-local.err[0]: send hb event to client 79392: Broken pipe

My guess is the waitjob client finished and the connector is reporting that disconnect condition... If this is indeed the cause, perhaps waitjob can be modified to do a slightly friendlier disconnect. I will take a look.

dongahn commented 8 years ago

@garlick: thanks. OK, I will set this to an NFS path.

This is a somewhat related topic, but for LC's Flux installation, which our users will kick the tires on, does it make sense to configure --persist-filesystem to point to a large enough NFS filesystem (e.g., /nfs/tmp2 somewhere) by default? My testing shows that this condition occurs at around 4000 jobs.

garlick commented 8 years ago

Hmm. Like in a cluster-specific config file? Sad to say, we don't have one at the moment :-(

I think probably the best way to handle that for now is to provide some documentation for early LC adopters so they can get set up properly...

dongahn commented 8 years ago

Yeah. A guide document makes sense to me. We could have an extended quick start for LC users.

dongahn commented 8 years ago

Hmmm. It seems 3 hours is not enough to schedule 10,000 unit sleep 0 jobs on 2 nodes:

hype356{dahn}12: cat slurm-20445.out 
ok 1 - sched at MOAB.SLURM.00000002.00002Nx01B.unit.usleep.00010000.10800: size matches config
ok 2 - sched at MOAB.SLURM.00000002.00002Nx01B.unit.usleep.00010000.10800: sched module loaded
ok 3 - sched at MOAB.SLURM.00000002.00002Nx01B.unit.usleep.00010000.10800: persist fs attr matches
ok 4 - sched at MOAB.SLURM.00000002.00002Nx01B.unit.usleep.00010000.10800: BPN matches config
slurmd[hype245]: *** STEP 20445.0 CANCELLED AT 2016-08-13T18:13:59 DUE TO TIME LIMIT ***
slurmd[hype245]: *** JOB 20445 CANCELLED AT 2016-08-13T18:13:59 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
lipari commented 8 years ago

Hype allows direct submission to the pbatch partition. Use sbatch to submit your jobs and Moab won't get involved.

grondo commented 8 years ago

@dongahn, adding flux wreck purge periodically to your run using flux cron may also help speed up the 10K jobs run. See this comment for an example, and the surrounding text for how it can impact runtime.

(That is, unless you need all kvs job state)

dongahn commented 8 years ago

Thanks @lipari and @grondo.

@lipari: I do plan to support various native workload managers, including sbatch and Moab. But I planned to stick with Moab to begin with, as I didn't think using sbatch would make much difference. Hope this makes sense.

@grondo: thank you for the reference. I will ultimately try that as I come to understand the perf/scal characteristics better. In fact, periodic purging could be one of the optimizations we enable by default to ensure scheduler scalability. All of this should come out of the scheduler scalability milestone.

Now, that "3 hours is not enough" problem might have been a red herring. I ran this again last night and broker.out shows:

2016-08-14T01:28:35.877737Z job.info[0]: got request job.submit
2016-08-14T01:28:35.878075Z job.info[0]: Setting job 2 to reserved
2016-08-14T01:28:35.879966Z job.info[0]: Setting job 2 to submitted
2016-08-14T01:28:35.880032Z sched.debug[0]: attempting job 2 state change from null to null
2016-08-14T01:28:35.880075Z sched.debug[0]: attempting job 2 state change from null to reserved
2016-08-14T01:28:35.881596Z sched.debug[0]: attempting job 2 state change from reserved to submitted

<CUT> 

2016-08-14T03:28:56.969979Z lwj.10000.info[0]: job complete. exiting...
2016-08-14T03:28:56.969991Z lwj.10000.info[0]: archiving lwj 10000
2016-08-14T03:28:56.970234Z sched.debug[0]: attempting job 10000 state change from running to complete
2016-08-14T03:28:56.970408Z sched.debug[0]: Released resources for job 10000
2016-08-14T03:28:57.015645Z connector-local.debug[0]: unsubscribe hb
2016-08-14T03:28:57.015827Z connector-local.debug[0]: unsubscribe wreck.10000.
2016-08-14T03:28:57.390494Z broker.debug[0]: content purge: 295 entries

10K jobs did complete in about 2 hours. And I believe there is a bug in how I used waitjob. I will take a look at it next week.

dongahn commented 8 years ago

In any case, I have been thinking about some useful performance metrics that would allow me to quickly characterize the merits of scheduler performance. Ones that I thought would be useful:

I believe that these high-level metrics, collected across our search space, will give me some good representative configurations to further look into with a fine-toothed comb.

lipari commented 8 years ago

Instead of "executed", I would substitute "scheduled". Also, how do you adjust for job submission time? I would expect different results if you started with a queue of 10,000 pre-submitted jobs vs. submitting jobs as the test progresses. IOW, will the job submission activity and also average pending queue size be contributing factors?

dongahn commented 8 years ago

@lipari: I understand where you're coming from with your suggestion of 'scheduled', but I actually think it makes sense to have these initial high-level metrics measure the entire throughput, beyond just the schedule-only overhead. What I think we should gauge is whether scheduling is on the critical path for achieving reasonable results on these metrics. If scheduling is not on the critical path, then we should already be doing reasonably well. Because of this, unless you disagree strongly, I would like to use my suggested metrics first and see where we are.

As to submission time: since the purpose of this testing is to create a stress case, I think it makes sense to submit all of the 10K jobs up front. Once we understand this worst case, we can later see whether adding delays between submit times makes our system perform better.

dongahn commented 8 years ago

My initial data points indicate that I will need to look at more detailed performance profiles of flux-sched and that we have some work to do. Looking at just two configurations to test the waters, the job throughput signature under flux-sched looks pretty interesting to me.

In general, it looks worse than without the scheduler, if I understand this comment from @grondo correctly. My initial theory is that there are some operations in sched whose performance is affected by queue length. It is also somewhat counterintuitive that throughput gets worse at 4 nodes relative to 2 nodes. This could mean that resrc lookup/traversal time is on the critical path.

tpatki commented 8 years ago

@dongahn Just looking at the 2-node/4-node results: the difference in throughput seems minor, which makes me wonder whether it is flux-sched or something else. I'm confused about how you are measuring JEPS: are you timing each iteration of the test case (t1003-stress) and dividing (1000 jobs/total secs)?

Is it possible to get actual submit time from flux-submit as well? If there was a minor submission delay for some of the jobs when going from 2 to 4 nodes, the overall turnaround time might be different. Also, there might be a delay introduced by sharness calls when going from 2-4 nodes if you're using that as a launch interface.

tpatki commented 8 years ago

Also, I'm unclear about how the Job Group IDs differ: why is "Group ID 1" in the range of 31-32, while several other groups are below 4? Any thoughts?

dongahn commented 8 years ago

Re: throughput between 2 and 4 nodes: the difference is small, and the dominant performance factor might be something other than sched. But job throughput should get higher as you scale up with this workload, I think (unit size = 1 process), since there are more nodes on which to schedule the jobs. I think there is a choking point somewhere.

dongahn commented 8 years ago

@tpatki: Thank you for the comments and sorry for the short response above (I was at a church meeting).

I'm confused how you are measuring the JEPS: are you timing each iteration of the test case (t1003-stress) and dividing (1000 jobs/total secs)?

Overall JEPS: number of jobs / elapsed time, hence 10000 / elapsed time here. Time-wise JEPS is a poor man's time series analysis: for each separate group of 1000 jobs, I measured JEPS as 1000 / elapsed time for that group of 1000 jobs.
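To make the two formulas concrete, here is a hypothetical post-processing sketch; it assumes a file with one per-job completion timestamp (epoch seconds) per line, which is not something the test emits as-is:

# Overall JEPS = total jobs / total elapsed time (approximated here from the
# first to the last completion); per-group JEPS = 1000 / elapsed time within
# each consecutive block of 1000 jobs.
awk '{ t[NR] = $1 }
     END {
       printf "overall JEPS: %.2f\n", NR / (t[NR] - t[1])
       for (g = 1000; g <= NR; g += 1000)
         printf "group %d JEPS: %.2f\n", g / 1000, 1000 / (t[g] - t[g - 999])
     }' completion_times.txt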

Is it possible to get actual submit time from flux-submit as well? If there was a minor submission delay for some of the jobs when going from 2 to 4 nodes, the overall turnaround time might be different. Also, there might be a delay introduced by sharness calls when going from 2-4 nodes if you're using that as a launch interface.

I use two time ranges as my elapsed time.

The throughput metrics don't differ much under either elapsed time. Apparently, job submission and the execution of some of the jobs submitted earlier overlap.

Also, I'm unclear about how the Job Group IDs differ: why is "Group ID 1" in the range of 31-32, while several other groups are below 4? Any thoughts?

Group ID 1 covers the metrics for jobs with jobid 1 to 999. This is why I somewhat suspect there may be an issue with queue length. Perhaps the first few jobs are executed quickly while the job queue is short, and then once the pending job queue grows beyond a certain length, some scheduler operations degrade. In fact, group ID 2 (1000-1999) has the lowest job throughput, and this interval may be the one in which the scheduler has to deal with the longest queues.

Having some performance profiles can help me reason about these.

dongahn commented 8 years ago

OK. This code will definitely perform poorly with long queues. I think this is relatively easy to restructure to optimize.

dongahn commented 8 years ago

This code can also have a performance impact for high-throughput job scheduling. It iterates through all of the pending jobs to find a job that can be scheduled/executed on each job or resource event. Unless we do conservative backfill, perhaps we should break out of the loop once there are no resources left...

But for any backfill scheduling with some resources left to use, we probably cannot avoid this operation entirely. Perhaps we should add a configurable scheduling parameter that sets the number of pending jobs to consider for backfill, as sketched below.
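To make that concrete, a purely hypothetical sketch of such a knob (the queue-depth argument name is illustrative only, not an existing sched option):

# Illustrative only: cap how many pending jobs the scheduling loop examines
# per job/resource event so that very long queues don't dominate each event.
flux module load sched queue-depth=30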

dongahn commented 8 years ago

OK. Using wreckrun at 2CNs, the throughput is about 4x better. Once I get a wreckrun number at 32CNs, I will add some optimizations to sched to see if sched's FCFS can at least match wreckrun performance.

lipari commented 8 years ago

For both FCFS and backfill scheduling, adding a configurable iteration limit on the number of pending jobs to search is a fine idea and will help improve performance.

dongahn commented 8 years ago

@lipari, agreed. thanks.

dongahn commented 8 years ago

With wreckrun at 32CNs, the throughput is about 25x better. I think there could be optimizations at the flux-core level, but since this is a scheduler milestone, I will focus on scheduler optimization to get closer to what wreckrun can do.

SteVwonder commented 8 years ago

To throw another potential bottleneck into the fray: I believe that the KVS could be a bottleneck. Submitting all of the jobs at once will put them all into a single KVS directory. This would certainly explain the slowdown as more jobs are added to the directory. It could also explain why performance seems to rebound after Job Group ID 3: jobs are completing and being moved to the KVS job archive, which speeds up future KVS operations on pending/running jobs.

tpatki commented 8 years ago

@SteVwonder Re: KVS. I agree, that was my guess too. I think we need a clean way of testing the scaling characteristics of both core and sched in isolation, and then putting them together. I think Caliper data analysis will be helpful.

dongahn commented 8 years ago

@SteVwonder: I believe it is one of the bottlenecks. And I need to tease apart the bottlenecks from the scheduler, the KVS, and other core services so that we can tackle the problem in a manageable way. I think building a performance model for our scheduler will go a long way towards teasing apart the performance-dominant factors. Stay tuned.

dongahn commented 8 years ago

I think I now understand some of the steep throughput degradation after Job 3000 (e.g., JEPS for Job Group ID 3 at 2CN is 0.4, as shown in this comment). One of the factors is essentially head-of-line blocking at sched's message queue.

Because sched isn't processing the reserved and submitted job events (which enqueue a job onto its own pending job queue) quickly enough, these job event messages increasingly queue up.

And when enough of these job creation events are queued up, they delay delivery of other important resource events (e.g., a job complete), which the scheduler needs to do actual scheduling. In the meantime, the resources sit idle.

This is aggravated by the fact that sched calls schedule_jobs for each job creation event. I talked w/ both @garlick and @lipari, and it seems we can use a simple optimization scheme (e.g., delayed schedule_jobs for job submit event) to improve sched's job enqueuing rate.

I have a concept implementation which shows that this, along with the queue length limited to 30, can improve the overall job throughput by a factor of 2.3 at 2CN. IOW, the makespan of 10K jobs at this small scale goes from ~90 mins to ~40 mins.

garlick commented 8 years ago

This is aggravated by the fact that sched calls schedule_jobs for each job creation event. I talked w/ both @garlick and @lipari, and it seems we can use a simple optimization scheme (e.g., delayed schedule_jobs for job submit event) to improve sched's job enqueuing rate.

I was about to add some pointers to kvs commit aggregation code which does something similar but maybe that's a digression for this issue. If you want to open another issue on this I'd be happy to pitch in there.

dongahn commented 8 years ago

Yes, I was going to open up an issue for this item. Thanks.

dongahn commented 8 years ago

OK. As we all suspected, KVS performance degradation is definitely one of the major factors.

I added one more sharness test where each of 10 subtests runs 1000 jobs and prints out the performance metrics for its 1000 jobs (using my flux build w/ all of the above optimization concept implementations).

This is different from the first test because in this mode the scheduler essentially starts fresh at each subtest boundary (e.g., no jobs in the pending queue). Nevertheless, I found that JEPS goes down progressively and consistently.

Similar to @grondo's suggestion, I unlinked all of the lwj entries at the end of each subtest, and this bumped up JEPS all the way to 30 for all subtests.

Since we will ultimately address the KVS issue, I will rewrite my tests to use wreck purge periodically. The trick is to capture the jobs' timestamps before a purge, but this should be an easy enough task.

This may hide some other bottlenecks in sched that only manifest themselves under degraded conditions, but at least it will allow me to vary more parameters and see their perf/scal characteristics.

dongahn commented 8 years ago

@grondo: is there any way to tell wreck purge not to purge certain LWJs based on their job id? I wasn't able to find one reading its man page, so the answer is probably no, but I thought I should ask.

The reason I want something like that: I don't need to collect the timestamp of each job, but I thought it would be good to collect the timestamp of every Nth job (e.g., every 1000 jobs) to compute some time series data.

I had a poor man's kvs-unlink script that I registered with flux cron, but that stresses the KVS too much; performance doesn't go up as much as I expected.

grondo commented 8 years ago

@dongahn, no, there is no such option now, but I'll check to see if that would be an easy thing to add.

grondo commented 8 years ago

@dongahn, check out the experimental change to flux-wreck purge in grondo/flux-core@ed7f0add77a7e3b59811c91ac0f1e5f802ea0e78. This adds a -k, --keep=CODE option to purge, which skips purging LWJ ids for which the Lua CODE returns true.

You could cherry-pick this commit into your working tree and see if it does what you need.

E.g., the following example keeps all job ids that are multiples of 5 (note the escaped quotes in the flux cron invocation so the expression is not lost in the scriptlet registered with cron):

grondo@flux-core:~/workspace (purge-keep) $ src/cmd/flux start                                                                                                                               
grondo@flux-core:~/workspace (purge-keep) $ flux cron event -n 10 wreck.state.complete flux wreck purge -Rt 10 -k \'id % 5 == 0\'
event: cron-1 created
grondo@flux-core:~/workspace (purge-keep) $ for i in `seq 0 100`; do flux wreckrun /bin/true; done
grondo@flux-core:~/workspace (purge-keep) $ flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     5      1 complete   2016-08-25T14:39:17       0.057s        0 true
    10      1 complete   2016-08-25T14:39:19       0.047s        0 true
    15      1 complete   2016-08-25T14:39:21       0.054s        0 true
    20      1 complete   2016-08-25T14:39:23       0.057s        0 true
    25      1 complete   2016-08-25T14:39:25       0.068s        0 true
    30      1 complete   2016-08-25T14:39:27       0.057s        0 true
    35      1 complete   2016-08-25T14:39:29       0.051s        0 true
    40      1 complete   2016-08-25T14:39:31       0.051s        0 true
    45      1 complete   2016-08-25T14:39:33       0.066s        0 true
    50      1 complete   2016-08-25T14:39:35       0.053s        0 true
    55      1 complete   2016-08-25T14:39:37       0.043s        0 true
    60      1 complete   2016-08-25T14:39:38       0.066s        0 true
    65      1 complete   2016-08-25T14:39:40       0.050s        0 true
    70      1 complete   2016-08-25T14:39:42       0.045s        0 true
    75      1 complete   2016-08-25T14:39:44       0.048s        0 true
    80      1 complete   2016-08-25T14:39:46       0.074s        0 true
    85      1 complete   2016-08-25T14:39:48       0.050s        0 true
    90      1 complete   2016-08-25T14:39:50       0.070s        0 true
    95      1 complete   2016-08-25T14:39:52       0.044s        0 true
   100      1 complete   2016-08-25T14:39:54       0.047s        0 true
dongahn commented 8 years ago

Nice! This should work perfectly for my use case: (id - sess_start_id) % 1000 == 0. Huge thanks!

dongahn commented 8 years ago

@garlick or @grondo: it seems there is a performance issue with the current flux-core master.

With 181d3ca, my scheduler's job throughput at 2CNs for 2k jobs is only 4.7, whereas with the old snapshot fd8723b, which I have been using, the throughput is 23.2. This is almost a 5x difference.

Do you have any canned performance suite to tease apart this issue?

dongahn commented 8 years ago

OK. I cherry-picked @grondo's new wreck purge option into my old flux-core snapshot (fd8723b).

test_expect_success "sched at ${SCAL_TST_CONFIG}: ${MAIN_TST_MSG}" '
<CUT>
flux cron event -n 1000 wreck.state.complete flux wreck purge -v -Rt 128 -k \'\''id-${st_jobid} == 0 or \(id-${st_jobid}+1\)%1000 == 0\'\'' &&

With this in combination with all of the sched optimization concept mods, JEPS is now at 24.15 for 10K jobs! The makespan for 10K jobs has gone down to ~7 mins from the original ~90 mins for this config.

grondo commented 8 years ago

Do you have any canned performance suite to tease apart this issue?

No, but we should figure something out. If we could get a test that runs on the order of ~1m it would be easy to bisect this one between merge commits.

I'll try experiments with just back-to-back wreckruns.

grondo commented 8 years ago

I'll try experiments with just back-to-back wreckruns.

I'm using the following simplest of simple testcases:

 time src/cmd/flux start -s 4 'for i in $(seq 1 100); do flux wreckrun -n 4 sleep 0; done'

I'm consistently seeing ~25s for current master, and consistently ~10s on your "good" sha1 (fd8723b).

I'm bisecting now to get a hint where the problem was introduced.
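For completeness, a bisect driven by that reproducer might look like the following sketch (the 20s cutoff and the build step are assumptions; the two sha1s are the ones mentioned above):

# Mark bad (current master snapshot) and good commits, then let git bisect
# run the ~1 minute reproducer; runs slower than ~20s are treated as bad.
git bisect start 181d3ca fd8723b
git bisect run sh -c '
  make -j 4 >/dev/null || exit 125    # 125 = skip commits that fail to build
  timeout 20 src/cmd/flux start -s 4 \
    "for i in \$(seq 1 100); do flux wreckrun -n 4 sleep 0; done"
'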

grondo commented 8 years ago

@dongahn, opened flux-framework/flux-core#784 to track the slowdown.

dongahn commented 8 years ago

When the job size increases in proportion to the compute-node count, the job throughput appears to decrease roughly linearly. With a half-sizing policy, where each of the 10K Flux jobs in a Flux instance of N compute nodes is 0.5 x N tasks, the following are the overall JEPS and TEPS (note that this is with flux cron event -n 1000 wreck.state.complete flux wreck purge -v -Rt 216):

CNs   Flux Job Task Size   JEPS    TEPS
  2                    1   23.2    23.2
  4                    2    4.8     9.6
  8                    4    2.4     9.6
 16                    8    1.7    13.8
 32                   16    1.2   18.85

It is also interesting that at 4CNs the performance degradation over time came back:

Job Group ID   JEPS
           1   32.6
           2   25.2
           3   21.7
           4   17.8
           5    2.2
           6    3.2
           7    3.4
           8    3.5
           9    3.8
          10    3.5

I think the KVS plays a role here, but I will look at the half-sizing policy more closely.

dongahn commented 7 years ago

I am closing this, as we discovered a slew of scalability and performance problems by using PerfExplore and added optimizations to address them. Not all of them, but many.