splash app tracking issue

trws commented 6 years ago

This is to help tie together information for the splash app for sierra.

General problem: Run between 1 and 2 million jobs of one node, four processes, in size with proper affinity, each process should be near a GPU and have a GPU selected, over the course of a day across four thousand nodes. It should be possible to cancel a job in flight if its priority falls below a certain level, this logic doesn't have to be in flux but the cancel mechanism needs to be available. In order to deal with the large number of jobs, we need to be able to handle long queues and fast submission simultaneously for startup and a large number of jobs over the course of a full run, purging old jobs is an acceptable solution even if it loses job information for completed jobs.

Currently the filed issues are:

high priority:

1356 getting flux wreck purge going so we can garbage collect the wreck directories
1359 cancel jobs that haven't started yet
1361 hwloc system loading non-functional at scale

Found in this effort, non-blocking:

1355 czmq dependency checking issue with versions 4+
1360 jobspec: change configure logic to opt-out

Tagging for notification: @grondo, @garlick, @SteVwonder, @dongahn

garlick commented 6 years ago

(sent this in email to @trws but should have posted here for one stop shopping): Re: redirecting sqlite db to a filesystem with plenty of free space:

By default, flux creates its sqlite db in "broker.rundir", which is normally in /tmp. An instance that has a lot of KVS activity might want to redirect it elsewhere:

flux start -o,-Spersist-directory=/new/directory

/new/directory must already exist. A "content" subdirectory will be created within it to contain the database.

SteVwonder commented 6 years ago

one node, four processes, in size with proper affinity

Flux-core (via wreck) supports setting affinities through the cpumask key in the KVS [src code], but flux-sched does not currently populate that information when scheduling jobs. I have code in a branch of flux-sched that does create this, but I have only tested the code against ~v0.5.0 of flux-core, and its not the most polished c code...sorry 😞 .

each process should be near a GPU and have a GPU selected

Does Sierra support binding processes to specific GPUs? I imagine supporting this will require work both in wreck and flux-sched. How important is locality here? Is it just picking the "closest" socket (e.g., GPUs 1-3 are closer to socket1 and GPUs 4-6 are closer to socket2), or is it more complex?

It should be possible to cancel a job in flight if its priority falls below a certain level

Do they need the jobs to be considered/scheduled in order of their priority? Or are all jobs currently in the queue fair game for scheduling?

trws commented 6 years ago

The affinity stuff will almost certainly be handled by a script for this, just for time reasons. GPU affinity can be set by an environment variable, and closest socket is sufficient.

All jobs in the queue are fair game for scheduling, priority is being handled externally.

grondo commented 6 years ago

Copying here from #1356 -- instructions for running flux wreck purge periodically within an instance using flux cron:

Purge all but 1000 jobs after every 1000th job completion, running at most once every 10 seconds:
```
flux cron event --min-interval=10 --nth=1000 wreck.state.complete flux wreck purge -R -t 1000
```

dongahn commented 6 years ago

FYI -- I already sent @trws as to how one can optimize the scheduler for HTC workloads. But in case others need to support this types of workloads, queue-depth and delay-ached should be useful.

https://github.com/flux-framework/flux-sched/pull/190 https://github.com/flux-framework/flux-sched/issues/182 https://github.com/flux-framework/flux-sched/issues/183

https://github.com/flux-framework/flux-sched/pull/191 https://github.com/flux-framework/flux-sched/issues/185

Examples: https://github.com/flux-framework/flux-sched/blob/master/t/t1005-sched-params.t#L45 https://github.com/flux-framework/flux-sched/blob/master/t/t1005-sched-params.t#L67

SteVwonder commented 6 years ago

Also mentioned in the email chain: In this scenario (i.e., all jobs are equal size), FCFS is a good candidate over backfilling. FCFS avoids the cost of making reservations in the future as well as avoids attempting to schedule jobs in the queue that are provably not able to be scheduled. In addition to loading the sched.fcfs plugin, I believe you also need to set queue-depth to 1.

dongahn commented 6 years ago

I believe you also need to set queue-depth to 1.

Indeed! I was surprised when @lipari showed cases where FCFS actually does an out-of-order schedule. But in the case of splash, their job sizes are all the same so FCFS with queue-depth being 1 should be the cheapest.

trws commented 6 years ago

I’ll definitely try that, a lower queue depth helped a little bit, but it looks like we had some non-scalable things in the sched loop too. Particularly assisting priority scheduling by doing a sort on the entire linked-list of jobs every time schedule_jobs is entered, and then traversing the entire hardware tree to clear reservations every time it’s entered as well. There may be others, but those make the performance relatively unfortunate for anything over about 1000 nodes or 1500 jobs in the queue.

Sent from VMware Boxer

On March 16, 2018 at 10:39:36 AM PDT, Dong H. Ahn notifications@github.com wrote:

I believe you also need to set queue-depth to 1.

Indeed! I was surprised when @liparihttps://github.com/lipari showed cases where FCFS actually does an out-of-order schedule. But in the case of splash, their job sizes are all the same so FCFS with queue-depth being 1 should be the cheapest.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/flux-framework/flux-core/issues/1358#issuecomment-373790137, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAoStaAnZ6RsDV9Ax_kYaR2YpPeiCK5uks5te_i6gaJpZM4Sssgp.

dongahn commented 6 years ago

Particularly assisting priority scheduling by doing a sort on the entire linked-list of jobs every time schedule_jobs is entered

This was the latest addition done by @lipari and @morrone. I'm wondering if there is a option to turn it off.

then traversing the entire hardware tree to clear reservations every time it’s entered as well. There may be others, but those make the performance relatively unfortunate for anything over about 1000 nodes or 1500 jobs in the queue.

FCFS shouldn't do this, though?

trws commented 6 years ago

The problem is that both are done outside of the plugin, so it happens regardless of the actual algorithm being used. I’m looking through it to see how that can be fixed without breaking some assumptions elsewhere. It may be we need another function for pre-schedule-loop setup in scheduler plugins to factor this out.

Sent from VMware Boxer

On March 16, 2018 at 11:53:05 AM PDT, Dong H. Ahn notifications@github.com wrote:

Particularly assisting priority scheduling by doing a sort on the entire linked-list of jobs every time schedule_jobs is entered

This was the latest addition done by @liparihttps://github.com/lipari and @morronehttps://github.com/morrone. I'm wondering if there is a option to turn it off.

then traversing the entire hardware tree to clear reservations every time it’s entered as well. There may be others, but those make the performance relatively unfortunate for anything over about 1000 nodes or 1500 jobs in the queue.

FCFS shouldn't do this, though?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/flux-framework/flux-core/issues/1358#issuecomment-373811388, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAoStf_0-vrHCHBwXN3mNZ6AQKue-8nSks5tfAoCgaJpZM4Sssgp.

dongahn commented 6 years ago

Ah. Maybe sched_loop_setup routine can be modified to do this. Currently, the sched passes jobs info to sched_loop_setup but it can be modified so that the plug-in code can also pass back information on the plug-in's characteristics. (like whether it is reservations capable).

Of course, the setup code should be called first before resrc_tree_release_all_reservations operation though. (Also I haven't looked at what assumption are being made in resrc).

grondo commented 6 years ago

@trws, I'm trying to understand the "proper affinity" requirement. Currently the launcher (wrexecd) doesn't have any concept of whether a node is allocated in whole or in part to a launching job. Unfortunately, the prototype code only works off a count of "cores" assigned to each rank in [lwj-dir].rank.N.cores, and doubly unfortunately the name "cores" for this key is a misnomer as it is really assumed to be the number of tasks assigned to the rank.

There is currently a kludge in place that allows an affinity cpumask to be set in the per-rank directory of a job (at [lwj-dir].rank.N.cpumask)., which could be used to denote the actual cores by logical id that have been assigned on that rank, which in combination with the number of tasks could allow automatic affinity of individual tasks using a defined strategy.

I guess one of the main problems here is that in the wreck prototype we didn't bother separating assigned resources from assigned task counts and now we might have to think of a way around that, since the replacement execution system isn't baked yet.

Let me know what approach you'd like to take, or if you have other ideas, and I'll open a specific issue.

dongahn commented 6 years ago

There is currently a kludge in place that allows an affinity cpumask to be set in the per-rank directory of a job (at [lwj-dir].rank.N.cpumask)., which could be used to denote the actual cores by logical id that have been assigned on that rank, which in combination with the number of tasks could allow automatic affinity of individual tasks using a defined strategy.

@SteVwonder may want to chime in. I believe he modified this mechanism to get the affinity control he needs for the exploration of hierarchical scheduling.

grondo commented 6 years ago

I'm willing to do whatever you guys need in flux submit/wrexecd if it makes things easier. Even if sched just writes list of allocated cores in addition to counts I can make that work for this use case I think.

dongahn commented 6 years ago

@trws will have to weigh in. But if I understood him right, he's trying to coarsen the hardware tree representation for high scheduling scalability such that each socket vertex contains only one core pool (e.g., core[22] as opposed to 22 core vertices) and schedule cores in terms of its count. If this is successful, my guess is affinity may have be resolved at a different level.

grondo commented 6 years ago

Thanks @dongahn! I think I see.

If the scheduler is assigning non-differentiated cores from a pool, then I don't see a way any other part of the system can calculate what the proper affinity will be. The scheduler is the only thing that knows the global view of the system and which cores in a given socket are currently in use.

The best we can do for now is to bind tasks at the finest grain that the scheduler is tracking (sockets in this case), and let the OS distribute processes optimally within that cpumask.

grondo commented 6 years ago

A more tractable issue we can tackle in this timeframe is affinity for tasks on nodes where the job has been assigned the whole node. There are at least two problems that make this currently not work:

assumption that .rank.cores value is also the number of tasks assigned to the current rank. Possibly, there is no current way to assign more than one "core" to a task (oops...)
wrexecd doesn't currently know if it has the node exclusively. This can easily be fixed with a w/a for the current system (flag on rank.N directory, etc.)

In fact maybe 2 alone would work in the short term? wrexecd could still assume that rank.cores is the number of tasks to run, but if rank.exclusive flag was set, then it would distribute those tasks across cores on the current node using something equivalent to hwloc-distrib? (I really hate to continue the rank.cores hack, but also don't want to spend much time "fixing" the wreck code...)

trws commented 6 years ago

For the purpose of the splash app itself, this is not a major worry, since I intend to deal with binding of the GPUs in a script anyway I can do the CPU binding there as well. For ATS, we need to work this out. The two main cases I would like to find a way to support dovetail pretty well with our discussion of the submit command nnodes parameter:

nodes are allocated in full with a given number of tasks, result: each task gets as much independent cache as possible
a certain number of cores is allocated, but the node is not exclusive and as such cores are left available. For now having this be the same as number of tasks is fine, we'll want to loosen that as soon as reasonably possible, but it's fine for now, result could be either of two:
1. same as above, only problem is that then there will be overlapping core-sets as other jobs come into the node
2. bind each task to an individual core, as close as possible to one another without sharing cores (no use of PUs on the same core)

this hwloc function generates the cpuset for a given distribution based on a topology and a number of threads. It makes implementing the spread-out version a lot easier than doing it manually (bad, bad memories...). Doing the other one is a round-robin on available cores since you want the nearest ones anyway.

Does this make some sense?

trws commented 6 years ago

By the way, I am coarsening the hardware topology somewhat, but mainly to add levels we can't use productively, like multiple per-core caches. The main levels are left alone to avoid having to alter sched to handle something different.

grondo commented 6 years ago

Does this make some sense?

Thanks @trws, your notes above really help clarify the requirements. The issue is indeed simpler than I was initially thinking.

Unfortunately, the wreckrun prototype wasn't designed to handle these situations, so solving this for the short term will require some hackery.

The main issue for now is that the scheduler doesn't currently assign individual resources to a job, just a count of cores on each node. If we can tweak that to write individual resources (a list of cores per node), then this work would be fairly trivial. (We can perhaps use hwloc_bind() as suggested if it allows "unavailable" cores to be masked out)

If this is not currently possible, then we could handle case 1 easily by either a flag set by the scheduler that says the node is assigned exclusively, or by assuming the node is assigned exclusively when rank.cores = total cores. I don't think we want to use 3. above, since as you note it will hurt performance by default on shared nodes. Case 4 is only possible if individual resources are assigned by the scheduler, otherwise we would always assign tasks to the same cores since there is no information about other jobs assigned to the node.

So, here's my proposal:

Open issue against sched to write list/cpuset of cores assigned to kvs rank.N directory (will also accept global list of rank,cpuset if that is easier. (just need some way for wrexecd to determine which cores are assigned on its rank. As @garlick pointed out, this would be the first version of R_global or R_local)
Open issue against wrexecd to stop using rank.N.cores as the count of tasks to run on the node, and use global ntasks instead + local cores information to determine which and how many tasks to launch on each rank.
Open issue against wrexecd to set default job cpumask to the assigned cpuset (we could also use an actual cpuset, but that is probably overkill for now), and optionally use hwloc_distrib() within that cpuset to distribute tasks. (I'm assuming if the number of cores in the cpuset is equal to the number of things you are distributing then hwloc_distrib will assign a single core to each task)

Does this sound reasonable at all?

dongahn commented 6 years ago

The main issue for now is that the scheduler doesn't currently assign individual resources to a job, just a count of cores on each node. If we can tweak that to write individual resources (a list of cores per node), then this work would be fairly trivial. (We can perhaps use hwloc_bind() as suggested if it allows "unavailable" cores to be masked out)

If the core vertices are not pooled together in the resource representation, this should be straightforward, i think. From @twrs's comment below, it looks like he doesn't coarsen the core representation.

We will have to see how this scales though: the number of core vertices for Sierra = ~4000 x 44 = 176K.

Ultimately, we need aggressive pruning for resource searching. I have it on resource-query. Maybe we can add pruning by exclusivity on resrc if it's easy.

BTW, LSF supports core isolation which has proven to be needed to get good performance on Sierra. If Splash needs this, we need sched not to schedule tasks on those specialized cores for OS daemons.

It may be the case that if lsf does core isolation through cgroup, hwloc doesn't expose those specialized cores to our hwloc module, in which case we should be good.

grondo commented 6 years ago

It may be the case that if lsf does core isolation through cgroup, hwloc doesn't expose those specialized cores to our hwloc module, in which case we should be good.

Yeah, since the brokers are launched under LSF we should inherit the isolation provided by it.

trws commented 6 years ago

We would, but my understanding is that none of that has been done yet.

SteVwonder commented 6 years ago

W.r.t. the resrc_tree_release_all_reservations discussion in today's meeting, you should be able to comment out this function after making the resrc_tree_reserve in the fcfs plugin a noop.

dongahn commented 6 years ago

Two action items from today's discussion on this topic:

Even if the job needs only one node with FCFS policy, currently resrc_tree_release_all_reservations is called from within the schedule loop which then traverses the entire resource tree -- a big performance issue. But reservations are only needed for backfill schedulers, and we want to find a way to turn this off for FCFS without having to break any assumption. This is a blocker for splash use case. TODO: @dongahn to look at those assumptions. (@SteVwonder above answer this question, though). Ultimately, this needs to go to the scheduler policy plugin.
Head of line blocking at the message queue seems to cause sched to stall when job submission rate is super high. Three potential issues discussed: 1) too many round trips between kvs and sched to get job information through JSC. One optimization would be to piggyback the submitted job information to wreck.state event (@grondo and @dongahn to look into this?); 2) too many synchronous kvs operations within JSC. One optimization would be to use async kvs within jsc; (@garlick to look into this) 3) converting json strings to json objects and son objects back to json strings within JCS. This isn't a blocker for splash use case, but we agreed it would be good to look at soonish.

grondo commented 6 years ago

One optimization would be to piggyback the submitted job information to wreck.state event

Looking back through the job module, I see that currently jobs first enter the "reserved" state before transitioning to "submitted". In addition to including nnodes,ntasks information in the wreck.state.submitted event, we could also skip the "reserved" state for submitted jobs (I don't see how it is necessary) and save 1 event per submitted job, if that is not an issue for sched.

trws commented 6 years ago

We could save more than that really, there’s also a null->null transition event. Sched would be perfectly happy about that, it just falls through a switch statement to implement both currently.

On 22 Mar 2018, at 14:24, Mark Grondona wrote:

One optimization would be to piggyback the submitted job information to wreck.state event

Looking back through the job module, I see that currently jobs first enter the "reserved" state before transitioning to "submitted". In addition to including nnodes,ntasks information in the wreck.state.submitted event, we could also skip the "reserved" state for submitted jobs (I don't see how it is necessary) and save 1 event per submitted job, if that is not an issue for sched.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/flux-framework/flux-core/issues/1358#issuecomment-375462315

dongahn commented 6 years ago

Even if these sched didn't do this, reducing state transitions within sched should be pretty easy. What we need is just a contract between wreck and sched.

grondo commented 6 years ago

For wreck the "reserved" state just means that the KVS directory is reserved for wreck as writer. In the case of flux-submit the state immediately transitions to "submitted" without any further data so let's just say the contract is that "reserved" state should be ignored by other consumers, and "submitted" state is the initial state for a job submitted to the scheduler.

I have no idea what the null->null transition signifies.

garlick commented 6 years ago

libjsc inserts the null->null transition when the "reserved" state event is received from wreck. It calls callbacks once for null->null, and once for null->reserved. @dongahn, what's the rationale here? I'm taking out the null->null transition in my test branch and fixing up the sharness tests to not expect it, but will I break something in sched?

dongahn commented 6 years ago

@garlick: I vaguely remember I needed something like that for various synchronization issues that lead to RFC 9. I have to see if that fixup is needed now we only use event to pass the job state. Let me look.

dongahn commented 6 years ago

@SteVwonder and @trws: I will try to do a quick PR for the resrc_tree_release_all_reservations issue. I propose:

Introduce get_sched_properties () into the sched plugin API as well as struct sched_prop to encode the properties of the scheduler plugins. I can support "backfill capable" as the only property to support at this point.
sched framework queries this property during the initialization and the scheduler loop calls resrc_tree_release_all_reservations only if the loaded plugin is backfill capable.
As @SteVwonder noted, I will make resrc_tree_reserve a NOOP in resource resources API call.

grondo commented 6 years ago

ok, my wreck-experimental branch has the changes discussed here

eliminate 'reserved' state for submitted jobs
include ntasks,nnodes,walltime from job request in the wreck.state.submitted event

dongahn commented 6 years ago

ok, i will test this soon. Just to make sure I understand, wreck.state.reserved events are not emitted any longer with this version?

grondo commented 6 years ago

ok, i will test this soon. Just to make sure I understand, wreck.state.reserved events are not emitted any longer with this version?

They are emitted only with flux wreckrun, that is when we bypass the scheduler.

dongahn commented 6 years ago

Even if the job needs only one node with FCFS policy, currently resrc_tree_release_all_reservations is called from within the schedule loop which then traverses the entire resource tree -- a big performance issue. But reservations are only needed for backfill schedulers, and we want to find a way to turn this off for FCFS without having to break any assumption. This is a blocker for splash use case. TODO: @dongahn to look at those assumptions. (@SteVwonder above answer this question, though). Ultimately, this needs to go to the scheduler policy plugin.

It turned out there is more than making resrc_tree_reserve in the fcfs plugin NOOP. Simply doing the work as https://github.com/flux-framework/flux-core/issues/1358#issuecomment-375483508 schedules jobs out of order!

I will look more closely into this later.

dongahn commented 6 years ago

I suspect that FCFS scheduler actually requires the reservation capability for the general case.

If the first job doesn't find all of the required resources, the scheduler should reserve those partially found resources so that the scheduler can move on to the next job to see if it can be scheduled.

The reason for this out of order behavior is that the next job may only require a different type of resources than the first job and the fact that the first job is not scheduled shouldn't prevent it from being scheduled -- that is, without having to use the resources that the first job can use at the next schedule loop.

I think we can still remobe the release reservarion, if we assume the FCFS scheduler uses queue-depth=1. I will see if I can make this a special case.

Scheduler optimization like this is within the grand scheme of scheduler specialization. So special casing like this should 't be that bad. That is, as far as we can manage the complexity with config files etc later on.

SteVwonder commented 6 years ago

I think we can still remobe the release reservarion, if we assume the FCFS scheduler uses queue-depth=1. I will see if I can make this a special case.

@dongahn, that's a great point. The queue-depth=1 requirement can be worked around by having the first call to reserve_resources set a boolean flag that causes subsequent calls to allocate_resources (within the same sched loop) immediately return. A call to sched_loop_setup could reset the boolean flag so that allocate_resources is no longer a NOOP.

The reason for this out of order behavior is that the next job may only require a different type of resources than the first job and the fact that the first job is not scheduled shouldn't prevent it from being scheduled -- that is, without having to use the resources that the first job can use at the next schedule loop.

To be pedantic, FCFS should immediately stop checking jobs once the first job in the queue fails to be scheduled. FCFS always serves the job that has been waiting the longest (i.e., the job that came first). To have a jobs that isn't first in the queue be scheduled wouldn't be FCFS. That would be back-filling with the reservation depth >= 0. That doesn't need to be addressed here, but we should consider making this the behavior of our FCFS plugin to avoid surprising users in the future.

dongahn commented 6 years ago

To be pedantic, FCFS should immediately stop checking jobs once the first job in the queue fails to be scheduled. FCFS always serves the job that has been waiting the longest (i.e., the job that came first). To have a jobs that isn't first in the queue be scheduled wouldn't be FCFS. That would be back-filling with the reservation depth >= 0. That doesn't need to be addressed here, but we should consider making this the behavior of our FCFS plugin to avoid surprising users in the future.

Good point. However, I think this was discussed @lipari and I think we agrees that this should be the behavior. (I vaguely remember he convinced us that other schedulers implement FCFS this way). I already have a PR for this, could you review? We can revisit this semantics later if needed though.

trws commented 6 years ago

Honestly, I'm fine with out of order for now. You're certainly right that it doesn't fit the model, but for the current push it kinda doesn't matter.

trws commented 6 years ago

Also, true fcfs is, as stephen mentions, always depth one. It's an odd side-effect of the decomposition of sched that fcfs implements a partial backfill at the moment.

dongahn commented 6 years ago

True. We can call the depth one pedandic FCFS and depth > 1 optimized FCFS.

At the end of the day, queuing policy + scheduler parameters will determine the performance of the scheduler and will serve as our knobs to specialize our scheduling tailored to the workload.

At some point we should name policy plugin + parameters for some of the representative workloads like "HTC small jobs" although we should still expose the individual knobs to users as well.

trws commented 6 years ago

Sounds like a good idea to me.

On 23 Mar 2018, at 9:05, Dong H. Ahn wrote:

True. We can call the depth one pedandic FCFS and depth > 1 optimized FCFS.

At the end of the day, queuing policy + scheduler parameters will determine the performance of the scheduler and will serve as our knobs to specialize our scheduling tailored to the workload.

At some point we should name policy plugin + parameters for some of the representative workloads like "HTC small jobs" although we should still expose the individual knobs to users as well.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/flux-framework/flux-core/issues/1358#issuecomment-375715319

dongahn commented 6 years ago

libjsc inserts the null->null transition when the "reserved" state event is received from wreck. It calls callbacks once for null->null, and once for null->reserved. @dongahn, what's the rationale here? I'm taking out the null->null transition in my test branch and fixing up the sharness tests to not expect it, but will I break something in sched?

I looked at the code. It seems I did it this way to work around some odd synchronization problems between kvs update and eventing. Maybe I can find the detail from an issue ticket... Since I need to tune the logic w/ respect to @grondo's augmented submit event change, I will see if I can remove this transition.

dongahn commented 6 years ago

@garlick: okay I found https://github.com/flux-framework/flux-core/pull/205#issuecomment-106537901

Essentially, this was to break a race condition. My guess is now we are using event to emit the state, we would be able to live without this transition...

garlick commented 6 years ago

I'll go ahead and close this, now that we have a project board for splash.

@trws, @grondo, @dongahn, and @SteVwonder may want to review the discussion in this issue quicly, to determine if anything was discussed that didn't get peeled off into its own issue.

flux-framework / flux-core

splash app tracking issue #1358

1356 getting flux wreck purge going so we can garbage collect the wreck directories

1359 cancel jobs that haven't started yet

1361 hwloc system loading non-functional at scale

1355 czmq dependency checking issue with versions 4+

1360 jobspec: change configure logic to opt-out