(sent this in email to @trws but should have posted here for one stop shopping): Re: redirecting sqlite db to a filesystem with plenty of free space:
By default, flux creates its sqlite db in "broker.rundir", which is normally in /tmp. An instance that has a lot of KVS activity might want to redirect it elsewhere:
flux start -o,-Spersist-directory=/new/directory
/new/directory must already exist. A "content" subdirectory will be created within it to contain the database.
one node, four processes, in size with proper affinity
Flux-core (via wreck) supports setting affinities through the `cpumask` key in the KVS [src code], but flux-sched does not currently populate that information when scheduling jobs. I have code in a branch of flux-sched that does create this, but I have only tested the code against ~v0.5.0 of flux-core, and it's not the most polished C code...sorry 😞 .
each process should be near a GPU and have a GPU selected
Does Sierra support binding processes to specific GPUs? I imagine supporting this will require work both in wreck and flux-sched. How important is locality here? Is it just picking the "closest" socket (e.g., GPUs 1-3 are closer to socket1 and GPUs 4-6 are closer to socket2), or is it more complex?
It should be possible to cancel a job in flight if its priority falls below a certain level
Do they need the jobs to be considered/scheduled in order of their priority? Or are all jobs currently in the queue fair game for scheduling?
The affinity stuff will almost certainly be handled by a script for this, just for time reasons. GPU affinity can be set by an environment variable, and closest socket is sufficient.
All jobs in the queue are fair game for scheduling, priority is being handled externally.
Copying here from #1356 -- instructions for running `flux wreck purge` periodically within an instance using `flux cron`:
flux cron event --min-interval=10 --nth=1000 wreck.state.complete flux wreck purge -R -t 1000
FYI -- I already sent @trws a note on how one can optimize the scheduler for HTC workloads. But in case others need to support these types of workloads, `queue-depth` and `delay-sched` should be useful.
https://github.com/flux-framework/flux-sched/pull/190 https://github.com/flux-framework/flux-sched/issues/182 https://github.com/flux-framework/flux-sched/issues/183
https://github.com/flux-framework/flux-sched/pull/191 https://github.com/flux-framework/flux-sched/issues/185
Examples: https://github.com/flux-framework/flux-sched/blob/master/t/t1005-sched-params.t#L45 https://github.com/flux-framework/flux-sched/blob/master/t/t1005-sched-params.t#L67
Also mentioned in the email chain:
In this scenario (i.e., all jobs are equal size), FCFS is a good candidate over backfilling. FCFS avoids the cost of making reservations in the future as well as avoids attempting to schedule jobs in the queue that are provably not able to be scheduled. In addition to loading the `sched.fcfs` plugin, I believe you also need to set `queue-depth` to 1.
I believe you also need to set queue-depth to 1.
Indeed! I was surprised when @lipari showed cases where FCFS actually does an out-of-order schedule. But in the case of splash, their job sizes are all the same so FCFS with `queue-depth` being 1 should be the cheapest.
I’ll definitely try that, a lower queue depth helped a little bit, but it looks like we had some non-scalable things in the sched loop too. Particularly assisting priority scheduling by doing a sort on the entire linked-list of jobs every time schedule_jobs is entered, and then traversing the entire hardware tree to clear reservations every time it’s entered as well. There may be others, but those make the performance relatively unfortunate for anything over about 1000 nodes or 1500 jobs in the queue.
Particularly assisting priority scheduling by doing a sort on the entire linked-list of jobs every time schedule_jobs is entered
This was the latest addition done by @lipari and @morrone. I'm wondering if there is an option to turn it off.
then traversing the entire hardware tree to clear reservations every time it’s entered as well. There may be others, but those make the performance relatively unfortunate for anything over about 1000 nodes or 1500 jobs in the queue.
FCFS shouldn't do this, though?
The problem is that both are done outside of the plugin, so it happens regardless of the actual algorithm being used. I’m looking through it to see how that can be fixed without breaking some assumptions elsewhere. It may be we need another function for pre-schedule-loop setup in scheduler plugins to factor this out.
Ah. Maybe the `sched_loop_setup` routine can be modified to do this. Currently, the sched passes jobs info to `sched_loop_setup`, but it can be modified so that the plug-in code can also pass back information on the plug-in's characteristics (like whether it is reservations capable).
Of course, the setup code should be called first, before the `resrc_tree_release_all_reservations` operation. (Also I haven't looked at what assumptions are being made in `resrc`.)
@trws, I'm trying to understand the "proper affinity" requirement. Currently the launcher (`wrexecd`) doesn't have any concept of whether a node is allocated in whole or in part to a launching job. Unfortunately, the prototype code only works off a count of "cores" assigned to each rank in `[lwj-dir].rank.N.cores`, and doubly unfortunately the name "cores" for this key is a misnomer as it is really assumed to be the number of tasks assigned to the rank.
There is currently a kludge in place that allows an affinity cpumask to be set in the per-rank directory of a job (at `[lwj-dir].rank.N.cpumask`), which could be used to denote the actual cores by logical id that have been assigned on that rank, which in combination with the number of tasks could allow automatic affinity of individual tasks using a defined strategy.
I guess one of the main problems here is that in the wreck prototype we didn't bother separating assigned resources from assigned task counts, and now we might have to think of a way around that, since the replacement execution system isn't baked yet.
Let me know what approach you'd like to take, or if you have other ideas, and I'll open a specific issue.
There is currently a kludge in place that allows an affinity cpumask to be set in the per-rank directory of a job (at [lwj-dir].rank.N.cpumask), which could be used to denote the actual cores by logical id that have been assigned on that rank, which in combination with the number of tasks could allow automatic affinity of individual tasks using a defined strategy.
@SteVwonder may want to chime in. I believe he modified this mechanism to get the affinity control he needs for the exploration of hierarchical scheduling.
I'm willing to do whatever you guys need in flux submit/`wrexecd` if it makes things easier. Even if sched just writes a list of allocated cores in addition to counts, I can make that work for this use case I think.
@trws will have to weigh in. But if I understood him right, he's trying to coarsen the hardware tree representation for high scheduling scalability such that each socket vertex contains only one core pool (e.g., core[22] as opposed to 22 core vertices) and schedule cores in terms of their count. If this is successful, my guess is affinity may have to be resolved at a different level.
Thanks @dongahn! I think I see.
If the scheduler is assigning non-differentiated cores from a pool, then I don't see a way any other part of the system can calculate what the proper affinity will be. The scheduler is the only thing that knows the global view of the system and which cores in a given socket are currently in use.
The best we can do for now is to bind tasks at the finest grain that the scheduler is tracking (sockets in this case), and let the OS distribute processes optimally within that cpumask.
A more tractable issue we can tackle in this timeframe is affinity for tasks on nodes where the job has been assigned the whole node. There are at least two problems that make this currently not work:

1. The `.rank.cores` value is also the number of tasks assigned to the current rank.
2. Possibly, there is no current way to assign more than one "core" to a task (oops...)

In fact maybe 2 alone would work in the short term? wrexecd could still assume that `rank.cores` is the number of tasks to run, but if a `rank.exclusive` flag was set, then it would distribute those tasks across cores on the current node using something equivalent to `hwloc-distrib`? (I really hate to continue the `rank.cores` hack, but also don't want to spend much time "fixing" the wreck code...)
For the purpose of the splash app itself, this is not a major worry, since I intend to deal with binding of the GPUs in a script anyway I can do the CPU binding there as well. For ATS, we need to work this out. The two main cases I would like to find a way to support dovetail pretty well with our discussion of the submit command nnodes parameter:
this hwloc function generates the cpuset for a given distribution based on a topology and a number of threads. It makes implementing the spread-out version a lot easier than doing it manually (bad, bad memories...). Doing the other one is a round-robin on available cores since you want the nearest ones anyway.
Does this make some sense?
By the way, I am coarsening the hardware topology somewhat, but mainly to drop levels we can't use productively, like multiple per-core caches. The main levels are left alone to avoid having to alter sched to handle something different.
Thanks @trws, your notes above really help clarify the requirements. The issue is indeed simpler than I was initially thinking.
Unfortunately, the wreckrun prototype wasn't designed to handle these situations, so solving this for the short term will require some hackery.
The main issue for now is that the scheduler doesn't currently assign individual resources to a job, just a count of cores on each node. If we can tweak that to write individual resources (a list of cores per node), then this work would be fairly trivial. (We can perhaps use `hwloc_bind()` as suggested if it allows "unavailable" cores to be masked out.)
If this is not currently possible, then we could handle case 1 easily by either a flag set by the scheduler that says the node is assigned exclusively, or by assuming the node is assigned exclusively when `rank.cores` = total cores. I don't think we want to use 3. above, since as you note it will hurt performance by default on shared nodes. Case 4 is only possible if individual resources are assigned by the scheduler; otherwise we would always assign tasks to the same cores since there is no information about other jobs assigned to the node.
So, here's my proposal:

1. Scheduler writes a cpuset for the assigned cores in the per-rank `rank.N` directory (will also accept a global list of rank,cpuset if that is easier; just need some way for wrexecd to determine which cores are assigned on its rank. As @garlick pointed out, this would be the first version of Rglobal or Rlocal).
2. Stop using `rank.N.cores` as the count of tasks to run on the node, and use global ntasks instead + local cores information to determine which and how many tasks to launch on each rank.
3. Use `hwloc_distrib()` within that cpuset to distribute tasks. (I'm assuming if the number of cores in the cpuset is equal to the number of things you are distributing then `hwloc_distrib` will assign a single core to each task.)

Does this sound reasonable at all?
The main issue for now is that the scheduler doesn't currently assign individual resources to a job, just a count of cores on each node. If we can tweak that to write individual resources (a list of cores per node), then this work would be fairly trivial. (We can perhaps use hwloc_bind() as suggested if it allows "unavailable" cores to be masked out)
If the core vertices are not pooled together in the resource representation, this should be straightforward, I think. From @trws's comment below, it looks like he doesn't coarsen the core representation.
We will have to see how this scales though: the number of core vertices for Sierra = ~4000 x 44 = 176K.
Ultimately, we need aggressive pruning for resource searching. I have it on resource-query. Maybe we can add pruning by exclusivity on resrc if it's easy.
BTW, LSF supports core isolation, which has proven to be needed to get good performance on Sierra. If Splash needs this, we need sched not to schedule tasks on those cores specialized for OS daemons.
It may be the case that if lsf does core isolation through cgroup, hwloc doesn't expose those specialized cores to our hwloc module, in which case we should be good.
It may be the case that if lsf does core isolation through cgroup, hwloc doesn't expose those specialized cores to our hwloc module, in which case we should be good.
Yeah, since the brokers are launched under LSF we should inherit the isolation provided by it.
We would, but my understanding is that none of that has been done yet.
W.r.t. the `resrc_tree_release_all_reservations` discussion in today's meeting, you should be able to comment out this function after making the `resrc_tree_reserve` in the fcfs plugin a noop.
Two action items from today's discussion on this topic:
- Even if the job needs only one node with FCFS policy, currently `resrc_tree_release_all_reservations` is called from within the schedule loop, which then traverses the entire resource tree -- a big performance issue. But reservations are only needed for backfill schedulers, and we want to find a way to turn this off for FCFS without breaking any assumptions. This is a blocker for the splash use case. TODO: @dongahn to look at those assumptions. (@SteVwonder above answered this question, though). Ultimately, this needs to go to the scheduler policy plugin.
- Head-of-line blocking at the message queue seems to cause sched to stall when the job submission rate is super high. Three potential issues discussed: 1) too many round trips between kvs and sched to get job information through JSC. One optimization would be to piggyback the submitted job information on the `wreck.state` event (@grondo and @dongahn to look into this?); 2) too many synchronous kvs operations within JSC. One optimization would be to use async kvs within jsc (@garlick to look into this); 3) converting json strings to json objects and json objects back to json strings within JSC. This isn't a blocker for the splash use case, but we agreed it would be good to look at soonish.
One optimization would be to piggyback the submitted job information to wreck.state event
Looking back through the job module, I see that currently jobs first enter the "reserved" state before transitioning to "submitted". In addition to including `nnodes`, `ntasks` information in the `wreck.state.submitted` event, we could also skip the "reserved" state for submitted jobs (I don't see how it is necessary) and save 1 event per submitted job, if that is not an issue for sched.
We could save more than that really, there’s also a null->null transition event. Sched would be perfectly happy about that, it just falls through a switch statement to implement both currently.
Even if sched didn't do this, reducing state transitions within sched should be pretty easy. What we need is just a contract between wreck and sched.
For wreck the "reserved" state just means that the KVS directory is reserved for wreck as writer. In the case of `flux-submit` the state immediately transitions to "submitted" without any further data, so let's just say the contract is that the "reserved" state should be ignored by other consumers, and "submitted" state is the initial state for a job submitted to the scheduler.
I have no idea what the null->null transition signifies.
libjsc inserts the null->null transition when the "reserved" state event is received from wreck. It calls callbacks once for null->null, and once for null->reserved. @dongahn, what's the rationale here? I'm taking out the null->null transition in my test branch and fixing up the sharness tests to not expect it, but will I break something in sched?
@garlick: I vaguely remember I needed something like that for various synchronization issues that lead to RFC 9. I have to see if that fixup is needed now we only use event to pass the job state. Let me look.
@SteVwonder and @trws: I will try to do a quick PR for the `resrc_tree_release_all_reservations` issue. I propose:

1. Add `get_sched_properties ()` into the sched plugin API as well as `struct sched_prop` to encode the properties of the scheduler plugins. I can support "backfill capable" as the only property to support at this point.
2. Call `resrc_tree_release_all_reservations` only if the loaded plugin is backfill capable.
3. Make `resrc_tree_reserve` a NOOP in the resources API call.

ok, my wreck-experimental branch has the changes discussed here, including the `wreck.state.submitted` event change.

ok, i will test this soon. Just to make sure I understand, `wreck.state.reserved` events are not emitted any longer with this version?
ok, i will test this soon. Just to make sure I understand, wreck.state.reserved events are not emitted any longer with this version?
They are emitted only with `flux wreckrun`, that is, when we bypass the scheduler.
- Even if the job needs only one node with FCFS policy, currently resrc_tree_release_all_reservations is called from within the schedule loop which then traverses the entire resource tree -- a big performance issue. But reservations are only needed for backfill schedulers, and we want to find a way to turn this off for FCFS without having to break any assumption. This is a blocker for splash use case. TODO: @dongahn to look at those assumptions. (@SteVwonder above answer this question, though). Ultimately, this needs to go to the scheduler policy plugin.
It turned out there is more to this than making `resrc_tree_reserve` in the fcfs plugin a NOOP. Simply doing it as in https://github.com/flux-framework/flux-core/issues/1358#issuecomment-375483508 schedules jobs out of order!
I will look more closely into this later.
I suspect that the FCFS scheduler actually requires the reservation capability for the general case.
If the first job doesn't find all of the required resources, the scheduler should reserve those partially found resources so that the scheduler can move on to the next job to see if it can be scheduled.
The reason for this out-of-order behavior is that the next job may require only a different type of resources than the first job, and the fact that the first job is not scheduled shouldn't prevent the next one from being scheduled -- that is, without having to use the resources that the first job can use at the next schedule loop.
I think we can still remove the release reservation, if we assume the FCFS scheduler uses `queue-depth=1`. I will see if I can make this a special case.
Scheduler optimization like this is within the grand scheme of scheduler specialization. So special casing like this shouldn't be that bad. That is, as far as we can manage the complexity with config files etc. later on.
I think we can still remove the release reservation, if we assume the FCFS scheduler uses queue-depth=1. I will see if I can make this a special case.
@dongahn, that's a great point. The `queue-depth=1` requirement can be worked around by having the first call to `reserve_resources` set a boolean flag that causes subsequent calls to `allocate_resources` (within the same sched loop) to immediately return. A call to `sched_loop_setup` could reset the boolean flag so that `allocate_resources` is no longer a NOOP.
The reason for this out of order behavior is that the next job may only require a different type of resources than the first job and the fact that the first job is not scheduled shouldn't prevent it from being scheduled -- that is, without having to use the resources that the first job can use at the next schedule loop.
To be pedantic, FCFS should immediately stop checking jobs once the first job in the queue fails to be scheduled. FCFS always serves the job that has been waiting the longest (i.e., the job that came first). To have a job that isn't first in the queue be scheduled wouldn't be FCFS. That would be back-filling with a reservation depth >= 0. That doesn't need to be addressed here, but we should consider making this the behavior of our FCFS plugin to avoid surprising users in the future.
To be pedantic, FCFS should immediately stop checking jobs once the first job in the queue fails to be scheduled. FCFS always serves the job that has been waiting the longest (i.e., the job that came first). To have a job that isn't first in the queue be scheduled wouldn't be FCFS. That would be back-filling with a reservation depth >= 0. That doesn't need to be addressed here, but we should consider making this the behavior of our FCFS plugin to avoid surprising users in the future.
Good point. However, I think this was discussed with @lipari and I think we agreed that this should be the behavior. (I vaguely remember he convinced us that other schedulers implement FCFS this way.) I already have a PR for this, could you review? We can revisit these semantics later if needed though.
Honestly, I'm fine with out of order for now. You're certainly right that it doesn't fit the model, but for the current push it kinda doesn't matter.
Also, true fcfs is, as stephen mentions, always depth one. It's an odd side-effect of the decomposition of sched that fcfs implements a partial backfill at the moment.
True. We can call the depth-one version pedantic FCFS and depth > 1 optimized FCFS.
At the end of the day, queuing policy + scheduler parameters will determine the performance of the scheduler and will serve as our knobs to specialize our scheduling tailored to the workload.
At some point we should name policy plugin + parameters for some of the representative workloads like "HTC small jobs" although we should still expose the individual knobs to users as well.
Sounds like a good idea to me.
libjsc inserts the null->null transition when the "reserved" state event is received from wreck. It calls callbacks once for null->null, and once for null->reserved. @dongahn, what's the rationale here? I'm taking out the null->null transition in my test branch and fixing up the sharness tests to not expect it, but will I break something in sched?
I looked at the code. It seems I did it this way to work around some odd synchronization problems between kvs update and eventing. Maybe I can find the details in an issue ticket... Since I need to tune the logic w.r.t. @grondo's augmented submit event change, I will see if I can remove this transition.
@garlick: okay I found https://github.com/flux-framework/flux-core/pull/205#issuecomment-106537901
Essentially, this was to break a race condition. My guess is that now that we are using events to emit the state, we should be able to live without this transition...
I'll go ahead and close this, now that we have a project board for splash.
@trws, @grondo, @dongahn, and @SteVwonder may want to review the discussion in this issue quickly, to determine if anything was discussed that didn't get peeled off into its own issue.
This is to help tie together information for the splash app for sierra.
General problem: Run between 1 and 2 million jobs, each one node and four processes in size, with proper affinity (each process should be near a GPU and have a GPU selected), over the course of a day across four thousand nodes. It should be possible to cancel a job in flight if its priority falls below a certain level; this logic doesn't have to be in flux, but the cancel mechanism needs to be available. In order to deal with the large number of jobs, we need to be able to handle long queues and fast submission simultaneously, both at startup and across the large number of jobs over the course of a full run; purging old jobs is an acceptable solution even if it loses job information for completed jobs.
Currently the filed issues are:
high priority:
#1356 getting `flux wreck purge` going so we can garbage collect the wreck directories
#1359 cancel jobs that haven't started yet
#1361 hwloc system loading non-functional at scale
Found in this effort, non-blocking:
#1355 czmq dependency checking issue with versions 4+
#1360 jobspec: change configure logic to opt-out
Tagging for notification: @grondo, @garlick, @SteVwonder, @dongahn