flux-framework / flux-core

core services for the Flux resource management framework

Job-shape specification #354

Closed trws closed 7 years ago

trws commented 8 years ago

This ties back to the job submission API #268, shell based exec #334 and RFC 8.

We have a concept that I'm finding I really like, that being the task slot, but we haven't specced out how users will specify the shape, number and distribution of task slots, nor how their tasks will map to the aforementioned task slots. This issue is for discussion on the matter and an initial implementation.

At least for me, I'm finding this a much more digestible concept than the classic node/task count used in other places. Enough so that I propose we build everything around number of tasks, a default task slot shape, and specializations on individual task slots or task mappings as necessary. Reason being, it's completely unambiguous what every task gets, and the scheduler/system can place them as needed, as opposed to having to determine how to jam tasks into nodes without explicit context. This should simplify containerization etc. and we can provide the other interface style on top of it easily enough by defining a mapping.

This is what I'm targeting now:

Components:

I'm going to put together a prototype interface in capacitor for this today so we can see how we like it, but I'd like to hear everyone's thoughts on the matter. @dongahn, @garlick, @grondo, @lipari

For a quick summary of the discussion that follows, jump to:

https://github.com/flux-framework/flux-core/issues/354#issuecomment-225686820

grondo commented 8 years ago

I like this! It actually isn't far off (terminology notwithstanding) the design we've always been talking about.

(Sorry about the persistence of node/core-centric launch in the prototype! This was only done to get the prototype launch going quickly)

Just a few questions/comments to get more discussion here:

lipari commented 8 years ago

@grondo beat me to the post. I had a response in progress mentioning pretty much everything he did: flexible task mapping, containment, multiple tasks per slot. The slot concept could be stretched to cover everything from a flux instance request for multiple resources, to the sub-instances where the resource request specifies multiple tasks mapped to more granular resources.

trws commented 8 years ago

Thanks for the comments!

  1. This does make it very general, and it brings up a question. I really prefer just saying that there is one task per task slot, but I find the idea of having multiple task slots share the same physical resource a problem. The reason I want to allow multiple tasks in a slot is so that we can assert that all task slots contain mutually exclusive sets of hardware resources. This is really, really useful for writing binding logic, but it might be easier to deal with some of the logic if it's one task per slot.
  2. Shape is probably not necessary, it's just the term I landed on for referring to the description of a task slot's resource requirements, as one would include in a job spec.
  3. Actually yes. A traditional RM faced with that request packs T tasks onto N nodes with a given assignment order, and I expect we'll want to offer an easy way to ask for that. That said, I would prefer that users request T tasks and specify exactly what each task should be run on. For example, task-shape = {'node' : {'num' = 1}} for one node per task, task-shape = {'socket':{'num' = 1}} for a single socket, or task-shape = {'socket' : {'num' = 1, 'has' = {'core' = {'$gt' = 2}}}} for a socket with more than two cores. Given just N and T, we can't really do any better than what others do, but given what the user wants for the general case, and maybe for certain individual slots, we can be a whole lot smarter about it.
  4. I completely agree with this, and realize I didn't really say so in the original post. It should be an extensible mechanism for sure, but I think the ones I listed should be available as part of the flux distribution, and would argue for the relatively complex but very nice for affinity/binding "distribute" policy to be the default, short of finding something we like better.
  5. I agree, but since affinity/binding is so important to job performance, I think it's worth saying that these are at least logically "containers" that tasks should not assume they can expand out of.
grondo commented 8 years ago

For affinity binding, we should probably talk to Edgar Leon. He has been studying default affinity bindings for the past few years and has a script that implements known good policies for binding based on job parameters, number of sockets/cores, existence of hyperthreads, etc. This policy will replace our current default bindings implemented by the slurm auto-affinity plugin in the next year or so.

I think we'd always thought that this kind of per-task binding plugin, as used in SLURM, would be promoted to a "layout" or, to use your term, "shaping" plugin in flux. The shaping should be done before launch, and the shell will read the shaping information (which is really just the specific resources assigned to each task) directly from the kvs or other information store, and act on it directly.

Once we nail this down a bit we should consult with Edgar and ensure his mapping rules could be adopted into the set of job/program shaping plugins/methods for flux.

grondo commented 8 years ago

This does make it very general, and it brings up a question. I really prefer just saying that there is one task per task slot, but I find the idea of having multiple task slots share the same physical resource a problem. The reason I want to allow multiple tasks in a slot is so that we can assert that all task slots contain mutually exclusive sets of hardware resources. This is really, really useful for writing binding logic, but it might be easier to deal with some of the logic if it's one task per slot.

I think I see your point about task slots on mutually exclusive hardware, but aren't you then essentially asserting that all task slots are nodes? And in the case where arbitrary node parent resources are in the hierarchy, can you share those resources on a task slot (e.g. switch resources)? (i.e. where do you draw the line on what is a shared resource)

Reading back your response, I might still be a little confused on whether you want to bind multiple tasks to resources within a task slot (what, then, do you call the resources to which a task is bound within a task slot), or if multiple tasks are bound to a task slot (this seems to be a given) or a bit of both?

trws commented 8 years ago

I've actually talked to him about this some already, originally snagging him to discuss it orthogonally to flux because thread affinity was the focus of my dissertation work starting from 2007, and I thought we'd have some collaboration opportunities. More recently we discussed getting his "mpibind2" script up to date using the OpenMP 4.0 hooks, since without them it only works for GNU and Intel OpenMP thread binding.

Based on those discussions, everything he is currently doing and what he mentioned wanting to do with slurm could be supported on this without issue. Still, keeping in touch with him as both his work and flux evolve is definitely a good idea.


dongahn commented 8 years ago

I like it! I do think this can be viewed as a generalization of the OMP4 place concept, where a place can be specialized with a different number of cores and thread binding can be controlled through different policies. In a sense this raises the concept of a place to more than a set of cores.

Now, did you think about ways to specify your task shape in a nested fashion as well? For example, if a user wants to have 4 task slots, each mapping to a last-level IB switch, but still wants to bind tasks to only a set number of nodes under that switch, it seems a nested specification is necessary. It looks like this concept is general enough to support more advanced task mappings, but I thought I would just ask.

What about the case where users just want a best-effort mapping, specifying "minimize the number of lower-level switches used as best as you can"? Did you think about whether your components will allow them to express that?

Sorry for those far-fetched advanced specification questions... compared to what you want to accomplish in your short-term MS. But I think this concept is general enough to provide a good basis.

grondo commented 8 years ago

We'd always talked about using a composite resource specification to match the composite RDL design. Thus for this case a task slot "shape" could contain a list of sub-shapes that also must match the resource set being matched. This should allow pretty generic hierarchical resource specification, and should allow a recursive resource match algorithm where necessary, e.g.

This is a summary of an idea we talked about awhile ago:

 {  node = 4 }

might match the first 4 nodes found anywhere, whereas wrapping in a request for a single switch could constrain this request to 1 switch:

 { switch = 1,
   { node = 4 }
  }

The resource match algorithm could simply look first at the outer composite and search for all resource composites that satisfy exactly one switch. Then, as a second step, search the result for resource composites that have 4 nodes. This could be extended to return 1 socket for each node:

 { switch = 1,
   { node = 4,
     { socket = 1, { core = 1 } },
     { socket = 1, { core = 1 } },
     { socket = 1, { core = 1 } },
     { socket = 1, { core = 1} }
   }
 }

Shorthand for lists like above could be developed later.
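
To make the two-step search concrete, here is a rough Python sketch (purely illustrative, not the flux-sched implementation; the dict layout for resources and requests is made up):

# Sketch: match a nested composite request such as
#   {"type": "switch", "count": 1, "children": [{"type": "node", "count": 4}]}
# against a resource tree of dicts {"type": ..., "children": [...]}.

def find_matches(resource, request):
    """Return resources of the requested type in this subtree whose own
    subtrees can also satisfy every child request."""
    matches = []
    if resource["type"] == request["type"] and all(
            len(find_matches(resource, child)) >= child.get("count", 1)
            for child in request.get("children", [])):
        matches.append(resource)
    for child in resource.get("children", []):
        matches.extend(find_matches(child, request))
    return matches

def satisfy(resource_tree, request):
    """First find composites satisfying the outer request, then keep only
    as many as were asked for (each of which already satisfies the inner
    requests)."""
    matches = find_matches(resource_tree, request)
    count = request.get("count", 1)
    return matches[:count] if len(matches) >= count else None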

Disjoint resources like licenses don't fit into the "task slot" idea, which worries me a bit. I guess they could just be appended to the list of task slots, but that feels like the model doesn't quite fit. If using the pseudo-specification as above, we could just append to the generic specification composite list, e.g.

 { cluster = 1, 
   ...
 },
 { license = 1 }

Any other disjoint and generic resource could possibly be specified in this manner.

This idea was discussed in detail early on in the project, but it still may apply.

lipari commented 8 years ago

FYI... The tree search code in flux-sched implements pretty much what @grondo just summarized. It's still under development and subject to improvement of course, but it follows the design we discussed early on.

grondo commented 8 years ago

Also, I'm wondering how the task-slot list could work for shared resources like GPUs (this is actually similar to the license problem, but a GPU is an in-tree resource instead of a disjoint one)? @trws, I have a feeling you have a plan for this as well and am hoping you could elucidate. Thanks!

trws commented 8 years ago

As @grondo points out, this issue really isn't about discussing a new model, just shaping the design plan flux has already had toward a concrete model. The discussion brings up a couple of interesting points about how that model works. @grondo's examples are what's been planned since before I got here (and I apologize if I made it seem like all of this was coming just from me, that was in no way intended), and will definitely work, but I'm realizing that there is a bit of a disconnect between that hierarchical specification and my internal notion of a task-slot <-> task mapping where nested resources and multiple tasks are concerned. As long as there is a way to specify multiple tasks mapped to each task slot, this is not necessarily a problem, but it does complicate things, and I would be very interested to hear what your thoughts are on how that mapping should work.

I had been thinking that we would specify a number of task-slots, and simply map one task to each task-slot. That would be quite straightforward, and had significant appeal, but based on your questions here and @lipari's points on the flux-sched issue yesterday, I'm coming around to the idea that it isn't sufficient.

If we specify the number and requirements or shape or what-have you for each task slot, these can be used by the scheduler to allocate resources. Then a separate number of tasks can be distributed across those task slots based on a separate, or possibly merged with markup, specification processed by the remote exec system and/or the execution shell that has been discussed recently. That plan had been that the second specification would be relatively basic, tantamount to the distribution policies listed in the original post or their ilk, but I'm not sure anymore that that's going to cut it. The options I see for this are either to make the assignment logic more potent, such that it can express arbitrary mappings across sub-slots potentially along with soft containment, or to allow task slots themselves to share resources. We may need to do the former anyway, but the latter more naturally pulls in concepts like licenses.
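
Block and cyclic distribution are typical examples of what such a second-stage specification could select; here is a rough Python sketch of the idea (illustrative only, not a proposed interface):

# Sketch: distribute ntasks task ranks across nslots task slots using a
# simple block or cyclic policy.

def distribute(ntasks, nslots, policy="block"):
    slots = [[] for _ in range(nslots)]
    if policy == "cyclic":
        for rank in range(ntasks):
            slots[rank % nslots].append(rank)
    else:  # block: fill each slot with a contiguous range of ranks
        base, extra = divmod(ntasks, nslots)
        rank = 0
        for i in range(nslots):
            count = base + (1 if i < extra else 0)
            slots[i].extend(range(rank, rank + count))
            rank += count
    return slots

# distribute(8, 2, "block")  -> [[0, 1, 2, 3], [4, 5, 6, 7]]
# distribute(8, 2, "cyclic") -> [[0, 2, 4, 6], [1, 3, 5, 7]]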

We've discussed before that at some level the whole system is a graph, rather than a strict tree, and we may be able to exploit that if it doesn't make things too complex. For the example of a license, combined with say a requirement for four task slots with one node each containing two sockets with one GPU and four cores, we could use the specification as we have had it, and I think it would look about like this:

 { "task-slots" : {
   "count" : 4,
   "resources" : [
   {"$each" :  { 
     "type" : "node",
     "count" : 1,
     "children" : [{ "type" : "socket",
                        "count" : 2, 
                        "children" : [{ "type" : "core", "count" : 4},
                                          { "type" : "GPU", "count" : 1}]}]
   }
   },
   { "type" : "license",
     "count" : 1 }]
   }
 }

This expresses basically what we want, I think, but it doesn't tie the license to the slots in any way, nor does it naturally map to having anything shared between different task slots. We can do that with side information, or we can try something a bit different. What if we express it in terms of graph links? The syntax below is meant to be illustrative, not a proposal; it could look like JSON or whatever else.

{task_slot, count=4}-[uses]->{licence, count=1}
                |
         [contains]
                V
{node, count=1} -[contains]->{socket, count=1}-[contains]->{GPU, count=1}
                                                           |
                                                   [contains]
                                                         V
                                                   {core, count=4}

If we can actually leverage the networks, or link types as I prefer to think of them, we could have different kinds of ties to a slot, where some may be sharable and others not. The switch example could be another "uses" type, or possibly "consumes" with information on how much bandwidth is used. Most things would just be "has" or "contains," asserting ownership over the given resource, such that the above would allocate the entire node to each task slot; but if the first link were "uses" instead, the node would be considered only partially allocated, while the socket, GPU and cores would be allocated.
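
As a purely illustrative sketch of the link-type idea (the field names and link types here are made up, not a proposed format), a consumer could decide from the link type whether a slot owns its target outright or merely uses it:

# Sketch: a slot's relationships to resources as typed links; "has"/"contains"
# links own the target exclusively, while "uses" links leave it shareable.

OWNING_LINK_TYPES = {"has", "contains"}

slot_links = [
    {"type": "contains", "target": "node:42"},     # fully allocated to the slot
    {"type": "uses",     "target": "license:1"},   # shared with other slots
    {"type": "uses",     "target": "switch:3"},    # e.g. consumes some bandwidth
]

def owned(links):
    """Targets the slot owns outright (marked fully allocated)."""
    return [l["target"] for l in links if l["type"] in OWNING_LINK_TYPES]

def shared(links):
    """Targets the slot only uses; they remain allocatable to others."""
    return [l["target"] for l in links if l["type"] not in OWNING_LINK_TYPES]

# owned(slot_links)  -> ['node:42']
# shared(slot_links) -> ['license:1', 'switch:3']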

I have no idea if this made any sense, and I apologize for the long post, but it took me a lot more text than I expected to get this down...

@grondo, as to GPUs, there are a few points there we need to look at. It's definitely going to need to be handled in terms of the binding/affinity work; we'll also need to manage MPS for CUDA systems as well as masking off GPUs not allocated to a given job and ensuring that GPUs allocated to a given job have the correct exclusivity masks. As to specifying them, they can be treated as part of a hierarchical resource, nested under the slot level. I would lean toward requiring a user to allocate one full CPU socket, or at least a memory node if a socket has multiple memory nodes and we have sufficient topological information, to get the associated GPU, but if they've done that we can treat it like a core. My bias is to treat GPUs as exclusive resources rather than shared ones, because they cannot do transparent time-sharing between processes without going through a common MPS, which is an unacceptable security concern. I'm not sure if I've hit your question quite right; if not, I'll try to be more precise where I can. This is also a place to collaborate with Edgar, by the way; I have it on my list to chat with him about getting some of this utterly missing, and utterly necessary, functionality into the SLURM plugin he's working on as well. We'll be hurting, very badly, if we don't have good GPU management when the GPU clusters start to open up to the lab. Two processes hitting one GPU can take 3x as long as running them sequentially... not fun.

dongahn commented 8 years ago

I have a customer meeting and a group meeting back to back, but while we are on this subject, I'd love to flesh out some details for other, more advanced job specifications -- in particular for @SteVwonder's IO-aware scheduling and @surajpkn's dynamic scheduling, going forward. This may belong in a separate issue, in which case we can move the discussion.

Specification for constrained scheduling

E.g., say we want the task slot to contain IO bandwidth as well (assuming for a moment there is containment logic for this).

task-shape = {'node' : {'num' = 1, 'has' = {'io-bw' = {'$gt' = 30MB}}}}

Will that do? With this specification, the IO-aware scheduler will have to walk the IO hierarchy to make sure the IO bandwidth constraint can be met at each level in the IO switch hierarchy, but this isn't too different from doing a similar constraint check on the default hierarchy...
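
A rough sketch of that walk (field names here are invented for illustration): check the requested bandwidth against the remaining capacity at each level from the node up through the IO switch hierarchy.

# Sketch: verify an io-bw request fits at every level of the IO hierarchy,
# walking parent links from a leaf (node) up to the root switch.

def io_bw_available(leaf, requested_mb_s):
    level = leaf
    while level is not None:
        if level["io_bw_free"] < requested_mb_s:
            return False
        level = level.get("parent")
    return True

node = {"io_bw_free": 40,
        "parent": {"io_bw_free": 35,                       # leaf IO switch
                   "parent": {"io_bw_free": 100,           # core IO switch
                              "parent": None}}}

# io_bw_available(node, 30) -> True: 30 MB/s fits at every level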

Specification for dynamic scheduling

In this case, a user somehow will specify min:max:other-count-constraint (square | cube | ..) and have either aggregated wall-clock time or power as the main match criterion.

task-shape = {'node' : {'num' = '1:10:none'}, 'agg_wclock' : '20h'}

task-shape = {'node' : {'num' = '1:10:none'}, 'agg_power_bound' : '200kw'}

Will that do? Are these good enough specifications that a dynamic scheduler or power-aware scheduler can find a match?

How should one specify a job as malleable...

grondo commented 8 years ago

I think this is great discussion.

If we specify the number and requirements or shape or what-have you for each task slot, these can be used by the scheduler to allocate resources. Then a separate number of tasks can be distributed across those task slots based on a separate, or possibly merged with markup, specification processed by the remote exec system and/or the execution shell that has been discussed recently. That plan had been that the second specification would be relatively basic, tantamount to the distribution policies listed in the original post or their ilk, but I'm not sure anymore that that's going to cut it. The options I see for this are either to make the assignment logic more potent, such that it can express arbitrary mappings across sub-slots potentially along with soft containment, or to allow task slots themselves to share resources. We may need to do the former anyway, but the latter more naturally pulls in concepts like licenses.

I totally agree with this paragraph. Though I would say that the original plan did not have basic task distribution policies, but stated that mapping functions could be registered with an instance, basically in a hand-waving fashion allowing arbitrary task distribution among resources in the instance. This function could take task-slot shape and tasks as you've stated above, or could be a more simply applied task layout function provided by a user and keyed off job name.

The thing I can't quite get my head wrapped around is that when requesting resources, it sounds like users would always have to map the request onto task-slot and task semantics, whereas this doesn't quite make sense to me when you are requesting resources to run an instance, a diverse workload, or some other request that isn't at the leaf of a hierarchical request.

I suppose you could always request 1 task slot for program-as-instance and have it be any kind of complicated request you want. Then you can map the resulting allocation to a program request that splits the single task slot into 1 task slot per node, each of which gets a broker.

surajpkn commented 8 years ago

@dongahn, by aggregated walltime, do you mean the sum of the walltimes of min and max? That would not be useful for scheduling. Right now I don't have an extraordinary idea, but these are the general requirements/considerations for supporting different job types.

  1. Moldable: the user must be able to give a number of nodes and a walltime for each such count: (nodecount1, walltime1), (nodecount2, walltime2), etc.
  2. Malleable: this strongly depends on the scheduling technique, but in general: a. the user could specify a min, a max, and a step; b. for walltime, practically speaking, it's not possible to specify a walltime for each step between min and max, and it's not useful for scheduling either. The user must specify a walltime for the min number of nodes.
  3. Evolving: there are actually three types of evolving jobs: unpredictable, partially predictable (can predict either the number of nodes or when the job will evolve) and fully predictable. a. If the job is unpredictable, the request looks like a rigid one. b. If the job is (partially or fully) predictable, the user must be able to hint that. This is hard to design right now.

IMHO, a key point from the above is that there will be many different node counts (depending upon the job type) and each may or may not be associated with a walltime (e.g., the max node count of a malleable job need not necessarily be associated with a walltime, while the min node count must be). So I think it's best when node count and walltime are not separated too much.
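
For illustration only (the field names are invented, not a proposed schema), the three request styles above might flatten into data roughly like this:

# Sketch of the three request types as plain data.

moldable = {                      # explicit (nodecount, walltime) pairs
    "configs": [{"nodes": 16, "walltime": "4h"},
                {"nodes": 32, "walltime": "2h30m"}],
}

malleable = {                     # min/max/step; walltime given for the min
    "nodes": {"min": 2, "max": 10, "step": 2},
    "walltime_at_min": "6h",
}

evolving = {                      # hint only when the evolution is predictable
    "nodes": 8,
    "walltime": "3h",
    "evolve_hint": {"at": "1h", "nodes": 16},
}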

dongahn commented 8 years ago

@surajpkn, my thinking was: the aggregate wall-clock time the job will use stays constant, and the number of task slots that the scheduler will allocate can change for moldability. So a job that asks for 2 node-hours can be molded into 2 hours on 1 node or 1 hour on 2 nodes. This of course should be subject to max, min and other task-slot-count constraints (since a job might require a certain number of cores).

I thought we could also extend the same concept when users want to use the power bound as the main allocation criterion. E.g., 200kW might be the power bound that a job can use, and then the scheduler can mold the job to a different task-slot configuration.

  1. So in this case, walltime 1 != walltime 2, so an aggregate walltime as a constant does not work. In principle, a moldable job would require the same area in the scheduling box (time x task-slot space), though the shape of the rectangle would be different. Do you know why users want to choose different walltimes for different configurations?
  2. Could you elaborate a bit more on the step concept?

I see, so the best spec granularity is to attach a wallclock (or other similar quantity -- power bound) to each configuration?
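
As a concrete example of the constant node-hours idea above (a sketch with made-up numbers): enumerate the configurations a 2 node-hour request could be molded into, subject to min/max node-count constraints.

# Sketch: mold a fixed node-hour budget into (nodes, hours) configurations
# within the job's min/max node-count constraints.

def moldable_configs(node_hours, min_nodes, max_nodes):
    return [(n, node_hours / n) for n in range(min_nodes, max_nodes + 1)]

# moldable_configs(2, 1, 2) -> [(1, 2.0), (2, 1.0)]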

surajpkn commented 8 years ago

@dongahn The definition of a moldable job does not impose a constraint that the area of a moldable job in space-time be constant for different configurations. So constraining moldability to constant node-hours limits it to applications that perfectly fit that constraint.

Step concept as in: if a job is completely malleable between min and max, then the step is basically 1. If it's malleable between min=2 and max=10 but can only work with 4, 6, or 8 and not 3, 5, 7 or 9, the step is 2. Well, in fact, an application can be malleable between 2 and 10 but take only 4 and 8, not 6. But these are too many details. A min and max by themselves should be enough for what the world has now.

Yes, attaching a walltime to each configuration, in a flexible way (because some configs don't need to have walltimes), would be the ideal. That would give room for submitting all types of jobs in more or less the same way.

dongahn commented 8 years ago

@surajpkn Thank you for the clarification! I think this is really helpful -- we can see if our job specification model can support as much of our work as possible. In a similar vein, @SteVwonder, do you have any feedback as to how one could specify IO BW as a constraint in this?

trws commented 8 years ago

I'm finding that I'm a bit conflicted now on how this should work. It might be good to find a time to all meet and discuss what we want to be able to express, and how to make it all fit. As I see it, we can do everything with a per-task resource description, but only if we include the ability to annotate those with non-owning connections to be able to express groupings. On the flip side, we can express everything in terms of the shape of the overall job and then use annotations to guide mapping of tasks into the aggregate. The latter bugs me a bit, because it feels like it over-constrains the system, as is the case with specifying number of nodes and tasks as opposed to tasks and cores per task. Even so, the other also has its issues; maybe there's a more comprehensive way to handle it overall.


SteVwonder commented 8 years ago

Really the key piece that IO BW needs is temporality (which will also benefit evolving jobs). That way you can essentially pre-plan your checkpoints (thus relaxing the burst buffer constraint). E.g., at time A this application will read X MB/s, at time B this application will read/write 0 MB/s, and at time C this application will write Y MB/s.

If temporality is too complicated, simply X MB/s of BW per unit (node, core, etc) will work for what we are already doing with our IO-Aware scheduling. It seems that this would be pretty easy to integrate into what @trws has already suggested.

{ "task-slots" : {
   "count" : 4,
   "resources" : [
   {"$each" :  { 
     "type" : "node",
     "count" : 1,
     "children" : [{"type" : "IO BW",
                    "count" : 5000 },
                  { "type" : "socket",
                    "count" : 2, 
                    "children" : [{ "type" : "core", "count" : 4},
                                  { "type" : "GPU", "count" : 1}]}]
   }
   }]
  }
}

dongahn commented 8 years ago

As I see it, we can do everything with a per-task resource description, but only if we include the ability to annotate those with non-owning connections to be able to express groupings. On the flip-side we can express everything in terms of the shape of the overall job and then use annotations to guide mapping of tasks into the aggregate. The latter bugs me a bit, because it feels like it over constrains the system, as is the case with specifying number of nodes and tasks as opposed to tasks and cores per task.

For the latter, if we allow users to specify the task mapping functions only down to the necessary level (from the outermost task-slot shape down to whatever sub-task-slot shape you need), we can perhaps get around the over-constraining problem?

If the user wants two switches and wants to use any available cores under them to map N tasks, the user only needs the top-level mapping function to distribute the tasks across the two switches and have the scheduler pick any available cores under each switch.

If the user needs to confine the tasks to some set number of nodes from each switch, then a node-specialized sub-task-slot shape will be specified, along with a second-level task mapping function...

Should a resource like a license really need to fit into the task slot? A license is typically not associated with a task but with the job as a whole. Perhaps this is just a co-allocation of a consumable resource.

Probably too late of a night for a good comment...

grondo commented 8 years ago

Should a resource like a license really need to fit into the task slot? A license is typically not associated with a task but with the job as a whole. Perhaps this is just a co-allocation of a consumable resource.

I think this comment speaks to the confusion we're having as to the meaning of assigning resources to "jobs" vs "individual tasks within the job". If we use the task slot shape * ntasks approach (which I really like for actually running work) for all jobs, then yes we'd have to be able to assign resources that are normally assigned to a job as a whole, like licenses, to a task slot shape somehow because that is the only way we've allowed users to specify resources. I think Tom has proposed some ways to make this workable above.

In the case of binding, a license does not need to fit into a task slot. However, it must be (or must have the ability to be) part of the distributed container that is the job, so that all tasks have access to the license(s) (assuming we have a way to constrain them to a container). (As an aside, there are possibilities for constraining access to licenses to jobs by modifying firewall rules at runtime, sending requests to license servers, adding license files to a local namespace, etc...)

From a practical perspective, containment of jobs is separate from affinity and binding of resources. This means that during launch of a program, whether it be a new instance or another parallel program, the container will be created first by the "shell" on each node. This essentially sets up a nice hierarchy of distributed containment in a flux job hierarchy (each child job is a container within the parent's container). Within a job container, CPU and memory affinity can be assigned per launched task and should be thought of as advisory, with user options on how that affinity is assigned. The key concept here is that resources are constrained at the job level, and there should be no obligatory containment below that level unless a child job is started on a subset of that container.

I'm not certain that last paragraph was on-topic, but it seemed like it was a good time to explain it.

I kind of find @trws's 'graph links' example above interesting, but I definitely don't think I followed it completely. It would be interesting to explore that further, perhaps with some examples?

The only place I'm still having trouble, btw, is deciding how the "task slot" concept applies to the creation of new instances. For example, if the center was scheduling a new DAT that was to run across some resource set for a given time, it seems really strange to force them to describe it in terms of task slots. Perhaps for these cases we'd use the degenerate case of a single task slot? (Then the name of the thing seems a bit strange...)

trws commented 8 years ago

@grondo is hitting some of the points I've been circling around over the last day or so. There are a few concepts here that I at least have been unintentionally conflating into one.

Somehow we need to be able to express both groupings and hierarchical allocations, and they're a bit at odds with one another. As you say @grondo some examples will probably help, so I'm going to throw out some specifications of job shapes I expect people to want to run, and we can think about how to actually express these.

  1. 10 tasks, each task requires one core, there is no required grouping between cores: an MPI job with 10 ranks
  2. 10 tasks, each task requires four cores, all four cores for each task must be on the same node: an MPI+OpenMP job that will run a minimum of four threads per rank
  3. 10 tasks, each task requires one core, all must run on the same node, and the node need not be exclusively allocated: a shared-memory process-parallel job; this could be an MPI test job that uses shared memory for in-node communication, for example, but more likely a server or daemon of some kind
  4. 1 task, requiring four nodes: A distributed database, or non-flux-PMI-compatible MPI job with an mpiexec
  5. 8 tasks, run on exactly two nodes: I include this because I think people will insist on doing it, though I think this is not what we should encourage
  6. 8 tasks, each requiring a socket, but with two tasks to each such socket: Again, I don't like this one bit, but I expect people to want to do it
  7. 10 tasks, each requiring one core, all of which will share a single allocation slot of a license

I'll put together some examples of how I think we could specify these, but I'm having a little trouble coming up with a way to specify all of these cases cleanly.

By the way, on the subject of new instances, it could be a single slot that gets allocated in which a bootstrap is run, but since the parent will be used to execute the brokers, each broker can be seen as a task in and of itself. When launching with slurm, it sees each broker as a task as it is; there's no reason flux can't do the same. The only time we can't do that is with the root instance, where we'll have to use an rsh tree or similar to fake it out. That may not cover the entire issue, but it's where my brain has been on the subject.

grondo commented 8 years ago

...but since the parent will be used to execute the brokers, each broker can be seen as a task in and of itself. When launching with slurm, it sees each broker as a task as it is, there's no reason flux can't do the same. The only time we can't do that is with the root instance, where we'll have to use an rsh tree or similar to fake it out. That may not cover the entire issue, but it's where my brain has been on the subject.

Yes, an instance is just a program of which each broker is a task. One reason I like the "task slot" concept here is that, when launching an instance, you just describe the task slot for brokers as a whole "node" (where node is the set of resources from the Node object that are available to the program). If, for testing, you want to run a broker per socket, you can do the same thing equivalently but give a Socket as a task slot.

The nice thing about this is we get away from treating Nodes as something special.

trws commented 8 years ago

Agreed. The only place we'll have to keep treating nodes as special is for handling programs that depend on shared memory. For the broker we'll probably have to have a check that ensures one broker gets at most one node to manage, unless we add a mechanism for it to launch work on siblings.


grondo commented 8 years ago

@trws, your list above is great. However, as I said before I don't think all resource requests will fall into cases where the user can cleanly map to a "number of tasks". I think we'll want some examples where a user wants:

  1. 4 nodes on cluster X (or sockets, or cores)
  2. 4 nodes, each with a GPU

Think about the case of the ATS workload, or some other workload with varying numbers of tasks...

trws commented 8 years ago

I see what you mean @grondo, and those are good additional cases we should look at.

That said, I think we're talking past each other a bit on definitions. Since I'm defining a task as a process launched directly by flux (or a process group, as this logically includes all processes forked from the child), an ATS workload is actually one task as far as the parent instance is concerned. That single process may subsequently launch a great deal more work, but it itself is only one task, as is any batch-script-style workload. ATS might be a special case, though, in that I think the recommended way to launch it will be flux start <shape> ats <ats-script>, based on looking over the ATS code. Then the task spec will be for the instance, and the ats script will be run as a single-task initial command inside the instance.

Does that make any more sense?

grondo commented 8 years ago

Also, one last comment for now. I don't think we need to handle 'hierarchical allocation' specially, though I could be missing something.

Conceptually, if you allocate resources at the leaf nodes of a hierarchical request, it would seem you can naturally express whether you want to allocate "all" of a composite resource or not. That is, when the scheduler allocates a resource it finds an "exact satisfying set" and allocates all of that to the request.

As a stupid example, if you had a cluster with a single node with a single core, and the user requests { core = 1 }, the resource search would start at the root of the tree and find that the "cluster" resource exactly satisfies the minimum request. There is no need to proceed farther into the composite object; we just allocate that object (including its children).

If you have a more typical cluster with, say, 100 nodes of 4 cores each, the scheduler will examine the cluster composite and see an aggregate of { cores = 400 }. Since this more than satisfies the request, the scheduler then examines each node and finds the first has { cores = 4 }; we descend again into the node object and finally find a Core that has { cores = 1 } and allocate that object (which also takes its parents up to the root of the tree, so you allocate a composite resource with { cluster = 1, node = 1, cores = 1 }).

If you want to exclusively allocate a node, you could therefore just express { node = 1 }, or if you wanted nodes with 4 cores, { node = 1, cores = 4 }. To get multiple nodes with different numbers of cores, you express this as a list. If you want to guarantee 4 cores on 1 node, but not exclusive access to the node, you could do a nested request { node = 1, { cores = 4 } }.

This is probably all very naive, but I thought I'd get down in the issue how I've been conceptually thinking about the exclusive and non-exclusive cases...
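
A rough Python sketch of that descent (illustrative only, not the flux-sched tree search): stop at the shallowest composite whose aggregate exactly satisfies the request, otherwise keep descending.

# Sketch: allocate from a resource tree of dicts {"type": ..., "children": [...]}
# by comparing aggregate counts at each level.

def aggregate(resource, rtype):
    """Count resources of type rtype in this subtree (including itself)."""
    total = 1 if resource["type"] == rtype else 0
    return total + sum(aggregate(c, rtype) for c in resource.get("children", []))

def allocate(resource, rtype, count):
    agg = aggregate(resource, rtype)
    if agg < count:
        return None           # this subtree cannot satisfy the request
    if agg == count:
        return resource       # exact fit: allocate the whole composite
    for child in resource.get("children", []):
        found = allocate(child, rtype, count)
        if found is not None:
            return found      # the allocation implicitly takes its ancestors too
    return resource           # no single child fits; a real matcher would
                              # combine siblings instead of keeping the parent

# On a cluster with one single-core node, allocate(cluster, "core", 1) returns
# the whole cluster; on a 100-node, 4-core-per-node cluster it descends to one Core.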

lipari commented 8 years ago

RFC 4 is the place we've attempted to formalize @grondo's last example as well as some of the earlier discussion.

grondo commented 8 years ago

That said, I think we're talking past each other a bit on definitions. Since I'm defining a task as a process launched directly by flux (or a process group, as this logically includes all processes forked from the child), an ATS workload is actually one task as far as the parent instance is concerned.

I was talking about running the ATS workload under a new instance. I kind of assumed the common case for these types of workloads would be to launch them under a new instance to make use of custom scheduling, take load off the parent scheduler and kvs, etc.

So I was talking about how to write resource request for the ATS workload as instance.

BTW, I want to apologize for being really dense. It seems like I'm kind of stuck in old thinking habits and it is taking me a while to understand your terminology and points. :-( I still think the discussion is valuable for me and hopefully others too?

trws commented 8 years ago

Thanks for that @lipari, that RFC lists all the requirements quite nicely. The question, at least in my mind, is how do we express the user's intent to feed that traversal. If we say that resources defined in order at the same level of the request are all treated the same, and only the inner-most nesting level is allocated, that would be one way to handle it. That said, if I saw { node = 1, cores = 4} and { node = 1, { cores = 4 } } without explanatory context I would either assume they were the same or that one was a typo. I agree wholeheartedly that doing allocations as you suggest @grondo is the way to go, especially in that doing the resource search as that kind of tree walk has all kinds of nice search and early-termination properties. I'm just hoping we can find a relatively self-documenting way to write the queries. This is actually why I like the graph links as a concept, perhaps I'll just co-opt some graph database syntax and give some examples...

By the way @grondo, I think this is all useful, and you're not being dense at all. We're having trouble finding a common glossary and syntax to discuss this problem, because it's new and we all have different reference points that give us different internal representations when we read the same words. (or maybe not all of us, but certainly I seem to be having that problem...)

grondo commented 8 years ago

Yeah, agreed we need a better syntax than what I proposed above. That is why I was kind of interested in your graph links idea above.

And thanks for taking pity on me :-)

hautreux commented 8 years ago

Sorry to enter this discussion a bit late, but what a discussion, I am really impressed!

I am also thinking that the task-slot logic is a great idea. It enables us to properly express the resources required by each task and ensure that each one will get what is necessary for its workload. Having the capability to launch more tasks on each task slot can also be valuable, providing a sort of overcommit logic that can be useful in some conditions.

Task slots also ease the mapping of tasks across the allocated resources taken as a whole for the job. In a sense they directly provide the first mapping function that can be used to properly split the globally allocated resources for the job among the tasks. Then you just have to add optional additional (e.g. hierarchical) mapping descriptions and then the ranking logic, and you are almost all set.

The only thing that concerns me a bit is the issue expressed by @grondo concerning the following resource requirements: 4 nodes on cluster X (or sockets, or cores); 4 nodes, each with a GPU.

Do you plan to have a job request be a composite of {task-slot, number of tasks} elements, or is a job made of only one task-slot description applied to all the tasks, plus additional resources (like licenses) requested directly at the job level?

With requests made of multiple {task-slot, number of tasks} elements (what I would call a task-group), we could express asymmetric resource requirements for tasks more easily, as well as shadow tasks that would just accumulate resources (like licenses) without really attaching them to particular tasks. We could have, for example:

{
  "task-group" : {
     "name" : "A",
     "task-slot count" : 4,
     "task-slot" : {
        "ntasks" : 1,
        "resources" : [
          {"$each" : {
             "type" : "node",
             "count" : 1,
             "children" : [{ "type" : "socket",
                             "count" : 2,
                             "children" : [{ "type" : "core", "count" : 4 }]}]
          }}
        ]
     }
  },
  "task-group" : {
     "name" : "B",
     "task-slot count" : 4,
     "task-slot" : {
        "ntasks" : 1,
        "resources" : [
          {"$each" : {
             "type" : "node",
             "count" : 1,
             "children" : [{ "type" : "socket",
                             "count" : 2,
                             "children" : [{ "type" : "core", "count" : 4 },
                                           { "type" : "GPU", "count" : 1 }]}]
          }}
        ]
     }
  },
  "task-group" : {
     "name" : "C",
     "task-slot count" : 1,
     "task-slot" : {
        "ntasks" : 0,
        "resources" : [
          {"$each" : {
             "type" : "license",
             "name" : "lic_name",
             "count" : 1
          }}
        ]
     }
  }
}

This would ask for a job having:

This is extensible to any number of task-groups, enabling resource requirements to be factorized per type of task. It can then be transposed into the global resource requirements for the scheduler and mapped back again to the task layout (at least at a first level) among the resources at launch time. One benefit is that you can then launch programs (as defined in RFC 8) using one, some or all of the task-groups of the job (useful for MPMD or co-scheduled distributed apps).

Hoping that it makes some sense...

trws commented 8 years ago

It does, and thanks for chiming in @hautreux, this is good feedback! My preference at least was to have it require at least one task slot description, the default, and a number for those, but then allow multiple additional slot descriptions if desired. So the task-group could just be multiple top-level task-slot specifications. We could certainly do it as a flat list, but for some reason I like the ability to give it a default then override where necessary a bit better than always specifying a list. Maybe we should try both, at least in terms of writing up example specs, and see how they look?

grondo commented 8 years ago

Great, I think this is starting to come together! I was thinking the underlying specification would support a list of task slots, with efficient syntactic sugar on top to allow easily creating lists of identical task slots. That being said, I'm not at all opposed to @trws's idea of a default with overrides where necessary. I'd be interested in seeing examples of these.

@hautreux, with one comment you are adding a lot to the discussion!

grondo commented 8 years ago

BTW, can someone give me some background on the "$each" operators everyone's using? I feel like I'm left out of the party. :-)

(Is it coming from MongoDB?)

trws commented 8 years ago

I'm not sure if Mongo actually has that operator, but I used it as a convenient construct whose meaning was (hopefully) easy to infer. =)

hautreux commented 8 years ago

I do not exactly see what you mean, @trws, by a default with overrides, so I would be interested in some examples too. I guess that as long as there is a way to express the asymmetry of resource requirements that the various sets of tasks can have, I will think it is okay.

Concerning the $each, I am in the same situation; I only did a cut/paste on that :)

trws commented 8 years ago

So, what I was referring to earlier, with respect to treating it like a graph, was going back to the concept that every resource is a node in a graph that can have multiple links to other resources, each of which can have a type. Another way to say this is that there are a number of networks and resources can be members of various networks; they're the same thing with slightly different semantics. Anyway, I'm just going to go for it and see where this ends up.

Say a basic job spec definition with a resource spec for one core per task, specifying everything explicitly might look like this:

tasks: 5
job-resources: #job level resources, I'm adding this because I couldn't see a way around it, a list of job-level resources that can be referenced from the rspec
  - id: 17
    type: license
rspec:
    type: Core
    count: 1
    tasks: 1 # defaults to one, meaning one of these rspecs per task, to get one total, use *
    sharing: exclusive
    contains: []
    links:
      - type: uses
        direction: out
        target: 17
rspec-override: []

The only extra thing is that when the type is "Group," that means it is a logical resource there to link to or group other resources, but it does not itself represent a resource. links is a list of pairs of link type and target; a link of type has-a is equivalent to listing the target under the contains list, which is just a convenience to save typing. The id field is a unique identifier that maps only to resources in the job spec; it's here for back-references and to ensure that multiple resource specs that refer to a given entity get the same entity.

Examples:

1) 10 tasks, each task requires one core, there is no required grouping between cores: an MPI job with 10 ranks

tasks: 10
rspec: Core

2) 10 tasks, each task requires four cores, all four cores for each task must be on the same node: an MPI+OpenMP job that will run a minimum of four threads per rank

tasks: 10
rspec:
  type: Node
  sharing: shared #global sharing, self or other jobs, by default all non-leaf levels are shared
  contains:
    - type: Core
      count: 4  #exclusive by default

3) 10 tasks, each task requires one core, all must run on the same node, node need not be exclusively allocated: A shared-memory process-parallel job, this could be an MPI test job that uses shared-memory for in-node communication for example, more likely a server or daemon of some kind

tasks: 10
job-resources:
  - type: Node
    id: 1
    sharing: shared
rspec:
  type: Core
  count: 1  #exclusive by default
  links:
    - type: has-a
      direction: in 
      target: 1  #all cores allocated must have an in-coming has-a link from the same node

Possible alternate, using an explicit marking of tasks at multiple levels:

tasks: 10
rspec:
  type: Node
  tasks: *
    type: Core
    count: 1
    tasks: 1

4) 1 task, requiring four nodes: A distributed database, or non-flux-PMI-compatible MPI job with an mpiexec

tasks: 1
rspec: 
  type: Node
  count: 4

5) 8 tasks, run on exactly two nodes: I include this because I think people will insist on doing it, though I think this is not what we should encourage

tasks: 8
rspec:
  type: Node
  count: 2
  tasks: * #figure out how many to put here, rely on mapping

6) 8 tasks, each requiring a socket, but with two tasks to each such socket: Again, I don't like this one bit, but I expect people to want to do it

tasks: 8
rspec:
  type: Socket
  tasks: 2  # top tasks % tasks must == 0

7) 10 tasks, each requiring one core, all of which will share a single allocation slot of a license

tasks: 10
job-resources:
  - type: License
    id: 1
    sharing: job-exclusive
rspec:
  type: Core
  links:
    - type: uses
      direction: out
      target: 1

8) 4 nodes on cluster X (or sockets, or cores)

tasks: 4
rspec:
  type: Node
  links:
    - type: has-a* # * indicating any number of intervening links of this type may be traversed to satisfy
      direction: in  # incoming has-a links are non-owning, but grouping
      target: cluster_X #global name allowed in place of ID where such exists

9) 4 nodes, each with a GPU

tasks: 4
rspec:
  type: Node
  contains:
    type: GPU

I'm not sure how much I like these; my opinion of some of it changed even while putting this together, but at least it's a place to start. Having a way to specify how tasks are divided at multiple levels seems to help a lot; without that, this looks a great deal messier.

trws commented 8 years ago

Oh, one quick extra one for the default with overrides case:

10) 5 tasks, first four each use one node, last uses 4 nodes

tasks: 5
rspec: Node
rspec-overrides:
  - task: 5
    rspec:
      type: Node
      count: 4
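
For what it's worth, here is a hedged sketch of how a consumer might expand the default-plus-overrides form into one rspec per task (an illustration of the intended semantics, not a proposed API; the "rspec: Node" shorthand is expanded by hand here):

# Sketch: expand {tasks, rspec, rspec-overrides} into a per-task list,
# applying an override where one names that task and the default otherwise.

def expand(spec):
    default = spec["rspec"]
    overrides = {o["task"]: o["rspec"] for o in spec.get("rspec-overrides", [])}
    return [overrides.get(task, default) for task in range(1, spec["tasks"] + 1)]

spec = {
    "tasks": 5,
    "rspec": {"type": "Node", "count": 1},
    "rspec-overrides": [{"task": 5, "rspec": {"type": "Node", "count": 4}}],
}

# expand(spec) -> four single-node rspecs followed by one four-node rspec
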
grondo commented 8 years ago

It will take me a bit to digest this, but @trws I think you are on to something here. You might have to explain the links: idea to me in more detail, though. It feels awkward to me to specify the more generic part of a request inside the more specific part (e.g. the 4 nodes on cluster_X). (Compare to the OAR examples, which I find much more natural.)

I like the yaml markup a lot better. I think it would be even more readable if tags or strings could be used instead of ids for references?

trws commented 8 years ago

I agree, the links thing feels odd where it is, and specifying them in all the separate pieces feels especially painful. Doing the type:ntasks thing makes it much cleaner, and maps more to what OAR does, but it doesn't solve problems where we need links to things that aren't hierarchically nested. As to strings/tags, I agree names are better wherever possible, so long as there's a way to assign a new identifier to a group or object inside a job spec; without that it would be impractical to specify certain types of relationships portably.

For links, maybe we could co-opt some of the neo4j cypher syntax?

tasks: 4
rspec:
  type: Node
  links:
    - <--[has-a*]--(cluster_X)
    - (self)--[has-a]-->(Core: {count: 2})

The cluster example could also be re-worked in the OAR style:

tasks: 4
rspec:
  id: Cluster_X
  tasks: *
  contains:
    - type: Node

Here id actually ties out to the definition for the system, and the implicit tasks: 1 on the inside allocates one node per task in that cluster... This clearly needs some work, but it looks more palatable to me already.

lipari commented 8 years ago

@trws, thank you for pulling this together. There is a good deal of overlap with the flux-sched/resrc implementation. I could think of ways to support most of your scenarios above. The flux-sched/resrc resource tree construct relies on the composite nature of the composite resource pool described in RFC 4. Hence, it would be easier to support the alternate representation of item 3 above than it would the first version that includes a link. Similarly, in item 8, I would propose requesting a resource type of "cluster" with the name "cluster_X" (shared) that has 4 nodes (exclusive or shared).

lipari commented 8 years ago

Another idea... While you propose a default of one task per resource spec, you could also establish a default of one task for the job spec definition. Then it could effectively serve as a traditional resource request for a job that will launch multiple programs of various task counts (like ATS and UQ), where the notion of task count is irrelevant. To be redundantly clear, a parent job requests just 4 nodes (defaults to 1 task), but this launches 1000's of child programs over time, each stipulating different task count/rspec's per the above plan.

dongahn commented 8 years ago

I didn't have a chance to review this discussion thread. But what @lipari describes is also important for a tool session. Think of this as batchxterm or mxterm.

trws commented 8 years ago

@lipari I'm not sure how that differs from the capability used in (4). Could you perhaps put together an example? We certainly need to be able to do that, but I would rather have it be easy to request fine-grained allocations and possible but less easy to request a monolithic aggregate than the other way around. That way we get more information in the simple case rather than the rare case.

lipari commented 8 years ago

I am proposing modifying scenario (4) to allow tasks to be omitted and the job spec request to still be valid. Indeed it does lower the barrier to requesting a monolithic aggregate. However, it allows us to support a job like the one described in Issue 339. The job spec for that example would request at least 40 nodes, and the child programs that launch subsequently have task counts of varying sizes and durations, and fit into available resources as they become available. At some point, the 40-node program will run across all the job's resources (assuming the request was for exactly 40 nodes).

grondo commented 8 years ago

I would rather have it be easy to request fine-grained allocations and possible but less easy to request a monolithic aggregate than the other way around. That way we get more information in the simple case rather than the rare case.

Actually I'm not sure we've established that the ntasks * rspec usage would even be the more common case. This would require that every "job" request in the queue is a simple parallel program execution, whereas I have a feeling that there will be at least as many queued jobs that are requests for instances with an "initial program" or batch script. (I have no proof of this of course, and this is just a feeling.)

The reason for this feeling is at least the following

The language to express these kinds of requests should therefore be easy, but powerful.

grondo commented 8 years ago

I guess to be even more clear, if we imagine that the real-world flux job hierarchy will be flat, then I would agree that most job requests will contain a known number of tasks to run. If however, we imagine that the job hierarchy will be 1, 2, or more levels deep, then we could say that (certainly at the root and internal job instances of the hierarchy) most requests may not have a known number of tasks, but may be requesting resources in the aggregate.

I guess the main point here is that, much like in all resource managers before it, there will likely be a usage differentiation between making a request for program that is a new instance, vs a program that is a simple parallel application. In the case of a new instance request, the common case by far will be a request for resources in the aggregate (possibly moldable/malleable), in the case of a parallel app, the request will (almost) always be a number of tasks and a task-slot or rspec.

trws commented 8 years ago

The case described in #339 for 40 nodes, assuming the syntax and model that I've been assuming, would be submitted by something like this:

flux start --rspec "{type:Node, count: 40}" ats <ats file to load>

When running an instance, it is not a single-task entity; each broker is a task. If a user wants to do what they would currently use sbatch for, by which I'm considering anything that would employ slurm job steps, they need a flux instance to support the equivalent of a "job step" anyway, so they would do the same thing, wouldn't they? There may be a flag on start, or alternately on submit, to make it asynchronous or an instance respectively, but either way an instance is a multi-task thing, always.

In the further, hopefully unlikely, event that a user wants to run a single program that will consume multiple nodes, say an MPI job that has to use an MPI that flux doesn't directly support and so uses mpirun with rsh, it would look more like this:

flux submit -t 1 --rspec "{type:Node, count: 40}" mpirun <some crazy mpi thing>

It's entirely possible that I'm just missing how common the last case will be, but as I see it anything that would currently use job-steps and anything that is a sub-instance is not an aggregate request, at least it does not need to be. Are there a large number of users doing sbatch <single thing with no job steps but multiple nodes>?

grondo commented 8 years ago

@trws, no, I understand your point. A new instance is a parallel program, you are right. However, from the user's point of view, they are thinking about the resource requirements of the workload that will run under the instance; they should not have to know about instance launch semantics (though maybe you are proposing they should?).

Actually this feels a bit like splitting hairs at this point...