flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Job-shape specification #354

Closed: trws closed this issue 7 years ago

trws commented 8 years ago

This ties back to the job submission API #268, shell-based exec #334, and RFC 8.

We have a concept that I'm finding I really like, the task slot, but we haven't specced out how users will specify the shape, number, and distribution of task slots, nor how their tasks will map to those slots. This issue is for discussion of the matter and an initial implementation.

At least for me, this is a much more digestible concept than the classic node/task count used in other places. Enough so that I propose we build everything around a number of tasks, a default task slot shape, and specializations on individual task slots or task mappings as necessary. The reason is that it's completely unambiguous what every task gets, and the scheduler/system can place tasks as needed, as opposed to having to determine how to jam tasks into nodes without explicit context. This should simplify containerization etc., and we can easily provide the other interface style on top of it by defining a mapping.

This is what I'm targeting now:

Components:

I'm going to put together a prototype interface in capacitor for this today so we can see how we like it, but I'd like to hear everyone's thoughts on the matter. @dongahn, @garlick, @grondo, @lipari

For a quick summary of the discussion that follows, jump to:

https://github.com/flux-framework/flux-core/issues/354#issuecomment-225686820
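For concreteness, a hedged sketch of what a task-slot-centric request might look like (the key names here are hypothetical, invented only to illustrate the task-count + default-slot + override structure proposed above):

tasks: 16            # number of tasks to run
slot: Core*2         # hypothetical: default task-slot shape, replicated once per task
overrides:           # hypothetical: specializations on individual task slots
  - task: 1
    slot: Node       # task 1 gets a whole node instead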

lipari commented 8 years ago

While this discussion will be of value for posterity, it amplifies nuances that would have probably been better discussed in person. In this case, the example you presented:

flux start --rspec "{type:Node, count: 40}" ats <ats file to load>

is the point I was going for: a resource spec devoid of any task info. However, in addition, I don't think the user should have to request a number of brokers, or a task count to reflect the number of brokers that will be instantiated. I think it is safe to assume a 1:1 relationship between ranks and nodes, and to provide the ability to override that default. Hence the flux-start example would automatically instantiate a broker on each of the 40 nodes. As to the likelihood of running mpirun across a set of nodes with only one task, I agree that this will probably be rare.
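For illustration, a hedged sketch of that default and its override (the --brokers-per-node flag is hypothetical, shown only to suggest the shape of an override):

flux start --rspec "{type:Node, count: 40}" ats <ats file to load>                       # default: one broker per node, 40 brokers
flux start --brokers-per-node 2 --rspec "{type:Node, count: 40}" ats <ats file to load>  # hypothetical override: 80 brokers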

grondo commented 8 years ago

While this discussion will be of value for posterity, it amplifies nuances that would have probably been better discussed in person.

Not sure I agree. We have more stakeholders viewing this thread than can just meet in person. I also find seeing the written examples helpful, and like to have time to read and respond.

Also, I think in Tom's example "{ type:Node, count: 40}" the count is the task count.

grondo commented 8 years ago

BTW, that is a nice example:

flux start --rspec "{type:Node, count: 40}" ats <ats file to load>

Some questions it brings to mind:

trws commented 8 years ago

@grondo That's a good question. I'm not sure how we would handle that case in flux-start now that you mention it, but the naïve way might not be so bad: just run one broker on each node from which a resource has been allocated. There's no reason flux-start can't be defined to take a single aggregate in that fashion; in fact the given example assumes that to simplify it. That said, giving it a finer-grained spec isn't necessarily a problem either.

Options:

Take an aggregate and run brokers appropriately as a side-effect:

flux start --rspec "{type:Core, count: 256}" ats <ats file to load>

Take the shape of the jobs the instance is likely to run, somewhat like slurm does with -n on sbatch:

flux start -t 256 --rspec Core ats <ats file to load>

Not sure we want to even consider supporting this, but maybe specify that only one broker should be run for each four allocated nodes, listing the other nodes in the allocated resource pool but only directly running tasks on broker nodes, and allocating the others with a flag, perhaps?

flux start -b 20 --rspec "{type:Node, count: 4}" ats <ats file to load>

For run or submit the meaning of an aggregate would be to run one task only, unless a user specifies otherwise, in which case the default behavior would be to give each task one such aggregate.
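To make that concrete, a hedged sketch of the two behaviors (flag names follow the examples above but are by no means settled):

flux run --rspec "{type:Core, count: 256}" app       # aggregate only: one task spanning all 256 cores
flux run -t 16 --rspec "{type:Core, count: 16}" app  # task count given: 16 tasks, each with its own 16-core aggregate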

lipari commented 8 years ago

Just seeing @trws's response now, but here's what I thought off the top of my head. I see similarities to Tom's reply.

How would this request be sent to the scheduler queue to be run later?

flux submit --rspec "{type:Node, count: 40}" ats

where count refers to the number of Nodes. Then at run time, this would be invoked:

flux start --size=40 ats

What if you knew only that ats needed 256 Cores and the number of Nodes is irrelevant?

How about this?

flux submit --rspec "{type:Core, count: 256}" ats
flux start --size=256 --distribute_1_rank_per_core ats

trws commented 8 years ago

It's not really splitting hairs, it's just being precise about what something means. Let's take Slurm for example:

  1. -n 5 run 5 tasks, each on one node for some reason
  2. -N 5 run 1 task on 5 nodes
  3. -N 5 -n 5 run 5 tasks, each on one node
  4. -N 1 -n 5 run 5 tasks on one node
  5. -N 5 --ntasks-per-node 5 run 25 tasks on 5 nodes

In this model, the resources act as a container for the tasks and limit them in general. As @lipari has pointed out, a user can specify resources per task in some ways as well, but that's not the default model. What I'm proposing is to have resource requirements be replicated, one per task, by default. If a user wants to use the resources as containment or for oversubscription, the model can support that, but that's not the default. So the above would become:

  1. -t 5 run 5 tasks, each on one default resource unit, probably a core or node
  2. -r '{type:Node, count:5}' run 1 task on 5 nodes
  3. -t 5 -r Node run 5 tasks, each on one node
  4. -t 5 -r '{type:Node, tasks: *}' run 5 tasks on one node
  5. -t 25 -r '{type:Node, count: 5, tasks: *}' run 25 tasks on 5 nodes

Perhaps even adding a --tasks-per=<resource type>=<num per>,... would make it cleaner, like this:

  1. -t 5 run 5 tasks, each on one default resource unit, probably a core or node
  2. -r '{type:Node, count:5}' run 1 task on 5 nodes
  3. -t 5 -r Node run 5 tasks, each on one node
  4. -p Node=5 -r 'Node' run 5 tasks on one node
  5. -p Node=5 -r '{type: Node, count: 5}' run 25 tasks on 5 nodes

Actually... I kinda really like that last set...

trws commented 8 years ago

Here's a quick rework of the earlier examples with that. Also, links are now top-level and are specified either by "link-type>" for out-links or "<link-type" for in-links; they are recursive by default, but take a {min[:max]} links specifier if desired to speed lookups, and there's a new shortcut for type/count of <type>*<count>:

1) 10 tasks, each task requires one core, there is no required grouping between cores: an MPI job with 10 ranks

tasks: 10
rspec: Core

2) 10 tasks, each task requires four cores, all four cores for each task must be on the same node: an MPI+OpenMP job that will run a minimum of four threads per rank

tasks: 10
rspec:
  type: Node #non-leaf, shared by default
  has>: Core*4 #exclusive by default

3) 10 tasks, each task requires one core, all must run on the same node, node need not be exclusively allocated: A shared-memory process-parallel job; this could be an MPI test job that uses shared memory for in-node communication, for example, or more likely a server or daemon of some kind

per: Core=1
rspec:
  type: Node  #non-leaf, shared by default
  has-a>:  Core*10

4) 1 task, requiring four nodes: A distributed database, or non-flux-PMI-compatible MPI job with an mpiexec

rspec: Node*4

5) 8 tasks, run on exactly two nodes: I include this because I think people will insist on doing it, though I think this is not what we should encourage

per: Node=4
rspec: Node*2

6) 8 tasks, each requiring a socket, but with two tasks to each such socket: Again, I don't like this one bit, but I expect people to want to do it

tasks: 8
per: Socket=2
rspec: Socket #total sockets derived by ceil( tasks/tasks-per-socket)

7) 10 tasks, each requiring one core, all of which will share a single allocation slot of a license

tasks: 10
job-resources: IntelLicense  #implicit ID matching type for single occurrence
rspec:
  type: Core
  uses>: IntelLicense

8) 4 nodes on cluster X (or sockets, or cores)

tasks: 4
rspec:
  type: Node
  child-of: cluster_X #shortcut for <has: cluster_X for readability, just syntactic sugar 

9) 4 nodes, each with a GPU

tasks: 4
rspec:
  type: Node
  has>: GPU

10) 5 tasks, first four each use one node, last uses 4 nodes

tasks: 5
rspec: Node
rspec-overrides:
  - task: 5
    rspec: Node*4

I at least find this a lot easier to process than the first set. If you want an aggregate and one task, just fill in the rspec. If you want multiple tasks, either specify more rspecs, specify how many tasks should be placed on each instance of a resource type, or specify the number of tasks. Thoughts?

It could probably still use some condensing for command-line use; I'm not sure JSON on the command line is going to cut it for anything but attributes, but as a concept at least?

lipari commented 8 years ago

As long as we're splitting hairs, I'd like to propose another distinction:

flux-submit submits a request for resources to the resident scheduler. It is task agnostic.
flux-start instantiates a Flux instance and executes a command.
flux-run (real name TBD) launches parallel tasks across specified resources.

I'm thinking that most of the above discussion refers to flux-run, particularly the notions of task counts and slot definition. I see the ats script made up of a bunch of flux-run commands.
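A hedged sketch of how that division of labor might look on the command line, reusing examples from earlier in the thread (command names and flags are not settled):

flux submit --rspec "{type:Node, count: 40}" ats   # task-agnostic resource request to the resident scheduler
flux start --size=40 ats                           # instantiate a Flux instance and execute a command
flux run -t 40 mytask                              # launch parallel tasks across selected resources (hypothetical flags)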

lipari commented 8 years ago

Hold on, I've had a refinement to my thinking. flux-submit should indeed, optionally, accept the task count and slot definitions. The ats scheduler will need to find and select the "slots" for each task. flux-run dutifully launches (and optionally confines) the requested number of tasks across the resource slots the ats scheduler selects. So, when submitting an ats job, flux-submit submits a request to the parent scheduler for aggregate resources. But within the ats script, the ats (child) scheduler receives and schedules requests for n tasks over defined task slots.

trws commented 8 years ago

I just saw your new response @lipari, so I think we're basically on the same page, but just so it's out there I have been thinking of the commands as:

flux-submit: submits a request to run potentially parallel tasks asynchronously from the launching shell when resources become available; equivalent to sbatch, without the requirement to specify a batch script instead of what you actually want to run.
flux-{wreck}run: interactively run a potentially parallel command on resources acquired from the enclosing instance through its standard scheduling mechanism; basically submit, but with a wait and wired-up IO.
flux-start: instantiate a flux instance and run a command; if run inside a flux instance, use flux-run to acquire resources and bootstrap, unless asked to use flux-submit.
flux-run{exec?}: run work right now, on exactly what I specify, without asking the scheduler first.

None of these are task agnostic, largely because I find it unintuitive to have a different interface on each one, and to require a script to be able to run a batch job. To be clear, if a user wants to use submit exactly the way they use sbatch, it will do what they expect, but it would also let them directly run their work, for example:

#run-script-I-have-to-use-for-some-reason.sh
#!/bin/bash
$1 my_actual_work

#cmdline
$ sbatch -N 50 run-script-I-have-to-use-for-some-reason.sh srun
$ flux submit -r 'Node*50' run-script-I-have-to-use-for-some-reason.sh frun
#or
$ flux submit -t 50 -r Node my_actual_work
hautreux commented 8 years ago

I agree with the various comments about separation of concerns between resources scheduling and tasks execution, especially the initial comment of @lipari concerning the differences between flux-{submit,start,run}.

I find it really convenient to express the resource requirements for a job as the aggregation of its tasks' individual resource requirements (not necessarily identical among the tasks). Grouping required resources for similar tasks and assigning a count to each group is definitely a great way to describe what a user wants.

From the point of view of a resource allocator, the notion of tasks seems useless. However, having the information about the different groups of resources that are required, and their respective counts, is very interesting (IMHO mandatory). Replace that with just one big chunk of resources per group and you lose the initially requested granularity, which certainly corresponds at some point to the way resources would be released by malleable/moldable allocations (or failing tasks :)). The scheduler would really benefit from keeping that information.

After reading all of that (large) thread, I am thinking that a notion of "resource shard" could be better suited than a task-oriented description:

A flux instance could be launched by its parent using the set of potentially heterogeneous allocated resource shards (or a single one), starting broker(s) according to the preferred boundary logic (a single one, one per node, one per shard, ...).

In terms of YAML, it could be something like:

rshard:
    count : 4 # 4 shards of this type
    rspec :
        type: Node
        has>: GPU
        walltime : 2h
    jobspec : # ignored by the scheduler or filtered out while submitting
        taskcount : 1 # one task per shard
rshard:
    count : 1 # 1 shard of this type
    rspec :
        type: Node
        count : 4
        has>: GPU
        walltime : 2h
    jobspec : # ignored by the scheduler or filtered out while submitting
        taskcount : 1 # one task per shard

As you can see, I am arguing again in favor of a list instead of the concept of overrides. IMHO, it would be better to have a flat representation of the shards (~ task-slots), especially when most of them are different. When you only have one type, both representations are similar, but when you start having a large number of different shards, overrides are asymmetric in terms of representation. In the case of malleable/moldable jobs where you can lose shards, it is easier to understand/represent as a growable/shrinkable list.

I currently share @grondo's statement: It feels awkward to me to specify the more generic part of a request inside of the more specific part. I see your point, @trws, about finding a way to represent non-hierarchical relations, but it is a bit counter-intuitive to me and pretty complex for day-to-day usage. It would require really easy-to-use command-line client parameters to mask that complexity, as you suggested. I still need to think on that subject.

grondo commented 8 years ago

#run-script-I-have-to-use-for-some-reason.sh
#!/bin/bash
$1 my_actual_work

I would say the majority of batch scripts in the real world aren't just a single srun. This was one driver behind the hierarchical design of flux. So while I agree with your disdain for the batch script in general, the main usage I see for "batch scripts" in the flux paradigm is handling multi-part workloads like ats, workflows, group projects, and DATs. Any batch script that is just a wrapper for srun could be replaced with a direct flux submit as in your last example above; however, my feeling is that in the real world, this might actually be the uncommon case.

hautreux commented 8 years ago

I agree with @grondo on that.

One thing we did at one time (before using Slurm) was to have a partition dedicated to running batch scripts (time-shared, constrained by memory footprint). Parallel launches inside batch scripts used different partitions, thus avoiding allocating all the resources while the batch scripts were doing prolog/epilog work that could last a long time (depending on FS access time, compilations involving licenses, ...). This had some drawbacks (in scheduling, for example) but was pretty effective at avoiding the loss of computing cycles to non-trivial batch scripts (which are very frequent).

grondo commented 8 years ago

From the point of view of a resource allocator, the notion of tasks seems useless. However, having the information about the different groups of resources that are required, and their respective counts, is very interesting (IMHO mandatory). Replace that with just one big chunk of resources per group and you lose the initially requested granularity, which certainly corresponds at some point to the way resources would be released by malleable/moldable allocations (or failing tasks :)). The scheduler would really benefit from keeping that information.

:+1:

springme commented 8 years ago

Just chiming in to agree that it's great to see the discussion happening online. If we do want to have an interactive "live" discussion we can set up a call if Matthieu wants to help us figure out a time when we would overlap without keeping anyone too late at work. A phone call is also not as good as face-to-face but I think that everyone is doing well with this discussion list. - Becky


hautreux commented 8 years ago

I am mostly a spectator who tries to understand what you are all trying to do when I can. This format suits me well; as @grondo said, it leaves time to read and respond, which is great. I will keep trying to give my point of view when I think it might help. Do not hesitate to stop me too :).

springme commented 8 years ago

Matthieu,

It's great to hear from you during the discussions. Please do continue being a spectator who speaks up as often as you'd like. It's very helpful for flux!

Becky


trws commented 8 years ago

note: Apologies for all the long posts in this; I think better by assembling real examples at each phase, so I hope you'll indulge me a bit.

Before the syntax, to @grondo's point above: I'm sure you're right that many batch scripts have a number of phases, and they're certainly something we need to support. That said, as you pointed out, the majority of the use-cases for them are actually tied to being invoked inside an instance. Capacitor is, in some ways, a form of batch script as much as ATS or the others are, right?

Also, @hautreux, your note about using a partition for the non-job-step phases of jobs is extremely interesting, and might make for a very good short-term source of evolving jobs. @surajpkn, have you looked at using serial/job step phases as evolving jobs? It seems like it might be a great source of traces...

I think we're getting to the point now of having the same specification but reversing the order of some things, which is a good place to be, as it probably means we're starting to agree on more. Also, I see your point about having it be a list with a multiplier option; that does seem to make it cleaner, as does turning the task count specification inside out. Let's give this a whirl and see what it looks like with a list of rspecs, per to specify task counts on each rspec, and the current link syntax.

This might be getting too dense, but inverting where the task count is specified makes it pretty easy to tie to inner parts of the syntax.

The extra-short syntax now includes chaining, [input-link>]<type>[*<num>][+<n-tasks-per>][[><has-a link>]|[>:link-type:>]], with left-to-right evaluation and () for grouping.

1) 10 tasks, each task requires one core, there is no required grouping between cores: an MPI job with 10 ranks

Core+1*10 # ((Core resource), [+1] 1 task each), 10 of these

or

per: Core=1 #equivalent to either per: Core=1 in the rspec or tasks:10 in the rspec
rspec: 
    type: Core
    count: 10

2) 10 tasks, each task requires four cores, all four cores for each task must be on the same node: an MPI+OpenMP job that will run a minimum of four threads per rank

note: a group is introduced here because Node+1*10 > Core*4 could be interpreted as requiring 10 distinct nodes with a task bound to each, whereas what we actually want is 10 groups of four cores where each group is on a node and more than one group may be on the same node; this may need further refinement.

(Node > Core*4)+1*10

or

rspec: 
    per: Group=1
    type: Group
    count: 10
    has>:
        type: Node
        has>: Core*4

or

rspec: 
    per: 1
    count: 10
    has>:
        type: Node
        has>: Core*4

3) 10 tasks, each task requires one core, all must run on the same node, node need not be exclusively allocated: A shared-memory process-parallel job; this could be an MPI test job that uses shared memory for in-node communication, for example, or more likely a server or daemon of some kind

Node > Core+1*10

4) 1 task, requiring four nodes: A distributed database, or non-flux-PMI-compatible MPI job with an mpiexec

Node*4

5) 8 tasks, run on exactly two nodes: I include this because I think people will insist on doing it, though I think this is not what we should encourage

Node+4*2

6) 8 tasks, each requiring a socket, but with two tasks to each such socket: Again, I don't like this one bit, but I expect people to want to do it

Socket+2*4

7) 10 tasks, each requiring one core, all of which will share a single allocation slot of a license

Core+1*10>:uses:>IntelLicense

8) 4 nodes on cluster X (or sockets, or cores)

cluster_X > Node*4

9) 4 nodes, each with a GPU

Node*4>GPU

10) 5 tasks, first four each use one node, last uses 4 nodes

rspec: 
    - Node+1*4
    - Node*4

11) @hautreux's example

-j walltime=2h -r Node+1*4>GPU -r Node*4>GPU
pspec :
    walltime : 2h
rspecs :
    - per: Node # 1 task per node in this sub-spec
      type: Node
      count : 4 # 4 shards of this type
      has>: GPU
    - tasks: 1 # 1 task total for this sub-spec
      type: Node
      count : 4
      has>: GPU
    # alternately
    - per: Group # 1 task total for this sub-spec
      type: Group
      has>:
        type: Node
        count : 4
        has>: GPU

Is that somewhat more palatable?

grondo commented 8 years ago

@trws, I like all the examples, super helpful!

I think the key you're missing from your latest examples is the insight @hautreux had that the scheduler doesn't care about tasks. I like the intuitive, straightforward proposal he made for a set of resource shards or groups, with the jobspec kept separate.

Maybe I'm missing something in your latest examples, but, for instance, I find example 11 hard to read and much prefer @hautreux's version, which I think even someone who hasn't followed our discussion can understand.

trws commented 8 years ago

I see the benefit of having the number of tasks or tasks-per-shard decoupled from the top level of the job, but I don't actually see the benefit of having a "jobspec" as a separate nested level under the shard. It seems like an extra level of nesting we don't necessarily need. Maybe there's a middle ground there: if I drop the "per-spec" thing such that a shard is always assumed to be the "per" item, then all of the short examples above still apply, just assuming that each rspec is a shard, but the longer ones would change.

11 might become something like this?

walltime : 2h # applies to all shards
shards :
    - count: 4
      tasks-each: 1 # probably just call this tasks, but this seems less ambiguous,
                    # or have this and tasks-total?
      rspec :
        type: Node
        count : 4 # 4 shards of this type
        has>: GPU
    - tasks-each: 1 # 1 task total for this sub-spec
      # implicit count 1
      rspec: Node*4>GPU

grondo commented 8 years ago

We may not need the jobspec, but it helped the clarity 100%. This also makes it clear that the resource shard is a resource specification including an optional jobspec (and perhaps extensible to other things in the future; I'm not sure what). The jobspec itself, while not used by the scheduler, would be a hint or description of binding and containment for the program creation phase of job construction.

I'm actually losing sight of what the problem is with @hautreux's example, perhaps with a defaults section above to avoid replication, as you've said? This is a general list of resource shards, where a shard is a scalar of identical rspecs, with an optional jobspec.

defaults:
    walltime : 2h # applies to all shards
    jobspec:
        taskcount: 1
rshard:
    count : 4 # 4 shards of this type
    rspec :
        type: Node
        has>: GPU
rshard:
    count: 1 # 1 shard of this type
    rspec:
        type: Node
        count: 4
        has>: GPU

And so on...

trws commented 8 years ago

I'm just unclear on what the division is between the jobspec: and other properties of the shard. What value does it add to have that be a distinct section rather than a set of attributes? It's worth noting that, aside from names of certain attributes, I'm good with this as a config-file or batch-script specification style. I just would prefer not to require an extra level in the common case if it can be avoided.

That was actually my issue with the shards to begin with: it adds an extra level. That, and the fact that it loses the information the (admittedly a bit odd) "per" syntax provides about which resource type the user wants associated with their tasks, which would be useful for making better affinity decisions.
I'm pretty well convinced that this is worth it though; just quibbling over technicalities at this point.


grondo commented 8 years ago

@trws, good points. My argument was that the extra section seemed to increase clarity for the reader and author of the thing, that is all -- it also makes it obvious which section contains information that isn't (typically) used by the scheduler. Finally, it sets up a scheme for later, similar extensions that could be grouped with the specification, ignored by the scheduler, but perhaps picked up by other components in an instance (custom extensions even).

hautreux commented 8 years ago

I agree with @grondo on that. The idea was to annotate a resource-requirement description with things related to task management. This information could be filtered out before transmission to the scheduler.

Maybe we should call it taskspec instead of jobspec. In case one would like to have more than one task per shard with a particular affinity, he could ask for, for example:

rshard:
    ...
    taskspec:
        count : 4
        affinity : socket
        affinity-mode : block
        command : /bin/true
        envvars : [ ... ]

This can help represent both task and resource descriptions in the same related list, using what is necessary depending on the context.

(Sent from my phone...the result could be weird, my apologies...)

trws commented 8 years ago

So, I'm actually on the fence about this now, for a couple of reasons. When I'm looking at this, I'm thinking about what a user, defined as someone who is not a scheduler developer, would expect to write to get a given effect. The fact that it isn't used by the scheduler, which as far as the user is concerned is an implementation detail of a subset of the commands that would take these things, doesn't really make the argument for me. I also think part of my knee-jerk reaction was to the name of the thing rather than its function, since the entire spec is a job-spec, whereas this is a specification of the tasks to run on shards, which is a subset of the job or program spec (at least in my deranged mind...).

That said, having a section to define the workload to attach to given shards is starting to grow on me as I think about some of the more complicated use-cases. For example, a user that wants to run a distributed database alongside their job might do this:

defaults:
    walltime : 2h # applies to all shards
shards:
    - count : 4 # 4 shards of this type
      resource :
        type: Node
        has>: GPU
      task : some_exe
    - count : 2
      resource : Node * 2
      task :
        each : 1 # not sure if we need this as opposed to count, but having a way to say total vs each seems useful
        command : mongodb...

What I'm starting to really like about this extra grouping is that it could also be a list, so it might be easy to put the databases on the same allocation:

defaults:
    walltime : 2h
shards:
    - count : 4 # 4 shards of this type
      resources :
        type: Node
        has>: GPU
      tasks :
        - some_exe # runs on all shards
        - count : 2 # only run two, on some pair of nodes in this set
          command : mongodb...

A command-line syntax for this is likely to be a set of repeatable arguments, or list arguments, a bit like hydra takes for multi-tasks. Some of what I had before won't work; the tasks have to be tied to a full shard rather than a specific level in the shard, so that needs some re-thinking.
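As a purely hypothetical sketch, the example above might render as repeated argument groups like this (the -T flag for the task command is invented here; a -- separator along these lines is in fact proposed a few comments below):

flux submit --walltime 2h \
    -c 4 -r 'Node>GPU' -T some_exe \
 -- -c 2 -r 'Node*2' -T 'mongodb ...'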

trws commented 8 years ago

@hautreux, it looks like we had a conflict in-flight, but I pretty much completely agree. =)

grondo commented 8 years ago

Yeah, this seems to be going somewhere good. I agree on the naming, but I was thinking less about the names of the sections and more about generalized design...

BTW, I really like where this is going because it seems like with this approach you could fully specify arbitrarily complex "jobs" (including resources and sets of commands to run) in a single file, parts of which could be used by different subsystems within an instance of flux.

trws commented 8 years ago

Yeah, this is starting to feel really powerful... In fact, where we are, I think we can even do arbitrarily complex graph specs with arbitrary tasks. Here's a re-work of the examples, along with a crack at a command-line syntax. It's based on separating shards by --, mpirun-task-style, where if the first set has no command it is the defaults block.

CLI definitions:

Arguments to the above are either json blobs, or short-form syntax of the form:

resource-block : resource-spec[\[count\]][>[link-type]>]
link-type : default: has-a
resource-spec :   type
                | (resource-block)

1) 10 tasks, each task requires one core, there is no required grouping between cores: an MPI job with 10 ranks

#implicit shard for single shard
count: 10 
resource: Core

CLI : -c 10 -r Core

2) 10 tasks, each task requires four cores, all four cores for each task must be on the same node: an MPI+OpenMP job that will run a minimum of four threads per rank

count : 10
resource:
  type: Node #non-leaf, shared by default
  has>: Core[4] #exclusive by default

CLI : -c 10 -r 'Node>>Core[4]'

3) 10 tasks, each task requires one core, all must run on the same node, node need not be exclusively allocated: A shared-memory process-parallel job, this could be an MPI test job that uses shared-memory for in-node communication for example, more likely a server or daemon of some kind

# implicit count 1
rspec:
  type: Node  #non-leaf, shared by default
  has>:  Core[10]
task:
  count : 10

CLI : -t '{count : 10}' -r 'Node>>Core[10]'

4) 1 task, requiring four nodes: A distributed database, or non-flux-PMI-compatible MPI job with an mpiexec

resource: Node[4]

CLI : -r 'Node[4]'

5) 8 tasks, run on exactly two nodes: I include this because I think people will insist on doing it, though I think this is not what we should encourage

resource: Node[2]
task:
    count : 8

CLI : -r 'Node[2]' -n 8

6) 8 tasks, each requiring a socket, but with two tasks to each such socket

count: 4
resource: Socket 
task:
    count: 2

CLI : -c 4 -r 'Socket' -n 2

7) 10 tasks, each requiring one core, all of which will share a single allocation slot of a license

count: 10
job-resources: IntelLicense  #implicit ID matching type for single occurrence
resource:
  type: Core
  uses>: IntelLicense

CLI : -c 10 -j 'IntelLicense' -r 'Core>uses>IntelLicense'

8) 4 nodes on cluster X (or sockets, or cores)

resource:
  type: Node[4]
  child-of: cluster_X #shortcut for <has: cluster_X for readability, just syntactic sugar 

CLI : -r 'cluster_X>>Node[4]'

9) 4 nodes, each with a GPU

count: 4
resource:
  type: Node
  has>: GPU

CLI : -c 4 -r 'Node>>GPU'

10) 5 tasks, first four each use one node, last uses 4 nodes

shards:
  - count : 4
    resource: Node
  - resource: Node[4]

CLI : -c 4 -r 'Node' -- -r 'Node[4]'

hautreux commented 8 years ago

I share your sense that things are evolving in an interesting direction.

Having the task annotations is great; we could even think about extending that to other kinds of annotations (application, for example).

I have a few comments on your last post, @trws, that make me feel like there are still missing pieces.


When I first proposed the task count in the jobspec/taskspec, it was to have a way to ask for a kind of overcommit of tasks on shards, with one task per shard by default. Thus the shard count reflected the number of tasks to launch, each of them having its own set of resources as described for the shard.

In your example 3), you are using the task count to ask for 10 tasks on the same node. That is smart, but in my mind it is not the way it was supposed to be requested by default (though one could clearly do it this way too). The problem is that if I want to do it my way, I am stuck, because I do not have a way to request that a group of shards share one of their upper levels (a node, a lineboard, a switch, a cluster, ...). That is one of the initial issues you were fighting with.

The ''child-of'' pragma cannot be used here, as I do not want to ask for a particular node (or lineboard, cluster...) but want one node (or ...) to group them all. This makes me think that we might group shards together, with the capability to add a ''share-same'' pragma to ask for what I need.

Thus we could have things like, for example:

1)

shard-group:
    share-same : node
    shards:
        - count : 10
          resource : Core, Memory=4G
        - count : 1
          resource : Core[6], Memory=24G

To ask for an 11-task job running on one node, the first 10 tasks each using one core and 4G of RAM, the last one using 6 cores and 24G of RAM.

2)

shard-group:
    share-same : lineboard # or cluster, or lineboard being sw21_line13
    shard-group:
        share-same : node
        share-exclusively : yes
        shards:
            - count : 10
              resource : Core, Memory=4G
            - count : 1
              resource : Core[6], Memory=24G
    shard-group:
        share-same : node
        share-exclusively : no
        shards:
            - count : 10
              resource : Core, Memory=4G
            - count : 1
              resource : Core[6], Memory=24G

To ask for twice what we asked for in 1), but using two nodes sharing the same switch lineboard, one node allocated exclusively, the other not.

I am not sure that it is as powerful as links, but it eases the expression of "share the same parent of type X". The nested structures might be hard to manage. Let me know what you think of that.


In your example 7), I would put the IntelLicense in its own shard, with something like:

shards :
    - count : 10
      resource : Core
    - count : 1
      resource : IntelLicense
      task :
          count : 0

I may be missing the value of the "uses" information, but I find it easier to express all resource requirements the same way, don't you?


I am wondering if we may discuss in this issue things like temporality and/or dependencies among shards. What I mean is that in order to schedule malleable jobs (size changing over time), you may need shards having different walltimes and/or even different relative start times (or before/after/with dependencies). This could also be used to efficiently represent the traditional sequential-prolog/parallel-exec/sequential-epilog structure that most batch submissions use. One example that could be used with some mpispawn machinery, to boost things when necessary for the code:

shards:
    - count : 10
      walltime: 2h
      resource: Core
      task:
           command: my_mpi_app
    - count : 6
      relative_starttime: 90m
      walltime: 30m
      resource: Core
      task:
           command: my_mpi_booster_app

or the reverse, like a space shuttle launch having its boosters separate at some point:

shards:
    - count : 10
      walltime: 2h
      resource: Core
      task:
           command: my_mpi_app
    - count : 6
      walltime: 30m
      resource: Core
      task:
           command: my_mpi_booster_app_that_will_end_within_30m

I am pretty fine with the CLI proposal; it is easy enough, but might need additional grouping logic to express the "share-same" information. I suppose there would also be a "command [params]" argument for each shard definition (optional for a shard definition with 0 tasks attached, or when just requesting resources without any hints about tasks)?


Hum...sorry for the long post... hoping that it makes some sense...

trws commented 8 years ago

I'm not sure about handling grouping that way, @hautreux. I see what you mean about wanting to express it, but we lost this when we factored things out to get the task-id where it is. My older proposal had this by specifying what level of a shard tasks bind to, rather than what they share. Given what we currently have, I would probably want to do this with links and IDs. For example:

1)

shards:
    - resource: Node
      id: Node_1 # this is not a specific node, just an ID bound to whatever node is selected
    - count : 10
      resource: Node_1>>Core, Memory=4G
    - count : 1
      resource : Node_1>>Core[6], Memory=24G

Or perhaps to be able to nest resources and shards, to get the per-style functionality back? I'm going completely off the cuff here so this might make no sense, but perhaps something like this?

resource:
    type: Node
    shards:
        - count : 10
          resource: Core, Memory=4G
        - count : 1
          resource : Core[6], Memory=24G

Hmm... I kinda like that. Parsing it might not be fun, but we get arbitrary nesting and level specification back, and at least for me that's pretty easy to read...

Having multiple phases is something I think we were planning to handle through sub-instances, so it probably doesn't fit here, but being able to express that kind of information across programs is certainly useful. I'm not sure if it covers everything, but if you think of each phase as a program, we can specify the actual shape with the rest of this thread. So this is what I had been thinking with respect to general dependency specification. I think I mentioned it offhand at a recent meeting; basically it's based on allowing users to specify named chains of tasks, where each name forms a serialized queue of work through the system. For multi-user systems we would probably want a default user-unique prefix unless they override it. The advantage is that there's no need to specify a particular previous or current job or job id, but you can still express arbitrarily complex DAGs of jobs. i.e.

flux submit -d a run_a
flux submit -d in:a,out:b run_b
flux submit -d out:c run_c
flux submit -d a,in:c run_d
flux submit -d in:a,in:b run_e

These would execute according to this graph: (graph image not reproduced here)
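One possible reading of those streams, offered purely as an assumption since the graph image is not reproduced (taking a bare name to append the job to that named queue, in: to wait on a stream, and out: to feed it):

    run_a heads stream a
    run_b waits on a, feeds b
    run_c feeds c
    run_d queues behind run_a on stream a, and also waits on c
    run_e waits on both a and b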

hautreux commented 8 years ago

I am okay with the nested version using type instead of share too. It is very readable and seems to convey the required information.

Concerning the graph, that sounds good for orchestrating a workflow of jobs, but how do you handle a growing-then-shrinking job without specifying in advance the way some of its shards will pop up or disappear during its execution? I mean that the scheduler needs to know that in order to schedule such a beast efficiently, and to give or take shards at the right moment according to the ''roadmap''.


grondo commented 8 years ago

I like the workflow example, and before settling on anything here we should probably review the existing workflow schedulers and maybe support more than one style. I especially think that not having the traditional "depends on jobid" is a huge improvement in flexibility. (You might want to provide a way to set in:/out: parameters after the fact, though, so users could depend on already-submitted but non-annotated jobs. Or a default out: for all jobs, like the jobid or a dynamically generated adjective/noun pair like Docker uses [e.g. "restless_badger"], could be fun. The auto-out parameter could be done by a job-submit filter, so it's perhaps not necessary in the spec.)

grondo commented 8 years ago

Back to the overall job spec -- I just wanted to bring up the point about remembering our users ;-)

Not that any of these proposals so far are bad, but we should try to keep in mind that the main users of this system are not going to be computer science researchers :-) Therefore, we need to ensure that the spec we end up with is intuitive and easy to use, especially for the common cases (and by common cases I mean the common cases for our users). I have a feeling that some of the examples proposed above would not be well received by the (thus far imaginary) user community.

We might even want to end up with a presentation or white paper (not sure of the format) on the final spec and present this or share it with other stakeholders (i.e. the people that have to support the users) to get their buy-in on the result.

Obviously there is going to be some added complexity with the new (orders of magnitude) increase in expressiveness, but in our presentation and documentation we have to make a good case for that.

trws commented 8 years ago

That's a good point @grondo. I had been thinking that every job would have some kind of implicit reference so you could use it as the start of a chain, but I'm not sure what it should look like. As you say, job-id might be a good option, or we could have a generic "depend-wait" operator that takes one stream dependency id and makes another depend on it, so you could attach a job id to the head of a new chain without having to list the ID on every item in the chain.
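A hypothetical sketch of that operator (the depend-wait subcommand and its syntax are invented here purely for illustration; nothing like it exists yet):

flux submit run_a                    # suppose this returns job id 1234
flux depend-wait 1234 mychain        # hypothetical: head stream "mychain" with existing job 1234
flux submit -d in:mychain run_next   # chains behind 1234 without repeating the id elsewhere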

It would certainly be nice to support other models as well, but the reason I wanted to start with this one is that it can express any arbitrary directed acyclic graph, and it's impossible to submit jobs in such a way that their dependencies result in a deadlock: since a dependency can only attach behind work already submitted to a named stream, every edge points backward in submission order and no cycle can form. Of course, if a user sets up their dependencies in flux and then uses an out-of-band synchronization that causes a deadlock, we can't help that, but it would be impossible to cause one with just this specification style.

On the job spec, I agree we should make it as consistent and simple as possible. I'm starting to really like the idea of allowing a generally nested specification, because it (at least to me, please say so if you disagree) seems really consistent and readable. Take this crazy example, do you find it readable?

defaults:  #dynamically scoped defaults, everything under this level can inherit this
  walltime : 2h #default walltime
instance: #synonym for "shards" with a default command of flux-broker
    count: 20
    resources: Node
    programs: 
      - defaults:
            task:
                command: capacitor -c crazy-uq-commands.cmd
        instance: 
          - shards:
                count: 1
                resources: Node[1:] # one or more?
                #inherits task from the default
          - count: 1
            resources: Node
            task:
                command: monitor-capacitor-and-kill-if-annoying
      - depend: q1
        shards:
            walltime: 30m
            count: 15
            resources: Node>Core[4]
            task: 15-node-quad-thread-thing #bare string on task as
                                            #command line
      - depend: q1 #waits on previous listing, and blocks next with q1
        resources:
            type: Node
            shards:
                count: 10
                resources: Core[2]  # 2 cores each
                task: dual-thread-10-proc-single-node-thing

All this assumes is that there is an implicit decay such that resources can stand in for shards, which can stand in for programs, so you could specify everything, or as little as:

count: 5
resources: Node
task: 5-node-thing

Either way, I completely agree that we should get user feedback on all of this. If users don't want to use it, they won't until forced to. I would much rather they transition because they prefer it. =)

garlick commented 8 years ago

I've been putting off trying to ingest this thread. Has enough consensus been reached that it is feasible to summarize? If we're not ready for a draft RFC, maybe a summary could be placed in a new issue referencing this one and continue on from there?

lipari commented 8 years ago

Ok, I'll take a crack at a very high level summary...

trws commented 8 years ago

We might be ready to give an RFC a try; I'd be interested to hear others' thoughts on that point. The requirements haven't really changed, and we seem to have settled on some workable semantics, but the syntax and naming of constructs are still pretty fluid. It probably wouldn't happen until the weekend, but should I try to write up a draft RFC / spec that we can iterate on?

The dependency thing should be another ticket; I'll create that one now and flesh it out a little bit. It needs more options than I implied before, as @grondo pointed out, to handle post-run dependencies and a couple of other cases that have come up since.

hautreux commented 8 years ago

Summing up the ideas in an RFC and putting the dependency thing in another ticket are very good ideas. I'll try to review and comment on the RFC once it's started.

grondo commented 8 years ago

HashiCorp just released a cluster RM/scheduler called Nomad.

Though it's not really an HPC scheduler/RM, I did find it really interesting to read about Nomad. Specific to this issue, they have a "job specification" syntax that shares some of the ideas we've been throwing around here, so I thought I'd share it:

https://www.nomadproject.io/docs/jobspec/index.html

lipari commented 8 years ago

Very interesting. They make a good argument for using HCL as the job specification language.

trws commented 8 years ago

This is pretty interesting. It reminds me of GCL (the Google configuration language) referenced from the Borg paper, except that it isn't Turing complete. We could support something like that pretty easily, but I'm not sure I like having the machine-generated and user-generated syntaxes be different like that. It means the comments and other materials can't be preserved, because they can't be represented in the transport medium. Still, having a user-friendly syntax does make sense. It also interests me that the YACC syntax for their configuration language generates golang code... very cool.

One possible alternative here is to provide something that takes Lua code producing a table, much like the current RDL system, and emits compliant YAML/JSON for use with the system. Alternately, the "short-form" syntax could gain an assignment option and match the functionality pretty trivially, but then it might become incompatible with passing unmodified through a YAML parser. Not sure; what do you all think?

Also, I've been looking around at some of the others today and, unbeknownst to me previously, Kubernetes apparently uses a YAML/JSON interface oddly similar to the one we've been putting together. The extra interesting thing there is that they're using an API management, documentation, and specification system called Swagger for their REST API that provides discoverability and verification for all of it, with versioning against the API. We do need to have versioning in the system before we stabilize, so it might be good to look at something like that, or at least to set up something that can verify the validity of JSON/YAML/whatever for a given flux interface.


lipari commented 8 years ago

This issue provides a summary of Issue 354 and attempts to save the reader time by eliminating the need to read Issue 354 line by line.

The overall goal of the issue is to design a means to specify the requirements of a Flux job. These requirements would be submitted to Flux, scheduled by a scheduler, and ultimately result in a Flux instance running one or more Flux programs.

The precursors to this discussion are terms defined in:

The general requirements are to specify in a generic and versatile way a request for resources. Traditionally this has been a request for nodes, cores and memory and appears as options to the various batch submission commands. A batch scheduler typically finds and selects these resources and creates a resource allocation.

Flux will by nature include a vastly enhanced collection of resources as well as elaborate associations of these resources. The requirements for specifying such a package include aspects of:

A number of terms were considered at various points in the discussion to describe the smallest schedulable resource unit:

A big part of Issue 354 explores the need to include task counts and task-to-resource mapping in the Flux job request. @trws argues that it is critical that the Flux job request include tasks in the specification. He notes that even a simple request for nodes needs to specify the number of brokers required and how many brokers to run on a node. The contrasting view is that the scheduler need only be concerned with the schedulable units, or resource shards, and that any task association is more a commentary that belongs in the context of the program specification (as defined in RFC 8).

We discussed how to specify task mapping:

We discussed the means to designate resource associations and compositions: “has-a”, and “uses” relationships. We weighed sparse, graph-inspired, directional link depictions against more verbose but more intuitive options, such as indentation.

We also considered additional aspects such as:

This brought us to notions of dependencies and how (and where) to support MPMD.

As complex as this quickly becomes, we sought to devise the most versatile means for specifying resources using formats that are intuitive and easy for a user to master, while minimizing the opportunity for ambiguity.

We gave thought to leveraging existing solutions such as HCL and Kubernetes.

As more and more agreement was reached, @trws wrote and submitted the RFC 10 PR to embody the culmination of the ideas. And that proposal generated further discussion in comments attached to that PR.

garlick commented 7 years ago

Although there is a lot of good info in this issue, I think the work has migrated over to the rfc project, so closing. Please reopen if you disagree!