flux-framework / rfc

Flux RFC project
https://flux-framework.readthedocs.io/projects/flux-rfc/
7 stars 13 forks source link

new rfc: tasks attributes #146

Open SteVwonder opened 5 years ago

SteVwonder commented 5 years ago

Not sure if this is something we want to document in RFC14 or leave up to individual implementations, but where do we want to put the paths for stdout and stderr redirection? My initial thought was under the attributes key for each task:

tasks
- command: md-simulation.exe
  slot: compute
  count: {total: 1}
  attributes:
    stdout: /path/to/sim.out
    stderr: /path/to/sim.err
- command: database.exe
  slot: data
  count: {total: 1}
  attributes:
    stdout: /path/to/db.out
    stderr: /path/to/db.err

If we decide that this is too low of a level of detail for RFC14, we should document the specific attributes supported by flux-core somewhere in the flux-core repo.

garlick commented 5 years ago

As a general point, I think it's probably fine to build on RFC 14 in other RFCs, e.g. to define specific attributes.

Also somewhat off topic, but those mustache templates from wreck are something we might want to consider carrying forward into the new design.

SteVwonder commented 5 years ago

those mustache templates from wreck

Ooooo. Those are really nice. I was unaware of them.

Updated example:

tasks
- command: md-simulation.exe
  slot: compute
  count: {total: 1}
  attributes:
    stdout: /path/to/{{cmd}}-{{taskid}}.out
    stderr: /path/to/{{cmd}}-{{taskid}}.err
- command: database.exe
  slot: data
  count: {total: 1}
  attributes:
    stdout: /path/to/db-{{id}}.out
    stderr: /path/to/db-{{id}}.err
SteVwonder commented 5 years ago

Expanding the scope of this issue a bit... While working on the srun/sbatch flux-jobspec subcommands, I ran across SLURM's distribution flag, which supports controlling distribution at multiple levels. In the current RFC14, we have distribution limited to a string. I wonder if it would be worth expanding that out to support dictionaries. For example:

tasks:
  - attributes:
      distribution:
        nodes: block
        sockets: fcyclic
        cores: cyclic

A single string would, of course, still work. An example using SLURM's syntax:

tasks:
  - attributes:
      distribution: block:fcyclic:cyclic
grondo commented 5 years ago

Isn't the distribution key specified to be at the same level as attributes, i.e. directly in the tasks dictionary? Regardless, I agree that being more flexible in the attributes sections of the tasks portion of jobspec is probably the right way to go.

If you think about tasks: attributes and tasks: distribution they are input data to the job shell not the Flux instance, so I'm not sure how much we want to encode these sections into an RFC, as this may limit flexibility of alternate job shell implementations.

Then again, if the contents of these sections are left completely open ended, then we will need a scheme to document and validate them at runtime. That is, probably job shell as well as scheduler will have to hooks for validators into the job submission workflow.

grondo commented 5 years ago

Isn't the distribution key specified to be at the same level as attributes, i.e. directly in the tasks dictionary?

Uhh, sorry, I misread your example. What you have is indeed correct and you can ignore my first sentence here.

grondo commented 5 years ago

but where do we want to put the paths for stdout and stderr redirection? My initial thought was under the attributes key for each task:

It would be best if there was some place to put a generic output dictionary that applied to all tasks, and allow it to be overridden in the per-task attributes dictionary. In fact, I kind of wish the tasks section had a place for generic attributes.

In any case, specifying output as a dictionary will allow more flexibility, e.g. some job shells may offer buffered vs unbuffered I/O, pty, etc, so stdout stderr keys will not suffice.

SteVwonder commented 5 years ago

If you think about tasks: attributes and tasks: distribution they are input data to the job shell not the Flux instance, so I'm not sure how much we want to encode these sections into an RFC, as this may limit flexibility of alternate job shell implementations.

That makes sense to me for the sake of flexibility, but I think there should be documentation somewhere, even if it is just a wiki entry on GitHub or a section in the man page.

Then again, if the contents of these sections are left completely open ended, then we will need a scheme to document and validate them at runtime. That is, probably job shell as well as scheduler will have to hooks for validators into the job submission workflow.

At least in the case of distributions, I already have a validation function included in the flux-jobspec srun subcommand, so we can leverage that. And in general, validating the format of certain keys should be pretty feasible. Validation of the mustache substitution could be trickier. Do we just scan for {{\w+}} and throw a warning if the job-shell doesn't support mustache substitution (or the particular tag within the mustaches)?

grondo commented 5 years ago

That makes sense to me for the sake of flexibility, but I think there should be documentation somewhere, even if it is just a wiki entry on GitHub or a section in the man page.

Yeah, I was thinking, like arguments for any command, this could be documented in man pages and online documentation.

Validation of the mustache substitution could be trickier. Do we just scan for {{\w+}} and throw a warning if the job-shell doesn't support mustache substitution (or the particular tag within the mustaches)?

Perhaps part of the submission workflow will have to have a hook for the targeted job shell to have a chance to validate the submitted tasks section. At first I thought this could be part of the job-manager workcrew, however there are multiple problems with that approach. Maybe a hook in the flux-job submit plumbing command could run the job shell in validation mode?

It does seem like this kind of validation should be done before submission as you've done with flux jobspec srun and not part of the internal instance validation. Now that I write that, it feels like I'm stating the obvious.

SteVwonder commented 5 years ago

Yeah, I was thinking, like arguments for any command, this could be documented in man pages and online documentation.

👍

At first I thought this could be part of the job-manager workcrew, however there are multiple problems with that approach.

Hmmm, that was my first thought as to the best place to do a full validation (with other validations occurring at submission and runtime). What problems are you anticipating with that approach?

grondo commented 5 years ago

What problems are you anticipating with that approach?

Currently, a user is allowed to run any old job shell, including one they've modified in their own working tree for example. Even if the flux-broker (running as unprivileged user flux) had permission to execute the user's job shell, doing so would be a security hazard (you'd allow the user to run arbitrary code as user flux)

SteVwonder commented 5 years ago

doing so would be a security hazard (you'd allow the user to run arbitrary code as user flux)

Oh. Great point!

Maybe a hook in the flux-job submit plumbing command could run the job shell in validation mode?

👍 This seems like a good option.