Open SteVwonder opened 5 years ago
As a general point, I think it's probably fine to build on RFC 14 in other RFCs, e.g. to define specific attributes.
Also somewhat off topic, but those mustache templates from wreck are something we might want to consider carrying forward into the new design.
those mustache templates from wreck
Ooooo. Those are really nice. I was unaware of them.
Updated example:
tasks
- command: md-simulation.exe
slot: compute
count: {total: 1}
attributes:
stdout: /path/to/{{cmd}}-{{taskid}}.out
stderr: /path/to/{{cmd}}-{{taskid}}.err
- command: database.exe
slot: data
count: {total: 1}
attributes:
stdout: /path/to/db-{{id}}.out
stderr: /path/to/db-{{id}}.err
Expanding the scope of this issue a bit...
While working on the srun/sbatch flux-jobspec subcommands, I ran across SLURM's distribution
flag, which supports controlling distribution at multiple levels. In the current RFC14, we have distribution limited to a string. I wonder if it would be worth expanding that out to support dictionaries. For example:
tasks:
- attributes:
distribution:
nodes: block
sockets: fcyclic
cores: cyclic
A single string would, of course, still work. An example using SLURM's syntax:
tasks:
- attributes:
distribution: block:fcyclic:cyclic
Isn't the distribution
key specified to be at the same level as attributes
, i.e. directly in the tasks
dictionary? Regardless, I agree that being more flexible in the attributes sections of the tasks
portion of jobspec is probably the right way to go.
If you think about tasks: attributes
and tasks: distribution
they are input data to the job shell not the Flux instance, so I'm not sure how much we want to encode these sections into an RFC, as this may limit flexibility of alternate job shell implementations.
Then again, if the contents of these sections are left completely open ended, then we will need a scheme to document and validate them at runtime. That is, probably job shell as well as scheduler will have to hooks for validators into the job submission workflow.
Isn't the distribution key specified to be at the same level as attributes, i.e. directly in the tasks dictionary?
Uhh, sorry, I misread your example. What you have is indeed correct and you can ignore my first sentence here.
but where do we want to put the paths for stdout and stderr redirection? My initial thought was under the attributes key for each task:
It would be best if there was some place to put a generic output
dictionary that applied to all tasks, and allow it to be overridden in the per-task attributes
dictionary. In fact, I kind of wish the tasks
section had a place for generic attributes.
In any case, specifying output as a dictionary will allow more flexibility, e.g. some job shells may offer buffered vs unbuffered I/O, pty, etc, so stdout
stderr
keys will not suffice.
If you think about tasks: attributes and tasks: distribution they are input data to the job shell not the Flux instance, so I'm not sure how much we want to encode these sections into an RFC, as this may limit flexibility of alternate job shell implementations.
That makes sense to me for the sake of flexibility, but I think there should be documentation somewhere, even if it is just a wiki entry on GitHub or a section in the man page.
Then again, if the contents of these sections are left completely open ended, then we will need a scheme to document and validate them at runtime. That is, probably job shell as well as scheduler will have to hooks for validators into the job submission workflow.
At least in the case of distributions
, I already have a validation function included in the flux-jobspec srun
subcommand, so we can leverage that. And in general, validating the format of certain keys should be pretty feasible. Validation of the mustache substitution could be trickier. Do we just scan for {{\w+}}
and throw a warning if the job-shell doesn't support mustache substitution (or the particular tag within the mustaches)?
That makes sense to me for the sake of flexibility, but I think there should be documentation somewhere, even if it is just a wiki entry on GitHub or a section in the man page.
Yeah, I was thinking, like arguments for any command, this could be documented in man pages and online documentation.
Validation of the mustache substitution could be trickier. Do we just scan for {{\w+}} and throw a warning if the job-shell doesn't support mustache substitution (or the particular tag within the mustaches)?
Perhaps part of the submission workflow will have to have a hook for the targeted job shell to have a chance to validate the submitted tasks
section. At first I thought this could be part of the job-manager workcrew, however there are multiple problems with that approach. Maybe a hook in the flux-job submit
plumbing command could run the job shell in validation mode?
It does seem like this kind of validation should be done before submission as you've done with flux jobspec srun
and not part of the internal instance validation. Now that I write that, it feels like I'm stating the obvious.
Yeah, I was thinking, like arguments for any command, this could be documented in man pages and online documentation.
👍
At first I thought this could be part of the job-manager workcrew, however there are multiple problems with that approach.
Hmmm, that was my first thought as to the best place to do a full validation (with other validations occurring at submission and runtime). What problems are you anticipating with that approach?
What problems are you anticipating with that approach?
Currently, a user is allowed to run any old job shell, including one they've modified in their own working tree for example. Even if the flux-broker (running as unprivileged user flux
) had permission to execute the user's job shell, doing so would be a security hazard (you'd allow the user to run arbitrary code as user flux
)
doing so would be a security hazard (you'd allow the user to run arbitrary code as user flux)
Oh. Great point!
Maybe a hook in the flux-job submit plumbing command could run the job shell in validation mode?
👍 This seems like a good option.
Not sure if this is something we want to document in RFC14 or leave up to individual implementations, but where do we want to put the paths for stdout and stderr redirection? My initial thought was under the
attributes
key for each task:If we decide that this is too low of a level of detail for RFC14, we should document the specific attributes supported by flux-core somewhere in the flux-core repo.