design shell plugin for file broadcast

garlick commented 3 years ago

Following up on a coffee time discussion with @jameshcorbett

There is a need for a C API for copying file(s) to a job. (What other requirements are there?)

Some notes from the discussion

a shell plugin could implement a distributed file copy service
plugins can register job-specific service "methods"
plugins can also get plugin-specific options from the jobspec attributes (which can be set on the flux mini command line)
plugins can address RPCs to shell ranks within the job
see the PMI plugin for an example of a virtual TBON (in this case to gather PMI keys)
the job-specific service name can be found in the job eventlog (see MPIR plugin for an example)
files should be sent in chunks (say 4K) to avoid head-of-line blocking in the broker
it may be handy to use the per-job tmpdir established by the tmpdir job plugin (FLUX_JOB_TMPDIR environment variable should be set in the job)
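
To make the chunking note concrete, here is a minimal sketch of reading a file in fixed-size pieces. It uses plain stdio only and makes no Flux calls. The names `foreach_chunk`, `chunk_cb_f`, and `sum_len_cb` are hypothetical, and `CHUNK_SIZE` is just the "say 4K" from the notes, not a tuned value. In a real plugin the callback would presumably be where each chunk gets handed to an RPC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define CHUNK_SIZE 4096 /* "say 4K" from the notes; an assumption, not tuned */

/* Hypothetical callback: receives one chunk and its file offset.
 * Return 0 to continue, nonzero to abort the walk. */
typedef int (*chunk_cb_f)(const void *buf, size_t len, long offset, void *arg);

/* Read 'path' in CHUNK_SIZE pieces and pass each piece to 'cb'.
 * Returns the number of chunks delivered, or -1 on open, read,
 * or callback failure. */
long foreach_chunk(const char *path, chunk_cb_f cb, void *arg)
{
    unsigned char buf[CHUNK_SIZE];
    long offset = 0;
    long count = 0;
    size_t n;
    FILE *fp;

    assert(cb != NULL);
    if (!(fp = fopen(path, "rb")))
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
        if (cb(buf, n, offset, arg) != 0) {
            fclose(fp);
            return -1;
        }
        offset += (long)n;
        count++;
    }
    if (ferror(fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return count;
}

/* Example callback: add up the bytes seen, standing in for "send this chunk". */
int sum_len_cb(const void *buf, size_t len, long offset, void *arg)
{
    (void)buf;
    (void)offset;
    *(size_t *)arg += len;
    return 0;
}
```

Keeping the per-chunk work behind a callback means the same loop could feed a local write, a send to a peer, or both.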

Additional thoughts:

src/common/libutil/kary.h provides some helper functions for determining virtual TBON peers, etc.
Use a streaming RPC to send the chunks to the destination.
Note that the shells are only running while the job is running, so this wouldn't work for stage-in while the job is still pending
The service could actually be the same on each shell rank, with the semantics of "copy to local file and TBON subtree".
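
As a rough illustration of the "copy to local file and TBON subtree" semantics, the sketch below computes peers in a k-ary tree rooted at shell rank 0. These are standalone helpers with hypothetical names (`tbon_parent`, `tbon_child`, `tbon_subtree_size`), not the actual kary.h interface, which should be consulted for the real function names and signatures.

```c
#include <assert.h>
#include <stdint.h>

#define TBON_NONE UINT32_MAX /* hypothetical sentinel: "no such peer" */

/* Parent of 'rank' in a k-ary tree rooted at rank 0. */
uint32_t tbon_parent(uint32_t k, uint32_t rank)
{
    assert(k > 0);
    return rank == 0 ? TBON_NONE : (rank - 1) / k;
}

/* j-th child (0 <= j < k) of 'rank' in a tree of 'size' ranks,
 * or TBON_NONE if that child does not exist. */
uint32_t tbon_child(uint32_t k, uint32_t size, uint32_t rank, uint32_t j)
{
    uint64_t c = (uint64_t)k * rank + j + 1; /* 64-bit to avoid overflow */

    if (j >= k || c >= size)
        return TBON_NONE;
    return (uint32_t)c;
}

/* Number of ranks in the subtree rooted at 'rank', including 'rank'.
 * This is how many shells one "copy to self and subtree" request covers. */
uint32_t tbon_subtree_size(uint32_t k, uint32_t size, uint32_t rank)
{
    uint32_t total = 1;
    uint32_t j, c;

    assert(k > 0);
    if (rank >= size)
        return 0;
    for (j = 0; j < k; j++) {
        c = tbon_child(k, size, rank, j);
        if (c == TBON_NONE)
            break; /* children are numbered contiguously, so stop here */
        total += tbon_subtree_size(k, size, c);
    }
    return total;
}
```

With helpers like these, every shell rank could run the same handler: write the chunk locally, then forward it to each existing child.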

garlick commented 3 years ago

A couple more thoughts:

Need to handle open errors on remote files (e.g. file exists)
Need to handle write errors to remote files (e.g. file system full).
Errors should be propagated back to the API
A flux_future_t based interface is desirable to enable reactive programming
May want a way to pass open flags into the API (e.g. O_TRUNC)
API should have a way to select a subset of shell ranks as targets
API should have a way to select any destination path (e.g. escape FLUX_JOB_TMPDIR if desired)
Idea: option to let service select the destination path and set a configurable environment variable to point to it
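
To sketch how the open and write error cases above might be surfaced on the receiving side, here is a small POSIX-only example. `dest_open` and `dest_write_chunk` are hypothetical helper names, and returning a negative errno is just one possible convention for carrying the error back toward the caller. How it would actually be attached to an RPC response or future is not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open the destination for writing, creating it if
 * needed. 'extra_flags' lets the caller pass open flags such as O_TRUNC
 * or O_EXCL. Returns a file descriptor, or -errno (e.g. -EEXIST). */
int dest_open(const char *path, int extra_flags, mode_t mode)
{
    int fd = open(path, O_WRONLY | O_CREAT | extra_flags, mode);

    return fd < 0 ? -errno : fd;
}

/* Hypothetical helper: write one chunk at 'offset', retrying short writes.
 * Returns 0 on success or -errno (e.g. -ENOSPC when the file system is full). */
int dest_write_chunk(int fd, const void *buf, size_t len, off_t offset)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);

        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted before writing anything; retry */
            return -errno;
        }
        if (n == 0)
            return -EIO; /* no progress; avoid looping forever */
        p += n;
        len -= (size_t)n;
        offset += n;
    }
    return 0;
}
```

Using `pwrite` with an explicit offset means chunks do not have to arrive in order, which may or may not matter depending on how the chunks are delivered.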

I hate to even mention this because I really like the idea of having this capability in Flux, but I think this problem could also be solved with mpifileutils or similar with "alloc bypass" from #3740, copying R from the target job. Two advantages of that approach are 1) uses RDMA, and 2) portability.

garlick commented 3 years ago

Following up on coffee call. It doesn't look like mpifileutils dbcast works like I thought it did. It appears to read stripes of a file from all ranks of a parallel job, not one rank. This is what @jameshcorbett was saying and I just wasn't getting it. Sorry about that!

jameshcorbett commented 3 years ago

Just something I was thinking about: Slurm's sbcast works on both job IDs and step IDs. The proposed implementation, as a job shell plugin, would work only on jobs and wouldn't have a way to broadcast a file to every node in a Flux instance. But you could go up a level in the Flux hierarchy and broadcast the file at that level, to the job that is the sub-instance. There wouldn't be a way to broadcast a file across a top-level Flux instance, but I can't think of any use cases for that.

To replicate the sbcast example in Flux:

$ cat my.job
#!/bin/bash
sbcast my.prog /tmp/my.prog
srun /tmp/my.prog

$ sbatch --nodes=8 my.job
srun: jobid 12345 submitted

You would need to be able to get the job ID of the encapsulating instance and the URI of the system instance. A little awkward, maybe...

grondo commented 3 years ago

The proposed implementation, as a job shell plugin, would work only on jobs and wouldn't have a way to broadcast a file to every node in a Flux instance

Most Flux instances are also jobs, but I think I understand what you are saying here: You can't broadcast a file to all nodes of your single-user enclosing instance (i.e. in most cases, your batch job) from within the instance.

You would need to be able to get the job ID of the encapsulating instance and the URI of the system instance. A little awkward, maybe...

Actually, this may not be too bad. Within an instance started under Flux, the environment variable FLUX_JOB_ID will be set to the jobid of the current instance. The flux(1) command driver also has a --parent option which uses the URI of the parent instead of the current instance. So, if flux bcast is the command to broadcast a file to all nodes of a job, your batch script could use:

flux --parent bcast $FLUX_JOB_ID /tmp/my.prog

Better yet, if a JOBID isn't provided with flux bcast, maybe the utility could assume it is meant to run against the current job and will automatically use the current FLUX_JOB_ID and grab the parent-uri from the enclosing instance so it will work similarly to sbcast:

flux bcast /tmp/my.prog
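
A minimal sketch of that defaulting rule, for illustration only: `flux bcast` is a proposed command in this thread, and `struct bcast_target` and `bcast_resolve_target` are hypothetical names. The only thing taken from the discussion is that FLUX_JOB_ID is read from the environment when no JOBID is given, and that the parent instance is then targeted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical result of resolving the command line. */
struct bcast_target {
    const char *jobid; /* job to broadcast to */
    int use_parent;    /* nonzero: contact the enclosing (parent) instance */
};

/* Decide which job to target.
 * - If a JOBID argument was given, use it and honor the --parent flag as is.
 * - Otherwise fall back to FLUX_JOB_ID and imply the parent instance,
 *   so the bare form behaves like sbcast inside a batch script.
 * Returns 0 on success, -1 if no job ID could be determined. */
int bcast_resolve_target(const char *jobid_arg,
                         int parent_flag,
                         struct bcast_target *out)
{
    const char *env;

    assert(out != NULL);
    if (jobid_arg && *jobid_arg) {
        out->jobid = jobid_arg;
        out->use_parent = parent_flag;
        return 0;
    }
    env = getenv("FLUX_JOB_ID");
    if (!env || !*env)
        return -1;
    out->jobid = env;
    out->use_parent = 1;
    return 0;
}
```

The -1 case is where a meaningful error could be issued when neither a JOBID nor FLUX_JOB_ID is available.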

jameshcorbett commented 3 years ago

Most Flux instances are also jobs, but I think I understand what you are saying here.

The problem with all this infinitely hierarchical stuff is that it makes everything hard to talk about :(

Actually, this may not be too bad.

Great, I figured that there would be good ways of talking to the parent instance, but I didn't know what they were (or if they had already been implemented). I also really like your idea of the missing JOBID assumption.

jameshcorbett commented 3 years ago

Better yet, if a JOBID isn't provided with flux bcast, maybe the utility could assume it is meant to run against the current job and will automatically use the current FLUX_JOB_ID and grab the parent-uri from the enclosing instance.

I like the idea of looking for FLUX_JOB_ID, but grabbing the parent-uri does have one drawback. If you executed something like flux mini run ... bash -c "flux bcast /tmp/my.prog; /tmp/my.prog", FLUX_JOB_ID would be set to the bash job (rather than the job ID of the current instance), so the combination of parent-uri and FLUX_JOB_ID would be all wrong.

Just a potential trade-off to be aware of. I can't think of any possible confusions from letting the job ID be implicit, though.

grondo commented 3 years ago

If you executed something like flux mini run ... bash -c "flux bcast /tmp/my.prog; /tmp/my.prog", FLUX_JOB_ID would be set to the bash job (rather than the job ID of the current instance)

Good point!

Though running flux bcast in this way should perhaps be avoided because:

If your flux mini run specifies multiple tasks you'll be running flux bcast multiple times simultaneously
If your flux mini run only specifies one task then you are running flux bcast to copy a file to itself on the local node

It would be nice if we had a way to detect this situation and issue a meaningful error. :thinking:

If you wanted to run flux bcast as a job, e.g. to use it as part of a workflow, then you could use the FLUX_JOB_ID from the environment at the time of submission, and specifically use --parent, though that isn't so user-friendly:

flux mini submit flux --parent bcast $FLUX_JOB_ID

Sounds like it would be good if we had a way to determine whether the current process is in an initial program (i.e. a batch script) or part of a job. In the second case you could maybe issue an error if JOBID isn't provided.

jameshcorbett commented 2 years ago

@JaeseungYeom do you think you could leverage some of your DYAD work for this?

garlick commented 7 months ago

Closing this. We can open issues against flux-archive(1) if there are still things missing.

flux-framework / flux-core

design shell plugin for file broadcast #3744