flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

design shell plugin for file broadcast #3744

Closed garlick closed 7 months ago

garlick commented 3 years ago

Following up on a coffee time discussion with @jameshcorbett

There is a need for a C API for copying file(s) to a job. (What other requirements are there?)

Some notes from the discussion

Additional thoughts:

garlick commented 3 years ago

A couple more thoughts:

I hate to even mention this because I really like the idea of having this capability in Flux, but I think this problem could also be solved with mpifileutils or similar with "alloc bypass" from #3740, copying R from the target job. Two advantages of that approach are 1) it uses RDMA, and 2) it is portable.

garlick commented 3 years ago

Following up on the coffee call. It doesn't look like mpifileutils dbcast works like I thought it did. It appears to read stripes of a file from all ranks of a parallel job, not one rank. This is what @jameshcorbett was saying and I just wasn't getting it. Sorry about that!

jameshcorbett commented 3 years ago

Just something I was thinking about: Slurm's sbcast works on both job IDs and step IDs. The proposed implementation, as a job shell plugin, would work only on jobs and wouldn't have a way to broadcast a file to every node in a Flux instance. But you could go up a level in the Flux hierarchy and broadcast the file at that level, to the job that is the sub-instance. There wouldn't be a way to broadcast a file across a top-level Flux instance but I can't think of any use-cases for that.

To replicate the sbcast example in Flux:

$ cat my.job
#!/bin/bash
sbcast my.prog /tmp/my.prog
srun /tmp/my.prog

$ sbatch --nodes=8 my.job
srun: jobid 12345 submitted

You would need to be able to get the job ID of the encapsulating instance and the URI of the system instance. A little awkward, maybe...

grondo commented 3 years ago

The proposed implementation, as a job shell plugin, would work only on jobs and wouldn't have a way to broadcast a file to every node in a Flux instance

Most Flux instances are also jobs, but I think I understand what you are saying here: You can't broadcast a file to all nodes of your single-user enclosing instance (i.e. in most cases, your batch job) from within the instance.

You would need to be able to get the job ID of the encapsulating instance and the URI of the system instance. A little awkward, maybe...

Actually, this may not be too bad. Within an instance started under Flux, the environment variable FLUX_JOB_ID will be set to the jobid of the current instance. The flux(1) command driver also has a --parent option which uses the URI of the parent instead of the current instance. So, if flux bcast is the command to broadcast a file to all nodes of a job, your batch script could use:

flux --parent bcast $FLUX_JOB_ID /tmp/my.prog

Better yet, if a JOBID isn't provided with flux bcast, maybe the utility could assume it is meant to run against the current job and will automatically use the current FLUX_JOB_ID and grab the parent-uri from the enclosing instance so it will work similarly to sbcast:

flux bcast /tmp/my.prog
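
The fallback described above could be sketched as a small shell function. This is only an illustration under stated assumptions: `flux bcast` is the command proposed in this thread, not an existing one, and the function name `bcast_cmdline` is hypothetical. The function only prints the command line it would choose and does not run anything.

```shell
#!/bin/bash
# Hypothetical sketch: choose the command line for the proposed `flux bcast`.
# Usage: bcast_cmdline [JOBID] FILE
bcast_cmdline() {
    case $# in
        1)
            # No JOBID given: assume the target is the current instance,
            # which is a job in the parent, so use FLUX_JOB_ID and --parent.
            if [ -z "${FLUX_JOB_ID:-}" ]; then
                echo "bcast_cmdline: no JOBID given and FLUX_JOB_ID is not set" >&2
                return 1
            fi
            echo "flux --parent bcast $FLUX_JOB_ID $1"
            ;;
        2)
            # Explicit JOBID: assume it names a job in the current instance.
            echo "flux bcast $1 $2"
            ;;
        *)
            echo "usage: bcast_cmdline [JOBID] FILE" >&2
            return 2
            ;;
    esac
}
```

With `FLUX_JOB_ID=1234` in the environment, `bcast_cmdline /tmp/my.prog` prints `flux --parent bcast 1234 /tmp/my.prog`, which matches the explicit form shown earlier.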

jameshcorbett commented 3 years ago

Most Flux instances are also jobs, but I think I understand what you are saying here.

The problem with all this infinitely hierarchical stuff is that it makes everything hard to talk about :(

Actually, this may not be too bad.

Great, I figured that there would be good ways of talking to the parent instance, but I didn't know what they were (or if they had already been implemented). I also really like your idea of the missing JOBID assumption.

jameshcorbett commented 3 years ago

Better yet, if a JOBID isn't provided with flux bcast, maybe the utility could assume it is meant to run against the current job and will automatically use the current FLUX_JOB_ID and grab the parent-uri from the enclosing instance.

I like the idea of looking for FLUX_JOB_ID, but grabbing the parent-uri does have one drawback. If you executed something like flux mini run ... bash -c "flux bcast /tmp/my.prog; /tmp/my.prog", FLUX_JOB_ID would be set to the bash job (rather than the job ID of the current instance), so the combination of parent-uri and FLUX_JOB_ID would be all wrong.

Just a potential trade-off to be aware of. I can't think of any possible confusions from letting the job ID be implicit, though.

grondo commented 3 years ago

If you executed something like flux mini run ... bash -c "flux bcast /tmp/my.prog; /tmp/my.prog", FLUX_JOB_ID would be set to the bash job (rather than the job ID of the current instance)

Good point!

Though running flux bcast in this way should perhaps be avoided because:

  1. If your flux mini run specifies multiple tasks, you'll be running flux bcast multiple times simultaneously
  2. If your flux mini run specifies only one task, then you are running flux bcast to copy a file to itself on the local node

It would be nice if we had a way to detect this situation and issue a meaningful error. :thinking:

If you wanted to run flux bcast as a job, e.g. to use it as part of a workflow, then you could use the FLUX_JOB_ID from the environment at the time of submission, and specifically use --parent, though that isn't so user-friendly:

flux mini submit flux --parent bcast $FLUX_JOB_ID 

Sounds like it would be good if we had a way to determine whether the current process is in an initial program (i.e. a batch script) or is part of a job. In the second case you could maybe issue an error if JOBID isn't provided.
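
A minimal sketch of that check, assuming the context can somehow be determined. How to detect it is exactly the open question here, so the sketch takes it as an explicit argument; the function name `check_bcast_context` and the context labels `initial-program` and `job-task` are invented for illustration.

```shell
#!/bin/bash
# Hypothetical sketch of the proposed guard for `flux bcast`.
# Usage: check_bcast_context CONTEXT [JOBID]
#   CONTEXT: "initial-program" (e.g. a batch script) or "job-task"
check_bcast_context() {
    local context=$1 jobid=${2:-}
    # Inside a job task, FLUX_JOB_ID names the task's own job, so an
    # implicit JOBID would point at the wrong target: require it explicitly.
    if [ "$context" = "job-task" ] && [ -z "$jobid" ]; then
        echo "flux bcast: running as part of a job; JOBID must be provided" >&2
        return 1
    fi
    return 0
}
```

Under this sketch, `check_bcast_context initial-program` succeeds, while `check_bcast_context job-task` fails with an error unless a JOBID is also passed.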

jameshcorbett commented 2 years ago

@JaeseungYeom do you think you could leverage some of your DYAD work for this?

garlick commented 7 months ago

Closing this. We can open issues against flux-archive(1) if there are still things missing.