flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

CTI support: shipping files to a running job #3631

Closed ardangelo closed 7 months ago

ardangelo commented 3 years ago

To support running a tool alongside a job, CTI ships the tool's binaries and other support files to the nodes assigned to the user's job. The tool binaries are then started on those nodes and begin analyzing the job.

As discussed in https://github.com/flux-framework/flux-core/issues/3291, Flux would need a plugin to support shipping files after a job has been started.

Furthermore, a job-specific temporary directory that is cleaned up on job exit would be very useful in this context: it provides a known path on each node to ship files to, one that is reliably cleaned up.

jameshcorbett commented 2 years ago

Status of CTI, from @ardangelo :

We have implemented all the rest of CTI functionality using the Flux API, but are still relying on SSH for the file transfer functionality. While this works in testing, it requires SSH access to compute nodes. On some of our internal systems, and nearly all customer systems, SSH access to nodes is restricted or disabled entirely for normal users. So while we can verify our Flux implementation for CTI, we won't be able to actually use it on systems where SSH access to nodes is restricted. Implementing runtime file shipping in Flux will remove the SSH requirement for CTI's Flux implementation.

Similarly from an earlier email:

CTI shipped with preliminary Flux support in PE 21.12, our December release. As is, it still requires getting SSH keys set up to ship files to nodes, we are just waiting on support for file shipping for running jobs in the Flux API. Once we have that, we'll be able to strip out the SSH key requirement and have Flux fully implemented in CTI. You should still be able to use CTI with Flux for now, but you would need passwordless SSH key access set up to compute nodes for the running job to ship files out correctly.

Motivation for the request, from @ardangelo :

Regarding the file transfers: the point is to make files accessible to the job and its tool daemons... Some of the systems we run on do not have any form of cross-mounted storage to share data between the login / compute nodes, and file shipping allows us to run on these systems. Even on systems that do have shared storage, we don't want to have to run a heuristic to detect the location every time. So, shipping support files to job-specific temporary directories is the way we've decided to go.

I think a function like this would satisfy the requirements.


/*
 * Broadcast the contents of file descriptor 'fd' to a Flux job given by 'id'.
 * The contents of the FD will be written to 'destpath' on every compute node
 * in the job.
 *
 * If 'destpath' is a relative path, it will be interpreted as relative to
 * the job's temporary directory (given by the FLUX_JOB_TMPDIR environment
 * variable).
 *
 * The contents will be broken up into chunks of size 'chunksize', in bytes.
 * If 'chunksize' is 0, a reasonable default will be chosen.
 *
 * 'flags' can be 0 or any OR'd combination of FLUX_JOB_BCAST_PRESERVE and
 * FLUX_JOB_BCAST_OVERWRITE.
 */
flux_future_t *flux_job_bcast (flux_t *h, flux_jobid_t id, int fd,
                               const char *destpath, size_t chunksize,
                               int flags);

A potential implementation is being discussed in https://github.com/flux-framework/flux-core/issues/3744
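To make the chunking semantics of the proposed call concrete, here is a minimal Python sketch of how a sender might split a file's contents; the 1 MiB default and the helper name are assumptions for illustration, not part of the proposal.

```python
# Sketch of the chunking semantics described for the proposed
# flux_job_bcast(): split data into fixed-size chunks, substituting a
# default when chunksize == 0. The 1 MiB default is an assumption made
# for illustration only.

DEFAULT_CHUNKSIZE = 1024 * 1024  # hypothetical "reasonable default" (1 MiB)

def chunk_contents(data: bytes, chunksize: int = 0) -> list:
    """Split 'data' into chunks of 'chunksize' bytes (default if 0)."""
    if chunksize == 0:
        chunksize = DEFAULT_CHUNKSIZE
    return [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
```

Each chunk would then presumably travel as one RPC payload toward the nodes of the target job.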

jameshcorbett commented 2 years ago

Additional complication:

CTI launches tool daemons alongside a user's application. For Flux, it does so by launching the tool daemons with alloc-bypass. The files would be broadcast to the user's application job, and then shared with the tool daemons. From an email with Andrew:

We'll need to be able to share files between daemon launches, usually to share common files that have already been shipped.

Combine that requirement with the request in the comment above to place the files in "a known path on nodes that are reliably cleaned up" (i.e. FLUX_JOB_TMPDIR), and I think we would in theory need something like

something * flux_job_tmpdir (flux_t *h, flux_jobid_t id);

where something * is some kind of mapping from shell ranks (or hostnames?) to tmpdirs.
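To give that something * a concrete shape, here is a hedged Python sketch of a rank-to-tmpdir mapping; the path layout shown is an assumption for illustration, not Flux's actual scheme.

```python
# Hypothetical sketch: model the "something *" returned by the proposed
# flux_job_tmpdir() as a plain mapping from shell rank to that rank's
# job tmpdir. The "jobtmp" subdirectory name is an assumption, not
# Flux's real path layout.

def build_tmpdir_map(rundirs: dict) -> dict:
    """Map each shell rank to a per-job tmpdir under that rank's rundir.

    'rundirs' maps shell rank -> broker rundir on that rank's node.
    """
    return {rank: f"{rundir}/jobtmp" for rank, rundir in rundirs.items()}
```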

CTI's current workaround is to copy Flux's TMPDIR logic to get FLUX_JOB_TMPDIR :

We get the broker rank and rundir: CTI watches guest.exec.eventlog for a shell.init event, which ends up containing this information.
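The workaround above can be sketched as a small eventlog parser; the exact field names in the shell.init context (in particular "rundir") are assumptions about the event's shape, shown for illustration only.

```python
import json

# Illustrative sketch of CTI's workaround: pull the leader broker rank
# and rundir out of a shell.init entry in guest.exec.eventlog. Eventlog
# entries are JSON objects, one per line; the context field names used
# here are assumptions for illustration.

def parse_shell_init(line: str):
    """Return (leader_rank, rundir) from a shell.init eventlog line,
    or None if the line is some other event."""
    event = json.loads(line)
    if event.get("name") != "shell.init":
        return None
    ctx = event["context"]
    return ctx["leader-rank"], ctx["rundir"]
```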

Example use case:

As an example, one of our products launches a GDB instance on a node with a running target job, then attaches to the local process for debugging. There are GDB support files that need to be shipped to the job's temporary directory. CTI passes the job's temporary directory containing these files as an argument to GDB, without actually being inside that job. So while the temporary directory is accessible inside the job instance, CTI needs it outside of that context.

jameshcorbett commented 2 years ago

@ardangelo is it expected that the CTI frontend runs on the login node of a cluster? If so, is that the only expected/supported way of running or can the CTI frontend run and launch applications within an allocation (from within a "job" in Slurm terms)?

It's hard for me to tell what's going on looking at the Slurm implementation because with Slurm you can use "srun" both to get an allocation (when invoked on a login node) and to run an application once you already have an allocation (when invoked from within sbatch or salloc).

garlick commented 2 years ago

Since I mentioned this on the call, I'll mention it here too - the flux exec command can be used to run a command across all brokers in an allocation/batch job in parallel. Since it duplicates stdin to all remote tasks, something like this works:

$ tar cf - mpi-test | flux exec tar -C /tmp -xf -

This won't be very scalable since no attempt is made to leverage the tree based overlay network for data fan-out, but it does have the benefit of working out of the box in all modern versions of flux.
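To illustrate what tree-based fan-out would buy here, the sketch below computes each broker's children in a k-ary tree: instead of rank 0 sending the data to every node directly, each broker forwards to at most k children. This is a generic sketch of the idea, not Flux's overlay code.

```python
# Generic k-ary tree fan-out: rank r forwards data to ranks
# r*k+1 .. r*k+k (those that exist), so no single node uploads more
# than k copies of the data. Illustration only; not Flux's overlay
# implementation.

def tree_children(rank: int, size: int, k: int = 2) -> list:
    """Return the child ranks of 'rank' in a k-ary tree of 'size' ranks."""
    return [c for c in range(rank * k + 1, rank * k + k + 1) if c < size]
```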

ardangelo commented 2 years ago

@jameshcorbett , the CTI frontend usually does run on the login node of a cluster, but our design goals are to support running on any node that can communicate with the WLM and the compute nodes. In the Flux case, we would want to support running both in an allocation, but also from any other shell that can successfully start communication with libflux to get layout information about the target job.

@garlick , thanks for the suggestion Jim. I've tried it out on a single-node Flux setup, and I'm getting about 45MB/s best-case sending a file via this method to the same machine. GDB4hpc is the largest product we transfer, about 200MB all packaged up. I haven't set up a multi-node Flux system, so I'm not able to test that performance across a network. It's not very scalable, as you mentioned, but it is more available than the SSH-based file transfer implementation we are using for Flux now (which requires passwordless SSH access to nodes).

For now, I am planning to use SSH as the primary transfer method for the Flux implementation, and fall back to the flux exec implementation if that is unavailable.

jameshcorbett commented 2 years ago

@jameshcorbett , the CTI frontend usually does run on the login node of a cluster, but our design goals are to support running on any node that can communicate with the WLM and the compute nodes. In the Flux case, we would want to support running both in an allocation, but also from any other shell that can successfully start communication with libflux to get layout information about the target job.

Hmmm, I'm thinking it could be a little bit more tricky to make the broadcast work outside of an allocation. Inside a Flux allocation, all you would need would be the jobid. Outside the allocation (on a login node) you would need the jobid of the allocation and the jobid of the actual target job. In that way I guess it's like Slurm job ID and step ID. Sorry if you've already thought about this @ardangelo , I'm not sure what conditions you've tested the implementation under.

If you had both, it might look like flux proxy jobid:$JOBID_OF_ALLOCATION flux bcast $JOBID_OF_TARGET_JOB ... (or Jim's tar/exec line instead of flux bcast) but I will let other Flux developers chime in.

ardangelo commented 2 years ago

It makes sense that they would need to at least be in an allocation to access the Flux API.

If the user is running in another allocation, the job ID of that allocation should be available somewhere in the Flux API, right? So I'm thinking the workflow would be

ardangelo commented 2 years ago

Redirecting input via flux exec works in the shell, but when launched inside CTI, I'm getting the errors

2022-09-15T14:06:03.224712Z broker.err[0]: channel buffer error: rank = 0 pid = 13014, stream = stdin, len = 1048576: Success
flux-exec: Error: rank 0: cat: Value too large for defined data type
2022-09-15T14:06:03.255490Z broker.err[0]: server_write_cb: lookup_pid: No such file or directory

I'm using Flux 0.40.0-15; it happens with cat, sed, and a minimal C program that redirects input. I haven't seen this before in CTI when launching other programs, but it could be something with the input redirection.

jameshcorbett commented 1 year ago

See also https://github.com/flux-framework/flux-core/issues/4807 and https://github.com/flux-framework/flux-core/pull/4789

ardangelo commented 1 year ago

Thanks for the update on that. So for CTI, we would ship a file to somewhere accessible from the rank 0 broker, then use flux filemap to make it accessible, and flux filemap get on the other ranks?

garlick commented 1 year ago

That's essentially it. The rank 0 broker in this case is the first node of the batch/alloc job, which is a Flux instance owned by the user. It's not necessary to use flux filemap get directly. If preferred, there's a shell option that can be used with flux run or flux submit that will fetch file(s) on each rank of a job within the batch job.

There are a couple examples at the bottom of the man page for flux-filemap(1).

garlick commented 7 months ago

flux-filemap was replaced with flux-archive, which no longer has the rank 0 constraint. Cray is aware of the change and is presumably reworking the CTI implementation for Flux. Closing this.