flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Running PRRTE inside a Flux allocation #3539

Open · rhc54 opened this issue 3 years ago

rhc54 commented 3 years ago

In some circumstances, users might want to be able to run an application underneath a PMIx server while operating in a Flux-based environment. I recognize that eventually Flux might directly provide that service, but (a) there is the interim situation to consider, and (b) Flux might choose not to support the full breadth of PMIx services.

We do see users operating this way in other environments (e.g., Cray, which is the most common case). Typically, users do this to utilize some PMIx feature beyond simple wireup - e.g., the event notification subsystem for fault tolerance, or the PMIx group support for MPI Sessions.

I'd be willing to implement the necessary PRRTE plugins to enable this usage, but I would need to know the following:

  1. How do I obtain knowledge of the allocation? Some RMs provide an envar that exposes the allocation (often in some regular expression format), while others put the allocation in a file or provide an accessor function to query it.
  2. Is there a mechanism by which I can have Flux spawn the PRRTE daemons on remote nodes within the allocation? PRRTE basically creates its own distributed virtual machine underneath Flux that will fork/exec the application's procs and provide the PMIx services on each node. In the absence of a Flux mechanism, we fall back to simple ssh, so that is always an option. However, it can be more efficient if there is something equivalent to Slurm's srun that we can call for this purpose.
  3. Is there a simple way to detect that Flux is installed on the cluster? This is for configure purposes - just looking for a header that we can use to determine that we should build this support.

If you can point me to any documentation on these subjects, I can take it from there (at least for most of the way - I don't have access to a Flux-based machine for testing, but I'm sure I can find someone who can help in that regard).

SteVwonder commented 3 years ago

How do I obtain knowledge of the allocation?

Some of that info is available through environment variables (size, rank, nnodes, etc).

ƒ(s=1,d=0) fluxuser@e5a81809a450:~$ flux mini run printenv | grep FLUX
FLUX_CONNECTOR_PATH=/usr/lib/flux/connectors
FLUX_MODULE_PATH=/usr/lib/flux/modules
FLUX_EXEC_PATH=/usr/libexec/flux/cmd
FLUX_PMI_LIBRARY_PATH=/usr/lib/flux/libpmi.so
FLUX_TERMINUS_SESSION=0
FLUX_TASK_LOCAL_ID=0
FLUX_TASK_RANK=0
FLUX_JOB_SIZE=1
FLUX_JOB_NNODES=1
FLUX_JOB_ID=ƒ3qVBKRy
FLUX_URI=local:///tmp/flux-ob8THD/0/local
FLUX_KVS_NAMESPACE=job-107961384960

The hostlist is accessible via our job-info service. You can send an RPC via the C and Python Flux APIs to grab this info if you want. Below is a CLI example (for simplicity of demonstration):

ƒ(s=1,d=0) fluxuser@e5a81809a450:~$ flux job info ƒ3qVBKRy R
{"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}],"starttime":0.0,"expiration":0.0,"nodelist":["e5a81809a450"]}}

Let me know if there is any other info about the allocation that you need to grab.

Is there a mechanism by which I can have Flux spawn the PRRTE daemons on remote nodes within the allocation?

Just to make sure I'm on the same page, on something like Slurm or a Cray the user first runs srun prted and then submits their job to prrte via prun, right?

However, it can be more efficient if there is something equivalent to Slurm's srun that we can call for this purpose.

The flux analog to srun is flux mini run. So maybe flux mini run prted fits your needs?
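
For illustration, a rough sketch of what that might look like for a 4-node allocation (the -N/-n values and the prted arguments here are placeholders, not something I've tested under Flux):

$ flux mini run -N 4 -n 4 prted <prted args...>   # one prted task per node, I believe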

Is there a simple way to detect that Flux is installed on the cluster? This is for configure purposes - just looking for a header that we can use to determine that we should build this support.

Yep. Flux-core installs a flux.h header in $PREFIX/include/flux/core/:

ƒ(s=1,d=0) fluxuser@e5a81809a450:~$ find /usr/include/ -name "flux.h"
/usr/include/flux/core/flux.h
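
If it helps on the configure side, I believe flux-core also ships a pkg-config file, so a check along these lines might work too (the flux-core module name is an assumption on my part, so please verify it against your install):

$ pkg-config --exists flux-core && echo "flux-core dev files found"   # header/lib discovery via pkg-config
$ pkg-config --cflags --libs flux-core                                # compile/link flags, if the .pc file is present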

I don't have access to a Flux-based machine for testing, but I'm sure I can find someone who can help in that regard

Flux can easily run in userspace. We also have a docker image for easy dev/testing locally:

> docker run -ti fluxrm/flux-core:latest

All of the examples above are from that docker image. You can easily start a multi-rank Flux instance on a single machine (for testing) with flux start -s $SIZE.
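
For example, a quick end-to-end test might look like this (sizes are arbitrary):

$ docker run -ti fluxrm/flux-core:latest   # shell inside the container
$ flux start -s 4                          # 4-broker Flux instance, drops you into a shell attached to rank 0
$ flux mini run -n 4 hostname              # sanity check: run a 4-task job in that instance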

grondo commented 3 years ago

You can easily start a multi-rank Flux instance on a single machine (for testing) with flux start -s $SIZE.

You can also run Flux as a job under Slurm, e.g.

$ srun --pty [OPTIONS] /path/to/flux start

So in a sense, any cluster can be a Flux cluster.
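
For example (the option values here are only illustrative), a 4-node Flux instance under Slurm could be started with something like:

$ srun -N4 -n4 --pty /path/to/flux start   # one Flux broker per node, interactive shell on rank 0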

rhc54 commented 3 years ago

Let me know if there is any other info about the allocation that you need to grab.

I think that should cover it - mostly just need the list of node names/ids

Just to make sure I'm on the same page, on something like Slurm or a Cray the user first runs srun prted and then submits their job to prrte via prun, right?

Not exactly. Under Slurm, the sequence usually is:

$ salloc <number of nodes you want>
$ prte &
$ prun ...
$ pterm (when done)

prte autosenses what environment it is operating under, obtains the allocation info, and then internally calls srun to launch one prted daemon on each allocated node. Those daemons wire up back to prte, which then sends launch commands to those daemons whenever someone executes a prun. Once Slurm launches those prted daemons, it no longer has any functional involvement in the execution of the applications.

If someone instead uses srun prte to start the DVM, the problem is that you only want one instance of prte to be executed, and that means Slurm will place prte into a 1-slot step. prte then senses it only has that one slot to work with, which isn't what the user intended. Hence the above steps.

I suspect you have something equivalent to salloc to get an allocation?

The flux analog to srun is flux mini run. So maybe flux mini run prted fits your needs?

Sounds like it might - I'll dig into that option.

So in a sense, any cluster can be a Flux cluster.

Any chance it will run on a Mac? 😄

Thanks for the assist! I can't promise how soon I'll have this operational as I have some other obligations first, but I'll keep you posted here.

rhc54 commented 3 years ago

Missed something I should have asked about:

You can send an RPC via the C and Python Flux APIs

Are those APIs GPL? If so, is there a non-GPL way of getting that info - e.g., a cmd line tool that will report the node list so I can "scrape" it from the output?

grondo commented 3 years ago

Are those APIs GPL? If so, is there a non-GPL way of getting that info - e.g., a cmd line tool that will report the node list so I can "scrape" it from the output?

The APIs are LGPL. The command-line tool that will help you process the output of flux job info JOBID R is flux R decode:

$ flux job info ƒ2PSQLxsCK R | flux R decode --nodelist
asp

BTW what we call R is described in full in RFC 20/Resource Set Specification V1

grondo commented 3 years ago

The flux analog to srun is flux mini run. So maybe flux mini run prted fits your needs? Sounds like it might - I'll dig into that option.

flux mini run prte will allocate resources from the Flux instance though. I assume you want a way to launch daemons without allocating resources. You might look into flux exec or its C API equivalent for that purpose.
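
For example, something along these lines is probably closer to the prte/prted model (a sketch only; flux exec targets all broker ranks by default, if I remember right, and the prted arguments are placeholders):

$ flux exec prted <prted args...>   # one prted per broker rank, launched outside the scheduler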

rhc54 commented 3 years ago

You might look into flux exec

Good catch @grondo! I'll do that.

grondo commented 3 years ago

I suspect you have something equivalent to salloc to get an allocation?

Not sure how familiar you are with Flux, @rhc54. In Flux, while all jobs get an "allocation" of resources from their parent, the Flux sbatch and salloc equivalents (e.g. flux mini batch, flux mini alloc) also spawn a new Flux instance, which then allows scheduling and submission of new jobs and even new sub-allocations.
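
So once prte can detect Flux, the analog of your Slurm sequence might look something like this (a sketch under that assumption; the flags shown are illustrative):

$ flux mini alloc -N 4 -n 4   # interactive allocation: spawns a new 4-node Flux instance plus a shell
$ prte &                      # start the DVM inside that instance
$ prun ...
$ pterm                       # when done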

Any chance it will run on a Mac? 

Unfortunately, not natively. But many developers use docker on Macs successfully.

rhc54 commented 3 years ago

Not sure how familiar you are with Flux, @rhc54. In Flux, while all jobs get an "allocation" of resources from their parent, the Flux sbatch and salloc equivalents (e.g. flux mini batch, flux mini alloc) also spawn a new Flux instance, which then allows scheduling and submission of new jobs and even new sub-allocations.

I honestly haven't used it yet - appreciate the clarification.

But many developers use docker on Macs successfully.

Me too - easy enough to go that way.

SteVwonder commented 3 years ago

You might look into flux exec or its C API equivalent for that purpose.

This could also be a good use case for the scheduler bypass functionality added in https://github.com/flux-framework/flux-core/pull/3740. If you are running prte under a nested instance of Flux, you could easily make the requested resource set of the prte job exactly equal to the allocated resource set of the nested Flux instance.