Open rhc54 opened 3 years ago
How do I obtain knowledge of the allocation?
Some of that info is available through environment variables (size, rank, nnodes, etc).
ƒ(s=1,d=0) fluxuser@e5a81809a450:~$ flux mini run printenv | grep FLUX
FLUX_CONNECTOR_PATH=/usr/lib/flux/connectors
FLUX_MODULE_PATH=/usr/lib/flux/modules
FLUX_EXEC_PATH=/usr/libexec/flux/cmd
FLUX_PMI_LIBRARY_PATH=/usr/lib/flux/libpmi.so
FLUX_TERMINUS_SESSION=0
FLUX_TASK_LOCAL_ID=0
FLUX_TASK_RANK=0
FLUX_JOB_SIZE=1
FLUX_JOB_NNODES=1
FLUX_JOB_ID=ƒ3qVBKRy
FLUX_URI=local:///tmp/flux-ob8THD/0/local
FLUX_KVS_NAMESPACE=job-107961384960
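For example, a task can read those variables straight out of its environment. A minimal Python sketch (the variable names are taken from the printenv output above; the defaults are only fallbacks for running outside a Flux job):

```python
import os

# Allocation info that Flux exports into each task's environment
# (names taken from the printenv output above).
job_size = int(os.environ.get("FLUX_JOB_SIZE", "1"))   # total tasks in the job
nnodes = int(os.environ.get("FLUX_JOB_NNODES", "1"))   # nodes in the allocation
rank = int(os.environ.get("FLUX_TASK_RANK", "0"))      # this task's global rank
job_id = os.environ.get("FLUX_JOB_ID", "")             # e.g. "ƒ3qVBKRy"

print(f"rank {rank} of {job_size} task(s) on {nnodes} node(s), job {job_id!r}")
```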
The hostlist is accessible via our job-info service. You can send an RPC via the C and Python Flux APIs to grab this info if you want. Below is a CLI example (for simplicity of demonstration):
ƒ(s=1,d=0) fluxuser@e5a81809a450:~$ flux job info ƒ3qVBKRy R
{"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}],"starttime":0.0,"expiration":0.0,"nodelist":["e5a81809a450"]}}
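The R document is plain JSON, so once you have it (from the CLI here, or via the C/Python APIs) extracting the nodelist needs nothing beyond a JSON parser. A minimal Python sketch using the exact R document from the example above:

```python
import json

# The R document returned by `flux job info JOBID R`, as shown above.
r_doc = (
    '{"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}],'
    '"starttime":0.0,"expiration":0.0,"nodelist":["e5a81809a450"]}}'
)

r = json.loads(r_doc)
nodelist = r["execution"]["nodelist"]  # hostnames in the allocation
print(nodelist)  # → ['e5a81809a450']
```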
Let me know if there is any other info about the allocation that you need to grab.
Is there a mechanism by which I can have Flux spawn the PRRTE daemons on remote nodes within the allocation?
Just to make sure I'm on the same page: on something like Slurm or a Cray, the user first runs srun prted and then submits their job to prrte via prun, right?
However, it can be more efficient if there is something equivalent to Slurm's srun that we can call for this purpose.
The flux analog to srun is flux mini run. So maybe flux mini run prted fits your needs?
Is there a simple way to detect that Flux is installed on the cluster? This is for configure purposes - just looking for a header that we can use to determine that we should build this support.
Yep. Flux-core installs a flux.h header in $PREFIX/include/flux/core/:
ƒ(s=1,d=0) fluxuser@e5a81809a450:~$ find /usr/include/ -name "flux.h"
/usr/include/flux/core/flux.h
I don't have access to a Flux-based machine for testing, but I'm sure I can find someone who can help in that regard
Flux can easily run in userspace. We also have a docker image for easy dev/testing locally:
> docker run -ti fluxrm/flux-core:latest
All of the examples above are from that docker image. You can easily start a multi-rank Flux instance on a single machine (for testing) with flux start -s $SIZE.
You can also run Flux as a job under Slurm, e.g.
$ srun --pty [OPTIONS] /path/to/flux start
So in a sense, any cluster can be a Flux cluster.
Let me know if there is any other info about the allocation that you need to grab.
I think that should cover it - mostly just need the list of node names/ids
Just to make sure I'm on the same page: on something like Slurm or a Cray, the user first runs srun prted and then submits their job to prrte via prun, right?
Not exactly. Under Slurm, the sequence usually is:
$ salloc <number of nodes you want>
$ prte &
$ prun ...
$ pterm (when done)
prte autosenses what environment it is operating under, obtains the allocation info, and then internally calls srun to launch one prted daemon on each allocated node. Those daemons wire up back to prte, which then sends launch commands to them whenever someone executes a prun. Once Slurm launches those prted daemons, it no longer has any functional involvement in the execution of the applications.
If someone instead uses srun prte to start the DVM, the problem is that you only want one instance of prte to be executed, and that means Slurm will place prte into a 1-slot step. prte then senses it only has that one slot to work with, which isn't what the user intended. Hence the above steps.
I suspect you have something equivalent to salloc to get an allocation?
The flux analog to srun is flux mini run. So maybe flux mini run prted fits your needs?
Sounds like it might - I'll dig into that option.
So in a sense, any cluster can be a Flux cluster.
Any chance it will run on a Mac? 😄
Thanks for the assist! I can't promise how soon I'll have this operational as I have some other obligations first, but I'll keep you posted here.
Missed something I should have asked about:
You can send an RPC via the C and Python Flux APIs
Are those APIs GPL? If so, is there a non-GPL way of getting that info - e.g., a cmd line tool that will report the node list so I can "scrape" it from the output?
Are those APIs GPL? If so, is there a non-GPL way of getting that info - e.g., a cmd line tool that will report the node list so I can "scrape" it from the output?
The APIs are LGPL. The cmd line tool that will help you process flux job info JOBID R is flux R decode:
$ flux job info ƒ2PSQLxsCK R | flux R decode --nodelist
asp
BTW what we call R is described in full in RFC 20/Resource Set Specification V1
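Note that the nodelist printed by flux R decode --nodelist is in Flux's compressed hostlist notation (e.g. node[0-3]), so a scraper may need to expand it. A minimal, hedged Python expander for the simple single-range case (real hostlists also allow comma-separated lists of names and ranges, which this sketch ignores):

```python
import re

def expand_hostlist(hosts):
    """Expand a simple hostlist like 'node[0-3]' into individual names.

    Handles only a single bracketed numeric range; plain names pass
    through unchanged. Flux's full hostlist notation is more general.
    """
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", hosts)
    if not m:
        return [hosts]
    prefix, lo, hi, suffix = m.groups()
    width = len(lo)  # preserve zero-padding such as node[01-04]
    return [f"{prefix}{str(i).zfill(width)}{suffix}"
            for i in range(int(lo), int(hi) + 1)]
```

For example, expand_hostlist("asp") passes the single name through unchanged, matching the CLI output above.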
The flux analog to srun is flux mini run. So maybe flux mini run prted fits your needs? Sounds like it might - I'll dig into that option.
flux mini run prte will allocate resources from the Flux instance, though. I assume you want a way to launch daemons without allocating resources. You might look into flux exec or its C API equivalent for that purpose.
You might look into flux exec
Good catch, @grondo! I'll do that.
I suspect you have something equivalent to salloc to get an allocation?
Not sure how familiar you are with Flux, @rhc54. In Flux, though all jobs get an "allocation" of resources from their parent, the Flux sbatch and salloc equivalents (e.g. flux mini alloc, flux mini batch) spawn a new Flux instance, which then allows scheduling, submission of new jobs, and even new sub-allocations.
Any chance it will run on a Mac? 
Unfortunately, not natively. But many developers use docker on Macs successfully.
Not sure how familiar you are with Flux, @rhc54. In Flux, though all jobs get an "allocation" of resources from their parent, the Flux sbatch and salloc equivalents (e.g. flux mini alloc, flux mini batch) spawn a new Flux instance which then allows scheduling, submission of new jobs and even new sub-allocations.
I honestly haven't used it yet - appreciate the clarification.
But many developers use docker on Macs successfully.
Me too - easy enough to go that way.
You might look into flux exec or its C API equivalent for that purpose.
This could also be a good use case for the scheduler bypass functionality added in https://github.com/flux-framework/flux-core/pull/3740. If you are running prte under a nested instance of Flux, you could easily make the requested resource set of the prte job exactly equal to the allocated resource set of the nested Flux instance.
In some circumstances, users might want to be able to run an application underneath a PMIx server while operating in a Flux-based environment. I recognize that eventually Flux might directly provide that service, but (a) there is the interim situation to consider, and (b) Flux might choose not to support the full breadth of PMIx services.
We do see users operating this way on other environments (e.g., Cray, which is the most common case). Typically, users do this to utilize some PMIx feature beyond simple wireup - e.g., the event notification subsystem for fault tolerance, or the PMIx group support for MPI Sessions.
I'd be willing to implement the necessary PRRTE plugins to enable this usage, but I would need to know the following:

- How do I obtain knowledge of the allocation?
- Is there a mechanism by which I can have Flux spawn the PRRTE daemons on remote nodes within the allocation? We can launch the daemons via ssh, so that is always an option. However, it can be more efficient if there is something equivalent to Slurm's srun that we can call for this purpose.
- Is there a simple way to detect that Flux is installed on the cluster? This is for configure purposes - just looking for a header that we can use to determine that we should build this support.

If you can point me to any documentation on these subjects, I can take it from there (at least for most of the way - I don't have access to a Flux-based machine for testing, but I'm sure I can find someone who can help in that regard).