Open alexandervaneck opened 3 years ago
This proposal sounds totally reasonable to me. Do you have any interest in raising a PR?
Very much! I'm only waiting on the infrastructure that I have access to to enable the SLURM REST API, so I can test it.
Would you have time for a review sometime this week/next week @jacobtomlinson?
Happy to review but I think it best for one of the core maintainers of this repo (@guillaumeeb, @lesteve) to do the final merge.
This sounds also totally reasonable to me, as it keeps the general concept of dask-jobqueue and should be pretty readable by the look of your snippet!
It is really nice to have a REST API for submitting jobs, and Slurm is definitely a nice job scheduler.
So waiting for your PR 👍 !
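Just curious, your use case is to create a RemoteSlurmCluster on a host
- where Slurm is not installed (for example sbatch does not exist) but you have access to the Slurm REST API endpoint
- which is in the same network as the Slurm cluster (I don't know the exact technical term but what I mean is that the Dask scheduler on this host needs to communicate over HTTP to the Dask workers on your computing nodes)
Did I get this right?
If so it does not feel like this is a very common situation but I may be missing something of course ...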
@lesteve thank you for responding 🙇♂️ I've been poking around dask-jobqueue for a few days now and am very happy to see it's been very well maintained. Thank you for this.
Just curious, your use case is to create a RemoteSlurmCluster on a host
- where Slurm is not installed (for example sbatch does not exist) but you have access to the Slurm REST API endpoint
- which is in the same network as the Slurm cluster (I don't know the exact technical term but what I mean is that the Dask scheduler on this host needs to communicate over HTTP to the Dask workers on your computing nodes)
Did I get this right?
If so it does not feel like this is a very common situation but I may be missing something of course ...
Yes, that would be correct.
I wouldn't know how to determine whether this is a common situation. However, I would argue that since SLURM has introduced a REST API for starting jobs remotely, there must be some users.
The use case would be allowing a docker container running inside an HPC cluster to call out to SLURM to schedule dask-workers with specific (GPU) resource requirements.
Inside the docker container;
OK running the Dask scheduler inside a docker container (on a login node I assume) is a use case that makes sense. I did not think of this, thanks!
For security reasons I would think that cluster sys-admins would not allow connecting to the Slurm REST API endpoint from the outside, but maybe this kind of security constraint is only in place for "big" clusters.
I had in mind the ideal setup (unfortunately not easily possible as far as I know ...) where your Dask scheduler lives outside of the cluster and the Dask workers live inside the cluster. For example see https://github.com/dask/dask-jobqueue/issues/471 with more details.
inside a docker container (on a login node I assume)
Yes - or at least somewhere where it can send/receive calls from the SLURM REST API. I would say "inside" the cluster.
Has there been any progress here?
inside a docker container (on a login node I assume)
this is exactly my use case (except with a Singularity container, but it still means I don't have the sbatch binary directly available).
This could be of interest as well: https://gist.github.com/willirath/2176a9fa792577b269cb393995f43dda
It's ssh'ing back to the host system where srun etc are available.
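For readers who want the general idea without opening the gist: the following is a minimal sketch of the "ssh back to the host" pattern, not code taken from the gist. The helper name `wrap_with_ssh` and the host name are hypothetical, and it assumes passwordless ssh from the container to the host is already set up.

```python
import shlex


def wrap_with_ssh(command, host):
    """Return an argv list that runs `command` on `host` through ssh.

    `command` is a list such as ["sbatch", "job.sh"]. It is quoted into a
    single string because the remote shell re-parses whatever ssh sends.
    """
    return ["ssh", host, shlex.join(command)]


# Hypothetical usage: the result could be handed to subprocess.run(...)
# from inside the container, where sbatch itself is not installed.
argv = wrap_with_ssh(["sbatch", "job script.sh"], "login-node")
print(argv)
```

Note that the batch script path must be visible from the host (for example via a shared filesystem) for this to work.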
Has there been any progress here?
Unfortunately the associated PR has gone stale... But there was some work on it, so if anyone wants to keep going it would be nice!
+1 for this.
+1
+1
Hello 👋 Thank you for considering this feature request :) I have been looking over dask-jobqueue (together with prefect) to allocate resources on a Slurm cluster I have access to. dask-jobqueue seems exactly what we'd need for this, thank you for maintaining it 🙇
Context
In the case where Slurm and a python process (script or notebook) are not running on the same host, SlurmCluster will not be able to spawn any jobs and will raise an error.
Slurm added a REST API: https://slurm.schedmd.com/rest_api.html
Feature
Could we add a RemoteSlurmCluster and RemoteSlurmJob that largely extend SlurmCluster/SlurmJob and, instead of using subprocess.Popen, do an HTTP request? As far as I can tell this should be a drop-in replacement.
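To make the idea concrete, here is a rough sketch of what the HTTP side could look like. It is not an existing dask-jobqueue API: `build_submit_request` is a hypothetical helper, the host, port and API version string are placeholders, and the endpoint path, the `X-SLURM-USER-NAME` / `X-SLURM-USER-TOKEN` headers and the `script` / `job` payload shape reflect my reading of the Slurm REST API docs and may differ between Slurm versions.

```python
import json
import urllib.request


def build_submit_request(base_url, api_version, user, token, script, job_properties):
    """Build (but do not send) a job submission request for the Slurm REST API.

    Assumptions: the submit endpoint is /slurm/<api_version>/job/submit and
    authentication uses a user name plus a token passed as headers. The
    contents of `job_properties` are passed through untouched because the
    accepted field names depend on the API version.
    """
    url = f"{base_url.rstrip('/')}/slurm/{api_version}/job/submit"
    payload = {"script": script, "job": job_properties}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-SLURM-USER-NAME": user,
            "X-SLURM-USER-TOKEN": token,
        },
        method="POST",
    )


# Placeholder values throughout; nothing is sent over the network here.
request = build_submit_request(
    "http://slurm.example:6820/",
    "v0.0.36",
    "alice",
    "secret-token",
    "#!/bin/bash\necho hello\n",
    {"name": "dask-worker"},
)
print(request.get_method(), request.full_url)
```

A RemoteSlurmJob could then send this with `urllib.request.urlopen(request)` at the point where SlurmJob currently shells out to sbatch, and read the job id from the JSON response (the exact response format is something I would still need to confirm against a real endpoint).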
Thoughts? (tagging @mrocklin @lesteve for visibility, hope you would have time for a review.)