It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

Support for allocations on several different job managers #695

Open · npf opened 2 months ago

npf commented 2 months ago

Hi,

In https://it4innovations.github.io/hyperqueue/stable/deployment/allocation/, the documentation says:

> You can create multiple allocation queues, and you can even combine PBS queues with Slurm queues.

Does that mean hyperqueue can auto-allocate on several HPC clusters with different submission frontends?

Technically speaking, I do not see how to configure the necessary remote access to submission frontends. In the code, the allocation function calls either sbatch or qsub directly, if I'm correct.

Should both the sbatch and qsub commands be available on the machine where the hyperqueue server runs?

Thanks.

Kobzol commented 2 months ago

Hi!

> Does that mean hyperqueue can auto-allocate on several HPC clusters with different submission frontends?

It does, although to actually support two different clusters, you'll need to run the HQ server in a location that is reachable from both clusters (and their compute nodes) over TCP/IP, which might be a bit challenging. Also, if you want to use automatic allocation for this, it's a bit more complex (see below).
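For illustration, attaching one queue per manager to the same server could look like this (the time limits, PBS queue name, project ID, and Slurm partition are placeholders; the arguments after `--` are passed through to qsub/sbatch):

```bash
# One PBS-backed and one Slurm-backed allocation queue on the same HQ server
hq alloc add pbs --time-limit 1h -- -q qprod -A PROJECT-ID
hq alloc add slurm --time-limit 1h -- --partition=compute
```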

> Technically speaking, I do not see how to configure the necessary remote access to submission frontends. In the code, the allocation function calls either sbatch or qsub directly, if I'm correct.

It does indeed call sbatch/qsub directly. We have been thinking about providing some way to customize this mechanism, but we haven't seen any use-case for it yet. A simpler solution/workaround might be to provide a proxy that reroutes the sbatch/qsub calls from the node where the HQ server is deployed to the corresponding login nodes/frontends. You could probably write e.g. a simple Python program that acts as sbatch/qsub and communicates with the remote systems; a rough sketch of this idea follows.
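To make the idea concrete, here is a minimal, untested sketch of such a shim for the Slurm side. Everything specific in it is an assumption: the `LOGIN_HOST` name, the fact that HQ invokes it as `sbatch <script>`, and the trick of piping the batch script to the remote sbatch over stdin (sbatch reads the script from stdin when no file argument is given). A complete shim would also have to forward whatever status/cancellation commands the server issues:

```python
#!/usr/bin/env python3
# Hypothetical "sbatch" shim placed on the HQ server's PATH. It forwards
# the submission over SSH to a real Slurm frontend and passes sbatch's
# output back, so HQ can parse the job id as usual.
import shlex
import subprocess
import sys

LOGIN_HOST = "login1.cluster-a.example.org"  # assumed frontend hostname

def main() -> int:
    args = sys.argv[1:]
    script = b""
    # If the last argument looks like a script path, read it locally and
    # feed it to the remote sbatch via stdin instead of copying the file.
    if args and not args[-1].startswith("-"):
        with open(args[-1], "rb") as f:
            script = f.read()
        args = args[:-1]
    result = subprocess.run(
        ["ssh", LOGIN_HOST, "sbatch"] + [shlex.quote(a) for a in args],
        input=script,
        capture_output=True,
    )
    # Relay stdout/stderr and the exit code unchanged to the caller (HQ).
    sys.stdout.buffer.write(result.stdout)
    sys.stderr.buffer.write(result.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(main())
```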

If you had a use-case for this, we could also implement e.g. a JSON-based auto-allocation backend, which could drive the allocations through whatever mechanism it needs.

> Should both the sbatch and qsub commands be available on the machine where the hyperqueue server runs?

Currently, yes, if you want to use auto-allocation (or you can use a proxy as described above).

If you don't use automatic allocation, you can also provide the computational resources to HQ manually, by running sbatch/qsub on the corresponding clusters yourself and then pointing the HQ workers at the HQ server. In that case the server does not need to know anything about sbatch/qsub. A minimal example of such a manual submission is sketched below.
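For instance, a manually submitted Slurm job that contributes a worker might look roughly like this (the partition, time limit, and server directory path are placeholders; the server directory, which contains the server's address and keys, has to be copied to or shared with the cluster beforehand):

```bash
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --time=01:00:00

# Start a HyperQueue worker; it reads the server's hostname/port and
# credentials from the copied server directory and connects back over TCP.
hq worker start --server-dir=/path/to/hq-server-dir
```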