CSCfi / csc-user-guide

User guides, FAQs and tutorials related to CSC services
https://docs.csc.fi
Creative Commons Attribution 4.0 International

Properties of HyperQueue #2174

Open Kobzol opened 1 month ago

Kobzol commented 1 month ago

Hi! I noticed the High throughput page in CSC docs (https://docs.csc.fi/computing/running/throughput/), and I have some comments on the properties of HyperQueue (I'm one of the HQ developers, as a disclaimer).

It seems that the page mentions HyperQueue as a primary alternative to GNU parallel or Slurm array jobs, which is fine; however, that's just one of the interfaces for using HQ. In fact, HQ supports (and was explicitly designed to support) many of the things mentioned on the page, even though the page claims that it does not :)

Here are a few things that I'd like to clarify, regarding the "decision tree" and the comparison table:

While HQ can be used "just" as a task executor within a single Slurm allocation, it is much more powerful when used as a meta-scheduler: users simply run the server on a login node and let HQ manage Slurm allocations for them fully automatically.
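For reference, a minimal sketch of the meta-scheduler workflow on a login node (the partition name, time limit, and script name below are illustrative placeholders, not taken from the CSC docs; check `hq alloc add --help` on your HQ version for the exact flags):

```shell
# Start the HQ server on the login node.
hq server start &

# Register an automatic allocation queue: HQ will submit Slurm
# allocations on demand as tasks arrive. Partition and time limit
# are placeholders.
hq alloc add slurm --time-limit 1h -- --partition=small

# Submit tasks as usual; HQ spawns and reuses Slurm allocations
# behind the scenes.
hq submit --cpus 1 ./my-computation.sh
```

With this setup, users never write `sbatch` scripts themselves; HQ decides when allocations are needed and tears them down when idle.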

I hope that this description makes the set of offered HQ features a bit more accurate :) I can send a PR that clarifies these points in your docs if you want.

rkronberg commented 1 month ago

Hi @Kobzol ! Good points here, thanks a lot for the feedback.

Just as a quick response, some of the features marked as "partial" or "unsupported" may also reflect the way we recommend users use HQ on CSC servers. So a better formulation might be "unsupported at CSC", even though the feature is technically supported.

Nonetheless, we'll take a closer look at these suggestions and improve/clarify our documentation accordingly. Thanks again!

Kobzol commented 1 month ago

Indeed, I think that the page mostly treats HQ as a "better GNU Parallel". In that case, I would still modify some of the descriptions (e.g. fault tolerance, job packing, multi-partition support, and perhaps Slurm integration), which are also relevant in that use case.

On the other hand, HQ is a fully general DAG execution engine if you take its TOML workflow files and/or Python API into account, in which case it has a lot of similarities to Snakemake/FireWorks/Nextflow/Dask/etc. It's fine if you don't want to highlight these features (dependencies, multi-node tasks, etc.) on this page, though.
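As an illustration of the DAG side, here is a hedged sketch of what an HQ TOML workflow file can look like, with one task depending on another. The exact field names may differ between HQ versions (this is an assumption based on the HQ job-file format, not something copied from the CSC docs), and the scripts are placeholders:

```toml
# Hypothetical HQ job file: two tasks forming a tiny DAG.
[[task]]
id = 0
command = ["./preprocess.sh"]  # placeholder script

[[task]]
id = 1
command = ["./analyze.sh"]     # placeholder script
deps = [0]                     # runs only after task 0 finishes
```

A file like this would be submitted as a single HQ job, with HQ resolving the dependency edges at runtime, much like the workflow engines mentioned above.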

Kobzol commented 1 month ago

By the way, we would also like to better understand the reasoning behind this hint. Most of our users use the automatic allocator by default; we consider going through the dance of creating a Slurm allocation manually and starting the HQ infrastructure inside it to be unnecessarily "low-level", but your guide page seemingly approaches this from the exact opposite angle :)