CSCfi / csc-user-guide

User guides, FAQs and tutorials related to CSC services
https://docs.csc.fi
Creative Commons Attribution 4.0 International

Properties of HyperQueue #2174

Open Kobzol opened 1 month ago

Kobzol commented 1 month ago

Hi! I noticed the High throughput page in CSC docs (https://docs.csc.fi/computing/running/throughput/), and I have some comments on the properties of HyperQueue (I'm one of the HQ developers, as a disclaimer).

It seems that the page mentions HyperQueue as a primary alternative to GNU parallel or Slurm array jobs, which is fine; however, that's just one of the interfaces for using HQ. In fact, HQ supports (and was explicitly designed to support) many of the things mentioned on the page, even though the page claims that it does not :)

Here are a few things that I'd like to clarify, regarding the "decision tree" and the comparison table:

While HQ can be used "just" as a task executor within a single Slurm allocation, it is much more powerful when used as a meta-scheduler: users simply run the server on a login node and let HQ manage Slurm allocations for them fully automatically.
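For reference, a minimal sketch of the meta-scheduler workflow on a login node (the partition name, time limit, and script name below are illustrative placeholders, not taken from the CSC docs; check `hq alloc add --help` on your HQ version for the exact flags):

```shell
# Start the HQ server on the login node.
hq server start &

# Register an automatic allocation queue: HQ will submit Slurm
# allocations on demand as tasks arrive. Partition and time limit
# are placeholders.
hq alloc add slurm --time-limit 1h -- --partition=small

# Submit tasks as usual; HQ spawns and reuses Slurm allocations
# behind the scenes.
hq submit --cpus 1 ./my-computation.sh
```

With this setup, users never write `sbatch` scripts themselves; HQ decides when allocations are needed and tears them down when idle.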

I hope that this description makes the set of offered HQ features a bit more accurate :) I can send a PR that clarifies these points in your docs if you want.

rkronberg commented 1 month ago

Hi @Kobzol ! Good points here, thanks a lot for the feedback.

Just as a quick response, some of the features marked as "partial" or "unsupported" may also reflect the way we recommend users use HQ on CSC servers. So a better formulation might be "unsupported at CSC", even though the feature is technically supported.

Nonetheless, we'll take a closer look at these suggestions and improve/clarify our documentation accordingly. Thanks again!

Kobzol commented 1 month ago

Indeed, I think that the page mostly treats HQ as a "better GNU Parallel". In that case, I would still modify some of the descriptions (e.g. fault tolerance, job packing, multi-partition support, and perhaps Slurm integration), which are also relevant in that use case.

On the other hand, HQ is a fully general DAG execution engine if you take its TOML workflow files and/or Python API into account, in which case it has a lot of similarities to Snakemake/FireWorks/Nextflow/Dask/etc. It's fine if you don't want to highlight these features (dependencies, multi-node tasks, etc.) on this page, though.
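As an illustration of the DAG side, here is a hedged sketch of what an HQ TOML workflow file can look like, with one task depending on another. The exact field names may differ between HQ versions (this is an assumption based on the HQ job-file format, not something copied from the CSC docs), and the scripts are placeholders:

```toml
# Hypothetical HQ job file: two tasks forming a tiny DAG.
[[task]]
id = 0
command = ["./preprocess.sh"]  # placeholder script

[[task]]
id = 1
command = ["./analyze.sh"]     # placeholder script
deps = [0]                     # runs only after task 0 finishes
```

A file like this would be submitted as a single HQ job, with HQ resolving the dependency edges at runtime, much like the workflow engines mentioned above.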

Kobzol commented 1 month ago

By the way, we would also like to better understand the reasoning behind this hint. Most of our users use the automatic allocator by default; we consider going through the dance of creating a Slurm allocation manually and starting the HQ infrastructure inside it to be unnecessarily "low-level", but your guide page seemingly approaches this from the exact opposite angle :)