It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

Support for giving slurm jobs of workers different names #708

Closed StHagel closed 1 month ago

StHagel commented 1 month ago

Currently, all workers are named hq-alloc in Slurm when viewed in squeue. It would be nice to be able to give the workers custom names. In Slurm, this can be done via the --job-name='My job name' flag.
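
For illustration only, here is a minimal Python sketch of how a custom job name is passed to Slurm through sbatch's --job-name option. The helper name sbatch_command is hypothetical, the command is only built and printed, never executed, and this is not how HyperQueue itself submits allocations.

```python
import shlex


def sbatch_command(job_name: str, wrapped: str) -> list[str]:
    """Build (but do not run) an sbatch invocation with a custom job name.

    `sbatch_command` is a hypothetical helper for this example;
    --job-name and --wrap are standard sbatch options.
    """
    return ["sbatch", f"--job-name={job_name}", f"--wrap={wrapped}"]


cmd = sbatch_command("My job name", "hostname")
# shlex.join adds the shell quoting needed for the space in the job name.
print(shlex.join(cmd))
```

Once such a job is submitted, its name appears in the NAME column of squeue in place of a default.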

Kobzol commented 1 month ago

If you executed hq alloc add --name <foo> and all allocations from this queue were then named <foo>, would that be OK for you? Or should they be named e.g. <foo>-1, <foo>-2, etc.?

StHagel commented 1 month ago

Would it also be possible to add a flag to hq submit instead of hq alloc?

Kobzol commented 1 month ago

You can already state the name of a job (hq submit --name <foo>), but this has nothing to do with allocations. Note that HQ jobs are completely separated from allocations, and therefore any attribute of a job cannot affect attributes of Slurm/PBS allocations.

StHagel commented 1 month ago

Right, makes sense.

Then having a --name flag available for hq alloc seems reasonable. Would the index in the name indicate a worker?

Kobzol commented 1 month ago

The index would indicate the order of the allocation created in the given allocation queue. So the first allocation created by HQ would get <foo>-1, the second one would get <foo>-2, etc. In theory, we could also let the allocator name the workers; currently, they get their name from the hostname of the node on which they are spawned.

(The flag is already available btw, it just isn't propagated to the Slurm allocation name, which is what we could change).
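
The indexed naming scheme described here can be sketched in a few lines of Python. This is only an illustration of the <foo>-1, <foo>-2 pattern: the function allocation_names is hypothetical, and the starting index of 1 is taken from the description above, not from HyperQueue's actual code.

```python
from itertools import count
from typing import Iterator


def allocation_names(queue_name: str, start: int = 1) -> Iterator[str]:
    """Yield names for successive allocations of one allocation queue.

    Hypothetical helper: each allocation gets the queue name
    followed by its creation order within that queue.
    """
    for index in count(start):
        yield f"{queue_name}-{index}"


names = allocation_names("foo")
first_three = [next(names) for _ in range(3)]
print(first_three)  # ['foo-1', 'foo-2', 'foo-3']
```

Keeping the counter per queue means two queues with different names never collide, while allocations from the same queue stay distinguishable in squeue.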

StHagel commented 1 month ago

I guess the solution with the index is better than without.

Kobzol commented 1 month ago

Oops, according to our documentation, the --name parameter should already have been used to name the allocations, so this was actually a bug. Fixed by https://github.com/It4innovations/hyperqueue/pull/710.