It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

Number of workers is larger than the sum of "workers per alloc" #705

Closed: unkcpz closed this issue 1 month ago

unkcpz commented 1 month ago

The way the allocation queues assign workers is not clear to me. I am using version 0.12.0. Here is the output of my hq worker list:

[eiger][jyu@eiger-ln004 ~]$ hq worker list
+----+---------+-----------+---------------------------+---------+----------------+
| ID | State   | Hostname  | Resources                 | Manager | Manager Job ID |
+----+---------+-----------+---------------------------+---------+----------------+
| 34 | RUNNING | nid001259 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083466        |
| 35 | RUNNING | nid001512 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083468        |
| 37 | RUNNING | nid001260 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083467        |
| 44 | RUNNING | nid002237 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083481        |
| 45 | RUNNING | nid001582 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083488        |
| 46 | RUNNING | nid001584 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083490        |
| 48 | RUNNING | nid001585 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083492        |
| 49 | RUNNING | nid001583 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083496        |
| 50 | RUNNING | nid002157 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083497        |
| 51 | RUNNING | nid002158 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083498        |
| 52 | RUNNING | nid002159 | 8x16 cpus; mem 503.30 GiB | SLURM   | 3083499        |
+----+---------+-----------+---------------------------+---------+----------------+

and my hq alloc list:

+----+--------------+-------------------+-----------+---------+-------+------------------------------------+
| ID | Backlog size | Workers per alloc | Timelimit | Manager | Name  | Args                               |
+----+--------------+-------------------+-----------+---------+-------+------------------------------------+
| 4  | 1            | 1                 | 30m       | SLURM   | aiida | -A,mr32,-C,mc,-p,debug,--mem,497G  |
| 5  | 1            | 1                 | 1day      | SLURM   | aiida | -A,mr32,-C,mc,-p,normal,--mem,497G |
| 6  | 1            | 1                 | 1day      | SLURM   | aiida | -A,mr32,-C,mc,-p,normal,--mem,497G |
| 7  | 1            | 1                 | 1day      | SLURM   | aiida | -A,mr32,-C,mc,-p,normal,--mem,497G |
| 8  | 1            | 1                 | 1day      | SLURM   | aiida | -A,mr32,-C,mc,-p,normal,--mem,497G |
+----+--------------+-------------------+-----------+---------+-------+------------------------------------+

If I understand correctly, the number of workers should match, or at least not exceed, the total number of workers created by the allocation queues. It is also not clear to me when a worker is reclaimed if no more jobs are submitted to it.
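
For context, a queue like the ones above is created with hq alloc add; a rough sketch (reconstructed from the Args column of the table, so the exact flag spelling may differ in 0.12) looks like:

$ hq alloc add slurm --name aiida \
      --backlog 1 --workers-per-alloc 1 --time-limit 1day \
      -- -A mr32 -C mc -p normal --mem 497G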

unkcpz commented 1 month ago

It would also be useful to have a command for checking the load of the workers. Four of my calculations can run on a single node, but the resources do not seem to be filled.

Kobzol commented 1 month ago

Hi, to answer your second question first: this is determined by the idle timeout. It is the duration after which a worker shuts down if it hasn't received any new jobs. When using automatic allocation, the default idle timeout is five minutes. So if you have an allocation with a worker and that worker doesn't receive anything to compute for five minutes, it will shut itself (and the allocation) down.
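
If you want workers to linger for longer (or shorter) before giving up, the idle timeout can be set when the allocation queue is created. A rough sketch (I'm taking the --idle-timeout flag from the current docs, so please check it against the version you have installed):

$ hq alloc add slurm --name aiida --time-limit 1day \
      --idle-timeout 10m \
      -- -A mr32 -C mc -p normal --mem 497G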

Regarding worker load: we have a dashboard that you can run with hq dashboard; however, it is only available in the latest release (0.18), and it is currently very experimental.
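
A minimal usage sketch, assuming 0.18 is installed on the login node:

$ hq dashboard   # opens the terminal dashboard (experimental in 0.18)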

Regarding the mismatch between the two tables: the output of hq alloc list can be quite inaccurate. HQ is extremely conservative about invoking the PBS/Slurm commands that return the current status of the queue, because in the past, when we ran them without any limits, they overloaded the system schedulers. So HQ only asks about once an hour, and it finds out about allocations mostly when a new worker connects to it. It is possible that a newer version of HQ behaves better here; there may have been some fixes since 0.12, which is quite old.
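
In the meantime, Slurm itself is the most reliable source for what is actually allocated, so it can be worth cross-checking the worker list with plain Slurm commands (these are standard Slurm, not HQ-specific), for example:

$ squeue -u $USER -o "%i %T %N %L"   # job id, state, node list, remaining time
$ scontrol show job 3083466          # inspect one of the allocations from the worker list above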

unkcpz commented 1 month ago

Thanks! That is a clear explanation, and I'll try the new version.