It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

Question about max-worker-count #654

Open svatosFZU opened 9 months ago

svatosFZU commented 9 months ago

Hi, I am trying to limit the rate at which our allocation is depleted. For that, I thought the max-worker-count option of the allocation queue would be the way to do it. Now, watching it run, I have a few questions about it. 1.) Is there a way to see that the option was propagated to HQ? The hq alloc list does not show it:

[svatosm@login1.barbora ~]$ /home/svatosm/hq-v0.16.0-linux-x64/hq alloc list
+----+---------+--------------+-------------------+-----------+---------+------+-------------------------+
| ID | State   | Backlog size | Workers per alloc | Timelimit | Manager | Name | Args                    |
+----+---------+--------------+-------------------+-----------+---------+------+-------------------------+
| 5  | RUNNING | 1            | 1                 | 2days     | SLURM   |      | -AOPEN-29-6,-pqcpu,-c36 |
+----+---------+--------------+-------------------+-----------+---------+------+-------------------------+

and hq alloc info 5 does not show the details of the allocation queue settings either.

2.) How does it actually work? According to the docs (https://it4innovations.github.io/hyperqueue/v0.16.0/deployment/allocation/#max-worker-count), it should set the maximum number of workers that can be queued or running. So, I set up the allocation queue like this:

/home/svatosm/hq-v0.16.0-linux-x64/hq alloc add slurm --max-worker-count 10 --backlog 1 --time-limit 48h --worker-time-limit 47h59m --worker-start-cmd "/scratch/project/open-29-6/session/ACmount.sh" --worker-stop-cmd "/scratch/project/open-29-6/session/ACumount.sh" -- -AOPEN-29-6 -pqcpu -c36

With one allocation queue running 1 worker per allocation and a maximum of 10 workers, I assumed that I would have at most 10 running workers/jobs. But looking at the number of running jobs, I see I have 24 of them:

[svatosm@login1.barbora ~]$ /home/svatosm/hq-v0.16.0-linux-x64/hq job list
+------+------+---------+-------+
| ID   | Name | State   | Tasks |
+------+------+---------+-------+
| 3444 | bash | RUNNING | 1     |
| 3445 | bash | RUNNING | 1     |
| 3446 | bash | RUNNING | 1     |
| 3448 | bash | RUNNING | 1     |
| 3449 | bash | RUNNING | 1     |
| 3450 | bash | RUNNING | 1     |
| 3451 | bash | RUNNING | 1     |
| 3452 | bash | RUNNING | 1     |
| 3453 | bash | WAITING | 1     |
| 3454 | bash | RUNNING | 1     |
| 3455 | bash | RUNNING | 1     |
| 3456 | bash | RUNNING | 1     |
| 3457 | bash | RUNNING | 1     |
| 3458 | bash | RUNNING | 1     |
| 3459 | bash | RUNNING | 1     |
| 3461 | bash | WAITING | 1     |
| 3462 | bash | RUNNING | 1     |
| 3463 | bash | RUNNING | 1     |
| 3464 | bash | RUNNING | 1     |
| 3465 | bash | RUNNING | 1     |
| 3466 | bash | WAITING | 1     |
| 3467 | bash | WAITING | 1     |
| 3468 | bash | RUNNING | 1     |
| 3469 | bash | RUNNING | 1     |
| 3470 | bash | WAITING | 1     |
| 3472 | bash | WAITING | 1     |
| 3476 | bash | WAITING | 1     |
| 3478 | bash | WAITING | 1     |
| 3479 | bash | RUNNING | 1     |
| 3480 | bash | WAITING | 1     |
| 3481 | bash | RUNNING | 1     |
| 3482 | bash | WAITING | 1     |
| 3483 | bash | RUNNING | 1     |
| 3484 | bash | RUNNING | 1     |
| 3485 | bash | WAITING | 1     |
| 3486 | bash | WAITING | 1     |
| 3487 | bash | WAITING | 1     |
| 3488 | bash | WAITING | 1     |
| 3489 | bash | WAITING | 1     |
| 3490 | bash | WAITING | 1     |
| 3491 | bash | WAITING | 1     |
| 3492 | bash | WAITING | 1     |
| 3493 | bash | WAITING | 1     |
| 3494 | bash | WAITING | 1     |
| 3495 | bash | WAITING | 1     |
| 3496 | bash | WAITING | 1     |
| 3497 | bash | WAITING | 1     |
+------+------+---------+-------+
There are 3497 jobs in total. Use `--all` to display all jobs.

I see the same when listing workers:

[svatosm@login1.barbora ~]$ /home/svatosm/hq-v0.16.0-linux-x64/hq worker list
+------+---------+-----------------------+---------------------------+---------+----------------+
| ID   | State   | Hostname              | Resources                 | Manager | Manager Job ID |
+------+---------+-----------------------+---------------------------+---------+----------------+
| 1530 | RUNNING | cn34.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115785         |
| 1531 | RUNNING | cn16.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115209         |
| 1532 | RUNNING | cn73.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115378         |
| 1534 | RUNNING | cn71.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115741         |
| 1535 | RUNNING | cn88.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115316         |
| 1536 | RUNNING | cn84.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115331         |
| 1537 | RUNNING | cn42.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115272         |
| 1538 | RUNNING | cn44.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115374         |
| 1539 | RUNNING | cn54.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115649         |
| 1541 | RUNNING | cn99.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115302         |
| 1542 | RUNNING | cn181.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 115397         |
| 1543 | RUNNING | cn168.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 115532         |
| 1544 | RUNNING | cn191.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 115748         |
| 1545 | RUNNING | cn119.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 115422         |
| 1546 | RUNNING | cn111.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116091         |
| 1547 | RUNNING | cn124.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116112         |
| 1548 | RUNNING | cn129.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116113         |
| 1549 | RUNNING | cn187.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116218         |
| 1550 | RUNNING | cn55.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 116233         |
| 1551 | RUNNING | cn22.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 116234         |
| 1552 | RUNNING | cn114.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116235         |
| 1553 | RUNNING | cn8.barbora.it4i.cz   | cpus 2x18; mem 187.36 GiB | SLURM   | 116236         |
| 1554 | RUNNING | cn188.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116246         |
| 1555 | RUNNING | cn186.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116249         |
+------+---------+-----------------------+---------------------------+---------+----------------+

But I do see the limit of 10 in hq alloc info 5 (I dropped previously finished allocations from the list to make it shorter):

[svatosm@login1.barbora ~]$ /home/svatosm/hq-v0.16.0-linux-x64/hq alloc info 5
+--------+----------+----------------------------------------------+--------------+---------------------+---------------------+---------------------+
| ID     | State    | Working directory                            | Worker count | Queue time          | Start time          | Finish time         |
+--------+----------+----------------------------------------------+--------------+---------------------+---------------------+---------------------+
| 116091 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/161 | 1            | 04.01.2024 11:52:34 | 04.01.2024 13:20:58 |                     |
| 116112 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/162 | 1            | 04.01.2024 13:20:58 | 04.01.2024 13:21:15 |                     |
| 116113 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/163 | 1            | 04.01.2024 13:21:15 | 04.01.2024 13:21:31 |                     |
| 116114 | FAILED   | /home/svatosm/.hq-server/003/autoalloc/5/164 | 1            | 04.01.2024 13:21:31 | 04.01.2024 17:53:43 | 04.01.2024 17:53:44 |
| 116218 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/165 | 1            | 04.01.2024 18:01:05 | 04.01.2024 19:11:05 |                     |
| 116233 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/166 | 1            | 04.01.2024 19:11:05 | 04.01.2024 19:14:06 |                     |
| 116234 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/167 | 1            | 04.01.2024 19:14:06 | 04.01.2024 19:14:19 |                     |
| 116235 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/168 | 1            | 04.01.2024 19:19:19 | 04.01.2024 19:19:39 |                     |
| 116236 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/169 | 1            | 04.01.2024 19:24:39 | 04.01.2024 19:25:00 |                     |
| 116237 | FAILED   | /home/svatosm/.hq-server/003/autoalloc/5/170 | 1            | 04.01.2024 19:30:00 | 04.01.2024 19:30:00 | 04.01.2024 19:30:01 |
| 116241 | FAILED   | /home/svatosm/.hq-server/003/autoalloc/5/171 | 1            | 04.01.2024 19:45:00 | 04.01.2024 20:02:26 | 04.01.2024 20:02:27 |
| 116246 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/172 | 1            | 04.01.2024 20:30:00 | 04.01.2024 20:30:32 |                     |
| 116249 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/173 | 1            | 04.01.2024 21:00:32 | 04.01.2024 21:00:46 |                     |
+--------+----------+----------------------------------------------+--------------+---------------------+---------------------+---------------------+

I am rather confused by this situation. So, can max-worker-count be used to limit the number of running jobs?

Kobzol commented 9 months ago

Hi, I'll answer these separately:

1) This option wasn't shown in the alloc queue table, but it will be once this PR is merged.

2) In general, your understanding of the option is correct. If you set the max worker count to N, the given allocation queue will never allow more than N workers to run in parallel. However, note that this parameter only applies to a single allocation queue. If you have multiple allocation queues, you can have more than N workers (and you can, of course, also start additional workers manually).
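For illustration only (the project and partition names below are hypothetical, not taken from this thread): two allocation queues created like this, each capped at 10 workers, could together keep up to 20 autoallocated workers running, because the limit is enforced per queue:

hq alloc add slurm --max-worker-count 10 --backlog 1 --time-limit 48h -- -APROJECT-A -pqcpu
hq alloc add slurm --max-worker-count 10 --backlog 1 --time-limit 48h -- -APROJECT-B -pqcpu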

But looking at the number of running jobs, I see I have 24 of them:

The number of running HyperQueue jobs is affected purely by the amount of available resources. If each job has a single task that requires 1 CPU, and you have a single worker with 128 cores, you could have 128 HQ jobs running at once even though only a single worker is online. So without more context, it's not enough to just look at the number of HQ jobs.
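A hedged sketch of how this plays out on 36-core nodes like the ones in this thread (the 1-CPU case and the script name are assumptions for illustration, using hq submit's --cpus resource request):

# each task asks for 1 CPU, so up to 36 such jobs can run at once on a single 36-core worker
hq submit --cpus=1 -- bash ./task.sh

# each task asks for the whole node, so at most one such job runs per 36-core worker
hq submit --cpus=36 -- bash ./task.sh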

But I do see the limit of 10 in hq alloc info 5 (I dropped previously finished allocations from the list to make it shorter):

This is what you should see. There should be at most 10 running workers.

I see the same when listing workers:

This means that you either have multiple allocation queues, have started workers by some other means, or there is a bug in HQ.

svatosFZU commented 9 months ago

In this case (on Barbora), I have only one allocation queue and one HQ job takes the whole worker node (36 cores). I do not start the jobs in any other way than through that one allocation queue. I can provide a debug log if that helps.

Kobzol commented 9 months ago

You have sent hq alloc info 5, so the previous 4 allocation queues are paused or removed, right? Yes, a debug log and the sequence of commands that you have used since the start of the HQ server (since it seems to have some non-trivial state, e.g. the 5 allocation queues) would be helpful.

svatosFZU commented 9 months ago

OK, the debug log is here (128MB): https://www.fzu.cz/~svatosm/hq-debug-output.log The previous allocation queues were started with different parameters and have all been removed (for some time now). For the sequence of commands, I would need more details on what exactly is needed: I am at around 3k jobs and there are multiple checks of job statuses, so there were probably tens of thousands of HQ commands performed since this server was started.

Kobzol commented 9 months ago

Hmm, something quite weird is happening here. Here is the part of the log for one Slurm allocation with ID 115741:

[2024-01-03T12:20:00.433Z DEBUG hyperqueue::server::autoalloc::process] Worker 1522 connected to allocation 115741
[2024-01-03T17:40:40.434Z DEBUG hyperqueue::server::autoalloc::process] Handling message WorkerLost(WorkerId(1522), ManagerInfo { manager: Slurm, allocation_id: "115741", time_limit: Some(172785s) }, LostWorkerDetails { reason: HeartbeatLost, lifetime: 19240.00128616s })
[2024-01-03T17:40:40.434Z DEBUG hyperqueue::server::autoalloc::process] Removing worker 1522 from allocation 115741
[2024-01-04T10:52:41.832Z DEBUG hyperqueue::server::autoalloc::process] Handling message WorkerConnected(WorkerId(1534), ManagerInfo { manager: Slurm, allocation_id: "115741", time_limit: Some(172779s) })
[2024-01-04T10:52:41.832Z DEBUG hyperqueue::server::autoalloc::process] Registering worker 1534 for allocation 115741
[2024-01-04T10:52:41.832Z DEBUG hyperqueue::server::autoalloc::process] Worker 1534 connected to allocation 115741
[2024-01-04T10:52:41.832Z WARN  hyperqueue::server::autoalloc::process] Allocation 115741 has status Finished { started_at: SystemTime { tv_sec: 1704284400, tv_nsec: 433489969 }, finished_at: SystemTime { tv_sec: 1704303640, tv_nsec: 434765520 }, disconnected_workers: DisconnectedWorkers { workers: Map({WorkerId(1522): LostWorkerDetails { reason: HeartbeatLost, lifetime: 19240.00128616s }}) } } and does not expect new workers

Worker 1522 connected from this allocation on Wednesday at 12:20. It later disconnected at 17:40 because of a lost heartbeat (either the worker crashed or the network connection between the worker and the server was disrupted). Around 17:00, Slurm started claiming that this allocation had moved from the running state to either PENDING (PD) or CONFIGURING, which HQ interprets as being queued. So the first weird thing is why a running allocation switched back to the pending state, which apparently confused HQ.

The second weird thing is that about 17 hours after worker 1522 crashed, a different worker (1534) connected to HQ, claiming to be from the same allocation! That is very strange, because the autoalloc start script never starts another worker inside the same allocation (unless your scripts start a worker, which I doubt). After this happened, HQ got quite confused: it thought that the allocation had finished and therefore started a new allocation, even though worker 1534 was apparently still running (and that is why you see more than 10 workers online).

It almost seems like Slurm executed the allocation, stopped it for some reason, and then restarted it from scratch (running the same original command) under the same allocation ID. If this is indeed what happened, it breaks many assumptions made by the automatic allocator and is the probable cause of your issue.
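One way to check this hypothesis (a hedged suggestion using standard Slurm tooling, with the allocation ID taken from the log above) is to look at the job's requeue counters:

# Requeue=1 means Slurm may requeue the job; Restarts>0 means it actually did
scontrol show job 115741 | grep -E 'Requeue|Restarts'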

svatosFZU commented 9 months ago

According to the Slurm details of job 115741, it was indeed restarted:

[svatosm@login1.barbora ~]$ scontrol show job 115741
JobId=115741 JobName=hq-alloc-5
   UserId=svatosm(4517) GroupId=svatosm(4647) MCS_label=N/A
   Priority=200268119 Nice=0 Account=open-29-6 QOS=2478_3949
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-02:46:50 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2024-01-03T17:04:45 EligibleTime=2024-01-03T17:06:46
   AccrueTime=2024-01-03T17:06:46
   StartTime=2024-01-04T11:52:20 EndTime=2024-01-05T14:39:10 Deadline=N/A
   PreemptEligibleTime=2024-01-04T11:52:20 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-01-04T11:52:20 Scheduler=Main
   Partition=qcpu AllocNode:Sid=login1:4137603
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cn71
   BatchHost=cn71
   NumNodes=1 NumCPUs=36 NumTasks=1 CPUs/Task=36 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=36,mem=190000M,node=1,billing=1
   AllocTRES=cpu=36,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=36 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/svatosm/.hq-server/003/autoalloc/5/154/hq-submit.sh
   WorkDir=/home/svatosm/.hq-server/003/autoalloc/5/154
   StdErr=/home/svatosm/.hq-server/003/autoalloc/5/154/stderr
   StdIn=/dev/null
   StdOut=/home/svatosm/.hq-server/003/autoalloc/5/154/stdout
   Power=

Kobzol commented 9 months ago

OK, thank you, that confirms my suspicion. Well, the automatic allocator has no notion of allocation restarts at the moment (I had no idea that Slurm/PBS is even allowed to do that; they should just give the allocation a new ID, IMO), so this will need more complex design and implementation work in order to be fixed. I'll think about it.
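A possible mitigation to experiment with (a hedged sketch, not a confirmed fix from this thread): since the scontrol output above shows Requeue=1, passing Slurm's --no-requeue flag through the trailing sbatch arguments of hq alloc add should keep Slurm from restarting a failed allocation under the same ID:

hq alloc add slurm --max-worker-count 10 --backlog 1 --time-limit 48h --worker-time-limit 47h59m -- -AOPEN-29-6 -pqcpu -c36 --no-requeue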