svatosFZU opened this issue 9 months ago
Hi, I'll answer these separately:
1) This option wasn't shown in the alloc queue table, but it will be once this PR is merged.
2) In general, your understanding of the option is correct. If you set the max worker count to N, the given allocation queue will never allow more than N workers to run in parallel. However, note that this parameter only applies to a single allocation queue. If you have multiple allocation queues, you can have more than N workers in total (and you can, of course, also start additional workers manually).
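For illustration, a setup with two independent queues could look roughly like this (a minimal sketch; the flag names follow the HQ documentation, while the partitions, time limits and counts are made-up examples):

# Queue A: never allows more than 10 of its workers to be queued or running at once
hq alloc add slurm --time-limit 24h --workers-per-alloc 1 --max-worker-count 10 -- --partition=qcpu
# Queue B: has its own independent limit of 5 workers, so together up to 15 autoalloc workers can exist
hq alloc add slurm --time-limit 24h --workers-per-alloc 1 --max-worker-count 5 -- --partition=qexp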
But looking at the number of running jobs, I see I have 24 of them:
The number of running HyperQueue jobs is affected purely by the amount of available resources. If each job has a single task that requires 1 CPU, and you have a single worker with 128 cores, you could have 128 HQ jobs running even though you have only a single worker. So without more context, it's not enough to just look at the number of HQ jobs.
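As a purely hypothetical illustration (the submit syntax is standard HQ; the sleep workload and numbers are made up):

# 128 single-CPU jobs fit onto one 128-core worker at the same time
for i in $(seq 1 128); do hq submit --cpus=1 -- sleep 600; done
hq job list      # can show 128 running jobs...
hq worker list   # ...while showing only a single worker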
But I see the 10 in hq alloc info 5 (dropping previous finished workers from the list to make it shorter):
This is what you should see. There should be at most 10 running workers.
I see the same when listing workers:
This means that you either have multiple allocation queues, have started workers by some other means, or there is a bug in HQ.
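A quick way to rule out the first two possibilities (the exact output columns depend on your HQ version):

hq alloc list     # should show exactly one active allocation queue
hq worker list    # every running worker should belong to an allocation from that queue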
In this case (on Barbora), I have only one allocation queue, and one HQ job takes up the whole worker node (36 cores). I do not start workers in any other way than through that one allocation queue. I can provide a debug log if that helps.
You have sent hq alloc info 5, so the previous 4 allocation queues are paused or removed, right? Yeah, a debug log and the sequence of commands that you have used since the start of the HQ server (since it seems to have some non-trivial state, e.g. the 5 allocation queues) would be helpful.
OK, the debug log (128 MB) is here: https://www.fzu.cz/~svatosm/hq-debug-output.log
The previous allocation queues were started with different parameters and have all been removed (for some time now). For the sequence of commands, I would need more details on what exactly is needed: as I am at around 3k jobs and there are multiple checks of job statuses, there have probably been tens of thousands of HQ commands performed since this server was started.
Hmm, something quite weird is happening here. Here is the part of the log for one Slurm allocation with ID 115741:
[2024-01-03T12:20:00.433Z DEBUG hyperqueue::server::autoalloc::process] Worker 1522 connected to allocation 115741
[2024-01-03T17:40:40.434Z DEBUG hyperqueue::server::autoalloc::process] Handling message WorkerLost(WorkerId(1522), ManagerInfo { manager: Slurm, allocation_id: "115741", time_limit: Some(172785s) }, LostWorkerDetails { reason: HeartbeatLost, lifetime: 19240.00128616s })
[2024-01-03T17:40:40.434Z DEBUG hyperqueue::server::autoalloc::process] Removing worker 1522 from allocation 115741
[2024-01-04T10:52:41.832Z DEBUG hyperqueue::server::autoalloc::process] Handling message WorkerConnected(WorkerId(1534), ManagerInfo { manager: Slurm, allocation_id: "115741", time_limit: Some(172779s) })
[2024-01-04T10:52:41.832Z DEBUG hyperqueue::server::autoalloc::process] Registering worker 1534 for allocation 115741
[2024-01-04T10:52:41.832Z DEBUG hyperqueue::server::autoalloc::process] Worker 1534 connected to allocation 115741
[2024-01-04T10:52:41.832Z WARN hyperqueue::server::autoalloc::process] Allocation 115741 has status Finished { started_at: SystemTime { tv_sec: 1704284400, tv_nsec: 433489969 }, finished_at: SystemTime { tv_sec: 1704303640, tv_nsec: 434765520 }, disconnected_workers: DisconnectedWorkers { workers: Map({WorkerId(1522): LostWorkerDetails { reason: HeartbeatLost, lifetime: 19240.00128616s }}) } } and does not expect new workers
The worker 1522 connected from this allocation on Wednesday at 12:20. It later disconnected at 17:40 because of a lost heartbeat (either the worker crashed or the network connection between the worker and the server was disrupted). Around 17:00, Slurm started claiming that this allocation had moved from the running state to either PENDING (PD) or CONFIGURING, which HQ interprets as being queued. So the first weird thing is why a running allocation switched back to the pending state, which apparently confused HQ.
The second weird thing is that about 17 hours after worker 1522 crashed, a different worker, 1534, connected to HQ, claiming to be from the same allocation! That is very weird, because the autoalloc start script will never start another worker inside the same allocation (unless your scripts are starting a worker, which I doubt). After this happened, HQ was quite confused: it thought that the allocation was finished, and therefore it started a new allocation, even though worker 1534 is apparently still running (and that's why you see more than 10 workers being online).
It almost seems like Slurm executed the allocation, then stopped it for some reason, and then restarted it from scratch (running the same original command) under the same allocation ID. If this is indeed what happened, then it breaks many assumptions made by the automatic allocator and is the probable cause of your issue.
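If Slurm really did requeue the allocation under the same ID, its accounting should show it; for example (the field names are the ones scontrol normally prints):

scontrol show job 115741 | grep -E "JobState|Requeue|Restarts"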
According to the Slurm details of 115741, it was restarted.
[svatosm@login1.barbora ~]$ scontrol show job 115741
JobId=115741 JobName=hq-alloc-5
UserId=svatosm(4517) GroupId=svatosm(4647) MCS_label=N/A
Priority=200268119 Nice=0 Account=open-29-6 QOS=2478_3949
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=1-02:46:50 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2024-01-03T17:04:45 EligibleTime=2024-01-03T17:06:46
AccrueTime=2024-01-03T17:06:46
StartTime=2024-01-04T11:52:20 EndTime=2024-01-05T14:39:10 Deadline=N/A
PreemptEligibleTime=2024-01-04T11:52:20 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-01-04T11:52:20 Scheduler=Main
Partition=qcpu AllocNode:Sid=login1:4137603
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cn71
BatchHost=cn71
NumNodes=1 NumCPUs=36 NumTasks=1 CPUs/Task=36 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=36,mem=190000M,node=1,billing=1
AllocTRES=cpu=36,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=36 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/home/svatosm/.hq-server/003/autoalloc/5/154/hq-submit.sh
WorkDir=/home/svatosm/.hq-server/003/autoalloc/5/154
StdErr=/home/svatosm/.hq-server/003/autoalloc/5/154/stderr
StdIn=/dev/null
StdOut=/home/svatosm/.hq-server/003/autoalloc/5/154/stdout
Power=
OK, thank you, that confirms my suspicion. Well, the automatic allocator has no notion of allocation restarts at the moment (I had no idea that Slurm/PBS are even allowed to do that; they should just give the allocation a new ID, IMO), so this will need more complex design and implementation work in order to be fixed. I'll think about it.
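In the meantime, one possible mitigation (an untested sketch, not something HQ does for you) would be to ask Slurm not to requeue these jobs at all, by passing sbatch options through the allocation queue:

# everything here except --no-requeue is illustrative; trailing arguments after -- are passed to sbatch
hq alloc add slurm --time-limit 48h --workers-per-alloc 1 --max-worker-count 10 -- --partition=qcpu --account=open-29-6 --no-requeue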
Hi, I am trying to limit the rate at which our allocation is depleted. For that, I thought the max-worker-count option of an allocation queue would be the way to do it. Now, watching it run, I have a few questions about it.

1.) Is there a way to see that the option was propagated to HQ? Neither hq alloc list nor hq alloc info 5 shows the details of the allocation queue settings.

2.) How does it actually work? According to the docs (https://it4innovations.github.io/hyperqueue/v0.16.0/deployment/allocation/#max-worker-count), it should set the maximum number of workers that can be queued or running, so I set up the allocation queue accordingly. With one allocation queue running 1 worker per allocation and a maximum of 10 workers, I assumed that I would have at most 10 running workers/jobs. But looking at the number of running jobs, I see I have 24 of them, and I see the same when listing workers. However, I do see the 10 in hq alloc info 5 (dropping previously finished workers from the list to make it shorter). I am rather confused by this situation. So, can max-worker-count be used to limit the number of running jobs?