It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

server panics and crashes #629

Closed jakub-homola closed 9 months ago

jakub-homola commented 9 months ago

I am trying to utilize HyperQueue to run a large number of small jobs on Karolina.

However, I am facing an issue with the server panicking after about an hour of running:

thread 'main' panicked at 'Invalid worker state, expected Running, got Waiting', crates/hyperqueue/src/server/job.rs:349:18

The server crashes afterwards.

I ran the server with RUST_LOG=hq=debug,tako=debug RUST_BACKTRACE=full; here is the full stderr.

I launched the workers manually, running 2 workers per GPU compute node (one for each socket) with a jobscript like this:

ml HyperQueue
ROCR_VISIBLE_DEVICES= CUDA_VISIBLE_DEVICES=3 numactl -N 0-3 -m 0-3 hq worker start --idle-timeout=1m &
ROCR_VISIBLE_DEVICES= CUDA_VISIBLE_DEVICES=7 numactl -N 4-7 -m 4-7 hq worker start --idle-timeout=1m &
wait

I launched about 7 worker Slurm jobs (so 14 workers) and submitted about 12000 tasks in a bash script loop.
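
The submission loop was roughly of this shape (a minimal sketch only; run_case.sh stands in for the real task script, and the exact hq submit options are assumptions rather than the original commands):

for i in $(seq 1 12000); do
    # each task needs 64 cores, i.e. one whole worker
    hq submit --cpus=64 ./run_case.sh "$i"
done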

Any idea what could be wrong here?

I am running HyperQueue v0.16.0 on Karolina.

Thanks in advance.

Kobzol commented 9 months ago

Hi, thanks for the detailed error report. It looks like some kind of race condition; we will try to take a look as soon as possible.

By the way, why do you start two workers per node?

jakub-homola commented 9 months ago

Thanks.

I am doing some performance tests, and I need to use only a single CPU for that (and a single GPU). As the node has 2 CPUs, I start 2 workers per node, each using its own CPU.

Kobzol commented 9 months ago

I am doing some performance tests, and I need to use only a single CPU for that (and a single GPU). As the node has 2 CPUs, I start 2 workers per node, each using its own CPU.

That sounds a bit suspicious, since you don't specify any other resources or identification parameters for the two workers, which (probably) means that you don't really have a way of specifying which exact worker you want to use (?).

The usual approach is to start a single worker per node, which takes ownership of all available resources, and then select the resources that you want to use for specific tasks using resource requests. You can even specify individual GPU IDs for each task, or use numactl for each task separately if you need more control over NUMA. But you shouldn't need two workers for that.
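
For illustration, with a single worker owning the whole node, that could look roughly like this (a minimal sketch; it assumes the --resource NAME=AMOUNT request syntax and the "compact!" (strict compact) CPU allocation policy, and run_case.sh is a hypothetical task script):

# one worker per node, owning all CPUs and GPUs of the node
hq worker start --idle-timeout=1m

# ask for 1 NVIDIA GPU plus 64 cores packed onto as few NUMA groups as possible
# ("compact!" = strict compact policy); HQ should then export CUDA_VISIBLE_DEVICES
# for the GPU it assigned to the task
hq submit --cpus="64 compact!" --resource "gpus/nvidia=1" ./run_case.sh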

I'm asking because I wonder if your usage of HQ could perhaps be improved, or if you have a use case that is not well supported by HQ at the moment and that we could support in some better way.

jakub-homola commented 9 months ago

The workers should be restricted to their own CPU (using numactl) and GPU (using CUDA_VISIBLE_DEVICES); the worker outputs look like they recognize their resources correctly (or at least as I expect them to):

2023-10-18T07:59:08Z INFO Some cores were filtered by a CPU mask. All cores: ["0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15", "16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31", "32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47", "48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63", "64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79", "80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95", "96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111", "112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127"]. Allowed cores: ["0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15", "16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31", "32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47", "48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63", "", "", "", ""].
2023-10-18T07:59:08Z INFO Detected GPUs 3 from `CUDA_VISIBLE_DEVICES`
2023-10-18T07:59:08Z INFO Detected 1081605935104B of memory (1007.32 GiB)
2023-10-18T07:59:08Z INFO SLURM environment detected
2023-10-18T07:59:08Z INFO Starting hyperqueue worker v0.16.0
2023-10-18T07:59:08Z INFO Connecting to: login1.karolina.it4i.cz:44219
2023-10-18T07:59:08Z INFO Listening on port 41204
2023-10-18T07:59:08Z INFO Connecting to server (candidate addresses = [10.48.24.1:44219])
+-------------------+----------------------------------+
| Worker ID         | 11                               |
| Hostname          | acn72.karolina.it4i.cz           |
| Started           | "2023-10-18T07:59:08.192455777Z" |
| Data provider     | acn72.karolina.it4i.cz:41204     |
| Working directory | /tmp/hq-worker.FThnf3qa4jTG/work |
| Logging directory | /tmp/hq-worker.FThnf3qa4jTG/logs |
| Heartbeat         | 8s                               |
| Idle timeout      | 1m                               |
| Resources         | cpus: 4x0 4x16                   |
|                   | gpus/nvidia: 1                   |
|                   | mem: 1007.32 GiB                 |
| Time Limit        | 1day 23h 59m 58s                 |
| Process pid       | 874                              |
| Group             | 122409                           |
| Manager           | SLURM                            |
| Manager Job ID    | 122409                           |
+-------------------+----------------------------------+
2023-10-18T07:59:09Z INFO Some cores were filtered by a CPU mask. All cores: ["0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15", "16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31", "32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47", "48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63", "64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79", "80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95", "96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111", "112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127"]. Allowed cores: ["", "", "", "", "64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79", "80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95", "96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111", "112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127"].
2023-10-18T07:59:09Z INFO Detected GPUs 7 from `CUDA_VISIBLE_DEVICES`
2023-10-18T07:59:09Z INFO Detected 1081605935104B of memory (1007.32 GiB)
2023-10-18T07:59:09Z INFO SLURM environment detected
2023-10-18T07:59:09Z INFO Starting hyperqueue worker v0.16.0
2023-10-18T07:59:09Z INFO Connecting to: login1.karolina.it4i.cz:44219
2023-10-18T07:59:09Z INFO Listening on port 46114
2023-10-18T07:59:09Z INFO Connecting to server (candidate addresses = [10.48.24.1:44219])
+-------------------+----------------------------------+
| Worker ID         | 12                               |
| Hostname          | acn72.karolina.it4i.cz           |
| Started           | "2023-10-18T07:59:09.264428079Z" |
| Data provider     | acn72.karolina.it4i.cz:46114     |
| Working directory | /tmp/hq-worker.lQNYjNns8piP/work |
| Logging directory | /tmp/hq-worker.lQNYjNns8piP/logs |
| Heartbeat         | 8s                               |
| Idle timeout      | 1m                               |
| Resources         | cpus: 4x0 4x16                   |
|                   | gpus/nvidia: 1                   |
|                   | mem: 1007.32 GiB                 |
| Time Limit        | 1day 23h 59m 57s                 |
| Process pid       | 931                              |
| Group             | 122409                           |
| Manager           | SLURM                            |
| Manager Job ID    | 122409                           |
+-------------------+----------------------------------+

you don't really have a way of specifying which exact worker you want to use (?)

I'm not sure what you mean by that. For a given task, I do not care which worker executes it, so I do not need a way to specify which exact worker to use. The workers are restricted to only use the resources given to them using numactl and CUDA_VISIBLE_DEVICES.

This approach seemed quicker and simpler to me than figuring out how the resources work and how to make sure that the cores belong to a single CPU and that the GPU is the one connected to that CPU. But I will look into it and let you know if there is still a problem.

Kobzol commented 9 months ago

For a given task, I do not care which worker executes it

I see. So if I understand correctly, you are submitting tasks that each use 1 GPU and 64 CPU cores, and you want to make sure that:

1) all 64 allocated cores are on the same NUMA socket, and
2) the allocated GPU is connected to the same NUMA socket as those cores.

Do I understand it correctly?

This is only partially supported by HQ, so you might indeed need two workers for this at the moment. If you start a single worker, HQ can make sure that a task which requires 64 cores will be provided all 64 cores from the same NUMA socket. However, there is currently no way in HQ to "cross-link" different resources together, so you wouldn't be able to say something like "I want GPU X and 64 CPUs connected to the same NUMA socket". The only thing you could say is e.g. "I have 1000 tasks for GPU 0 + 64 CPUs on socket 0 and another 1000 tasks for GPU 1 + 64 CPUs on socket 1" (i.e., you would have to manually decide the NUMA socket for each task beforehand, which is not optimal w.r.t. load balancing).
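
With your current two-workers-per-socket layout, by contrast, the pairing is enforced by the worker itself, so each task can stay a plain request and HQ places it on whichever socket-worker is free. Roughly (a sketch, reusing the hypothetical run_case.sh from above and assuming the --resource NAME=AMOUNT syntax):

# each worker owns exactly one socket and one GPU, so a plain request is enough
hq submit --cpus=64 --resource "gpus/nvidia=1" ./run_case.sh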

Btw, we have found the issue causing the crash; we will try to implement a fix soon and send you a patched version (and probably also release a minor version bump of HQ). The issue was caused by a task failing to even start. It would also be useful for us to see the logs from the workers, if you could run them with RUST_LOG=tako=debug,hq=debug and send the logs to us. Thank you!

jakub-homola commented 9 months ago

Do I understand it correctly?

Yes, exactly like that.

I could just ignore HQ's GPU resource, use only CPUs and manual pinning, and, in the task, set CUDA_VISIBLE_DEVICES to the GPU closest to the cores listed in HQ_CPUS.
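
Something along these lines (a sketch only: HQ_CPUS is the comma-separated list of cores HQ assigned to the task, the 0-63/64-127 split and the GPU IDs 3 and 7 come from my jobscript above, and actual_benchmark stands for the real executable):

#!/bin/bash
# hypothetical task wrapper: pick the GPU on the same socket as the assigned cores
first_cpu="${HQ_CPUS%%,*}"               # first core ID from the comma-separated list
if [ "$first_cpu" -lt 64 ]; then
    export CUDA_VISIBLE_DEVICES=3        # cores 0-63   -> socket 0 -> GPU 3
else
    export CUDA_VISIBLE_DEVICES=7        # cores 64-127 -> socket 1 -> GPU 7
fi
numactl --physcpubind="$HQ_CPUS" ./actual_benchmark "$@"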

Btw, we have found the issue causing the crash ...

Great :)

... and send the logs to us

Will do

jakub-homola commented 9 months ago

So here are the logs from a new run, both from the server and the workers: 20231018-124526.zip

I ran it using 5 worker jobs (so 10 workers); I am not sure if one would be enough to trigger the problem. The logs are named hqworker_, then $SLURM_JOB_ID, then 1 or 2 (depending on which worker on the node it was), then err or out, and .txt.

Kobzol commented 9 months ago

Ok, now it's clear. The issue is the disk quota:

[2023-10-18T11:14:09.899Z DEBUG tako::internal::worker::reactor] Task initialization failed id=6704, error=IoError(Os { code: 122, kind: FilesystemQuotaExceeded, message: "Disk quota exceeded" })

Of course, HQ shouldn't crash (I will send a PR with a fix soon), but even after the fix, you would just start seeing these kinds of errors in the HQ task status.

Kobzol commented 9 months ago

The bug should be fixed in the main branch now. You can use this build to test if it has helped.

jakub-homola commented 9 months ago

That's weird, I am now checking the disk quota, and I am well below the limits: 17/25 GB and 250k/500k entries. I would guess the number of entries would be the issue. How many files does HQ typically generate? I would assume two (stdout and stderr) for each task, then I have my own separate stdout and stderr, maybe some temporaries ... I'd say no more than 10 per task, so 120k files, which is still within my limit. And despite the quota exceeded error, the numbers I am seeing are well below the limit.

Anyway, I will move all the HQ-related folders to /scratch to avoid the possible disk quota limitation.
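
For instance, something like this (a minimal sketch with hypothetical /scratch paths; it assumes hq submit's --stdout/--stderr options and the %{JOB_ID}/%{TASK_ID} path placeholders):

hq submit --cpus=64 \
    --stdout="/scratch/project/xyz/hq-out/%{JOB_ID}-%{TASK_ID}.stdout" \
    --stderr="/scratch/project/xyz/hq-out/%{JOB_ID}-%{TASK_ID}.stderr" \
    ./run_case.sh

The server directory itself (which lives under the home directory by default) can presumably be relocated the same way via the --server-dir option.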

And thanks for the fix, I will try it out.

Kobzol commented 9 months ago

It is indeed weird, because it seems that only some of the tasks have failed.

The patched version should now include a more detailed error message that tells you at which file path the quota problem occurred; that should help.