It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

Starting worker on machine with nvidia GPU but without nvidia-smi/CUDA #685

Open linusseelinger opened 4 months ago

linusseelinger commented 4 months ago

I am trying to start a HQ worker directly on my system (Fedora). I have an nvidia GPU and nvidia's proprietary driver installed, but not the CUDA package. The latter seems to include nvidia-smi, which I don't have.

Launching a worker gives a corresponding error, which seems to loop infinitely with 1s delays:

 ./hq worker start
2024-03-05T14:13:34Z INFO Detected 1 GPUs from procs
2024-03-05T14:13:34Z INFO Detected 33304358912B of memory (31.02 GiB)
2024-03-05T14:13:34Z INFO Starting hyperqueue worker nightly-2024-02-28-d42cc6563708f799c921b3d05678adc5fcef2744
2024-03-05T14:13:34Z INFO Connecting to: xps-9530:33635
2024-03-05T14:13:34Z INFO Listening on port 36431
2024-03-05T14:13:34Z INFO Connecting to server (candidate addresses = [[fe80::5fcd:941f:68f6:5efc%2]:33635, [2a00:1398:200:202:9d65:e4b1:e28b:b0e0]:33635, 172.23.213.13:33635])
+-------------------+----------------------------------+
| Worker ID         | 2                                |
| Hostname          | xps-9530                         |
| Started           | "2024-03-05T14:13:34.162287491Z" |
| Data provider     | xps-9530:36431                   |
| Working directory | /tmp/hq-worker.lJaUBMB2LjvD/work |
| Logging directory | /tmp/hq-worker.lJaUBMB2LjvD/logs |
| Heartbeat         | 8s                               |
| Idle timeout      | None                             |
| Resources         | cpus: 20                         |
|                   | gpus/nvidia: 1                   |
|                   | mem: 31.02 GiB                   |
| Time Limit        | None                             |
| Process pid       | 150177                           |
| Group             | default                          |
| Manager           | None                             |
| Manager Job ID    | N/A                              |
+-------------------+----------------------------------+
2024-03-05T14:13:35Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:36Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:37Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:38Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:39Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
...

Could the worker be modified to run in that situation, e.g. by simply disabling GPU support if nvidia-smi is not available?
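As a rough illustration of that idea only (this is not HyperQueue's actual worker code, and the helper name is made up here), the worker could probe for nvidia-smi once at startup and skip NVIDIA GPU monitoring entirely if the probe fails:

use std::process::Command;

// Probe once at startup; `nvidia-smi -L` lists GPUs and exits quickly.
// An Err here (typically NotFound) means the binary is not installed.
fn nvidia_smi_available() -> bool {
    Command::new("nvidia-smi").arg("-L").output().is_ok()
}

fn main() {
    let monitor_gpus = nvidia_smi_available();
    if !monitor_gpus {
        eprintln!("nvidia-smi not found; NVIDIA GPU monitoring disabled");
    }
    // ... the periodic overview loop would then only query GPU state
    // when `monitor_gpus` is true ...
}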

Kobzol commented 4 months ago

Hi, by default workers automatically scan the usage of their node (including GPUs) every second, and this data is sent regularly to the server. Unless you use the dashboard, this information isn't currently used for anything, so you can disable it if you want:

$ hq worker start --overview-interval 0s

Does this remove the error from the log for you?

linusseelinger commented 4 months ago

Thanks a lot for your quick reply! Indeed, that option removes the error message from logs.

Turns out I had my own bug blocking the UM-Bridge code we are building on top of hyperqueue, so I thought the worker just never became responsive...

It might still be useful to limit logging for this kind of error, or to turn this particular one into a warning if it doesn't interfere with regular operation?

Kobzol commented 4 months ago

Yeah, that could be worth doing. I'm not sure how to reliably recognize whether the error is transient or whether nvidia-smi just isn't there at all, though. I'll try to add better detection of this.
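One way to tell the two cases apart, sketched here purely as an illustration (hypothetical types and names, not the real worker code), is to treat an I/O error of kind NotFound as permanent and everything else as potentially transient:

use std::io::ErrorKind;
use std::process::Command;

enum GpuProbe {
    // The binary is missing; retrying will not help.
    Unavailable,
    // The call failed for another reason and may succeed later.
    Transient(std::io::Error),
    // nvidia-smi ran; its stdout is returned.
    Ok(String),
}

fn probe_nvidia_smi() -> GpuProbe {
    match Command::new("nvidia-smi").arg("-L").output() {
        Err(e) if e.kind() == ErrorKind::NotFound => GpuProbe::Unavailable,
        Err(e) => GpuProbe::Transient(e),
        Ok(out) => GpuProbe::Ok(String::from_utf8_lossy(&out.stdout).into_owned()),
    }
}

fn main() {
    match probe_nvidia_smi() {
        GpuProbe::Unavailable => eprintln!("nvidia-smi not installed; stop polling it"),
        GpuProbe::Transient(e) => eprintln!("transient nvidia-smi failure: {e}"),
        GpuProbe::Ok(listing) => print!("{listing}"),
    }
}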

spirali commented 4 months ago

> Yeah, that could be worth doing. I'm not sure how to reliably recognize whether the error is transient or whether nvidia-smi just isn't there at all, though. I'll try to add better detection of this.

I would guess that nvidia-smi normally does not fail. So a quick fix: when the first error occurs, stop calling nvidia-smi in all subsequent data-collection iterations.
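A minimal sketch of that quick fix (again with made-up names, not the actual HyperQueue collector): keep a flag on the collector, log the first failure once, and skip the call in every later overview iteration:

use std::process::Command;

struct NvidiaCollector {
    disabled: bool,
}

impl NvidiaCollector {
    fn new() -> Self {
        Self { disabled: false }
    }

    // Returns the raw `nvidia-smi -L` output, or None once collection has
    // been disabled by an earlier failure.
    fn collect(&mut self) -> Option<String> {
        if self.disabled {
            return None;
        }
        match Command::new("nvidia-smi").arg("-L").output() {
            Ok(out) => Some(String::from_utf8_lossy(&out.stdout).into_owned()),
            Err(e) => {
                // Log once, then never call nvidia-smi again from this worker.
                eprintln!("disabling NVIDIA GPU monitoring: {e}");
                self.disabled = true;
                None
            }
        }
    }
}

fn main() {
    let mut collector = NvidiaCollector::new();
    // In the worker this would run inside the periodic overview loop.
    for _ in 0..3 {
        let _ = collector.collect();
    }
}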