linusseelinger opened this issue 4 months ago

I am trying to start an HQ worker directly on my system (Fedora). I have an NVIDIA GPU and NVIDIA's proprietary driver installed, but not the CUDA package. The latter seems to include `nvidia-smi`, which I don't have. Launching a worker gives a corresponding error, which seems to loop infinitely with 1s delays.

Could the worker be modified to run in that condition, e.g. just remove GPU support if `nvidia-smi` is not available?
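Something like the following probe is what I have in mind (a rough Rust sketch, assuming the worker shells out to `nvidia-smi`; the function name is made up, this is not HyperQueue's actual code):

```rust
use std::process::{Command, Stdio};

// Probe once at worker startup; if `nvidia-smi` cannot even be
// spawned, GPU monitoring could be disabled entirely instead of
// retrying (and logging an error) every second.
fn nvidia_smi_available() -> bool {
    Command::new("nvidia-smi")
        .arg("-L") // list GPUs; a cheap way to check that the tool runs
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```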
Hi, workers automatically scan the usage of their node (including GPUs) every second by default, and this data is sent to the server regularly. Unless you use the dashboard, this information isn't currently used for anything, so you can disable it if you want:
$ hq worker start --overview-interval 0s
Does this remove the error from the log for you?
Thanks a lot for your quick reply! Indeed, that option removes the error message from the logs.
Turns out I had a bug of my own blocking the UM-Bridge code we are building on top of HyperQueue, so I thought the worker just never became responsive...
Still, it might be useful to limit logging for this kind of error, or to turn this particular one into a warning if it doesn't interfere with regular operation?
Yeah, that could be worth doing. I'm not sure how to reliably recognize whether the error is transient or whether the binary just doesn't exist at all, though. I'll try to add some better detection of this.
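For what it's worth, spawning a missing executable fails with a distinctive error kind, so one rough way to separate the two cases (a sketch assuming the stats come from spawning `nvidia-smi`; the helper and error type are hypothetical, not HyperQueue's actual API):

```rust
use std::io::ErrorKind;
use std::process::Command;

#[derive(Debug)]
enum GpuStatsError {
    Missing,                   // binary not found: permanent, stop trying
    Transient(std::io::Error), // any other I/O error: may retry later
}

fn query_gpu_usage() -> Result<String, GpuStatsError> {
    match Command::new("nvidia-smi")
        .args(["--query-gpu=utilization.gpu", "--format=csv,noheader"])
        .output()
    {
        Ok(out) => Ok(String::from_utf8_lossy(&out.stdout).into_owned()),
        // Spawning a nonexistent program yields `ErrorKind::NotFound`.
        Err(e) if e.kind() == ErrorKind::NotFound => Err(GpuStatsError::Missing),
        Err(e) => Err(GpuStatsError::Transient(e)),
    }
}
```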
I would guess that `nvidia-smi` normally does not fail. So just a quick fix: when the first error occurs, stop calling it in all subsequent data-collection iterations.
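Roughly like this (a minimal sketch with made-up names, not HyperQueue's actual internals):

```rust
use std::process::Command;

// Remember the first failure and skip `nvidia-smi` in every later
// data-collection iteration.
struct GpuMonitor {
    disabled: bool,
}

impl GpuMonitor {
    fn collect(&mut self) -> Option<String> {
        if self.disabled {
            return None; // failed once before; never call nvidia-smi again
        }
        match Command::new("nvidia-smi")
            .args(["--query-gpu=utilization.gpu", "--format=csv,noheader"])
            .output()
        {
            Ok(out) if out.status.success() => {
                Some(String::from_utf8_lossy(&out.stdout).into_owned())
            }
            _ => {
                // First failure: latch off GPU collection instead of
                // retrying (and logging an error) every second.
                self.disabled = true;
                None
            }
        }
    }
}
```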