Open com-data opened 1 week ago
Hi! Thanks for the report. Here is what is happening:
1) You start an HQ worker.
2) You specify that the allocation manager should be Slurm, so the worker tries to detect some Slurm values from the environment.
3) It asks Slurm for the job's remaining time limit via `scontrol show job <SLURM_JOB_ID>`. Unexpectedly, Slurm returns `INVALID`, which (obviously) cannot be parsed as a duration. Not sure why Slurm returns this :man_shrugging:
Now, arguably HQ should skip reading these values instead of crashing here. On the other hand, if you explicitly ask for Slurm and the remaining time limit cannot be determined, the worker would silently start without a time limit, which could be annoying in some cases. So crashing here at least tells you that something went wrong.
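The failing step can be sketched roughly like this (a simulation, not HQ's actual code; the `scontrol` output is faked and the field name `TimeLimit=` is an assumption about Slurm's `Key=Value` format):

```shell
# Simulated scontrol output for a job whose time limit Slurm reports as INVALID.
SCONTROL_OUT="JobId=123 TimeLimit=INVALID Partition=debug"

# Extract the value of the TimeLimit field.
TIME_LEFT=$(printf '%s' "$SCONTROL_OUT" | grep -oE 'TimeLimit=[^ ]+' | cut -d= -f2)

if [ "$TIME_LEFT" = "INVALID" ] || [ "$TIME_LEFT" = "UNLIMITED" ]; then
  # HQ v0.19.0 fails to parse this as a duration and crashes at this point.
  echo "no parsable time limit: $TIME_LEFT"
else
  echo "remaining time limit: $TIME_LEFT"
fi
```

This is only meant to illustrate why an `INVALID` value cannot be turned into a duration.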
As a hotfix, you can try `--manager none`.
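In a Slurm batch script, the workaround could look roughly like this (a sketch; the `#SBATCH` values are placeholders, and only the `--manager none` flag is taken from the discussion above):

```shell
#!/bin/bash
#SBATCH --time=01:00:00      # placeholder allocation time limit

# Skip manager auto-detection, so HQ does not query scontrol for the
# remaining time limit (and therefore cannot crash on an INVALID value).
hq worker start --manager none
```

Note that with `--manager none` the worker will not inherit any time limit from Slurm, so it may be killed abruptly when the allocation ends.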
As a somewhat separate question, do you have a specific reason for running the HQ server within a Slurm allocation? It's a valid use case, but normally you can also run it on a login node, which should be more ergonomic.
Thank you so much for your help; it works now without losing workers. Starting the HQ server on a login node would be a good choice, but when I do that and try to start workers using Slurm jobs, I get an error along the lines of "access token found but server is not reachable". This makes me think that in this setup the HQ server and the workers cannot communicate. Looking into the HQ documentation, it seems that fixing such communication problems may not be possible for a user without admin rights.
Thank you again for designing and improving HQ.
I see. On most HPC clusters that we have tried, login and compute nodes can communicate without problems. But if this is not possible on your cluster, then indeed you'll need to run the server inside allocations.
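One way to check whether a compute node can reach the server at all is a plain TCP probe from inside an allocation. This is a hedged sketch: the host and port below are placeholders (the real values are stored in HQ's server directory, by default under `~/.hq-server`), and it uses bash's `/dev/tcp` redirection:

```shell
# Placeholders -- replace with the host/port from your HQ server directory.
SERVER_HOST=login01
SERVER_PORT=37000

# Try to open a TCP connection within 3 seconds; report the outcome.
RESULT=$(timeout 3 bash -c "exec 3<>/dev/tcp/$SERVER_HOST/$SERVER_PORT" 2>/dev/null \
  && echo reachable || echo unreachable)
echo "$RESULT"
```

If this prints `unreachable` from a compute node, the firewall or network setup is blocking worker-to-server traffic, which would explain the "server is not reachable" error.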
Hello,
Firstly, thank you for developing HQ.
I recently came across a crash while testing HQ v0.19.0. The submission script submitted to the Slurm manager is as follows:
The error message is pasted below. Interestingly, it appears in only some of the otherwise identical runs. Please let me know if I should provide more information for debugging.