albanie / slurm_gpustat

A simple command line tool to show GPU usage on a SLURM cluster

squeue: error: Invalid job format specification: tres-per-node #7

Closed ThibaultGROUEIX closed 3 years ago

ThibaultGROUEIX commented 3 years ago

Thanks for the useful tool! The error in the title shows up when slurm_gpustat is called. Cheers

albanie commented 3 years ago

Hi @ThibaultGROUEIX, thanks for raising this!

I think the command that's failing is squeue -O tres-per-node,nodelist,username,jobid --noheader, which slurm_gpustat calls internally (it seems the tres-per-node field listed in the squeue documentation is being rejected). Unfortunately, the current implementation relies on this field to count the number of GPUs in use, so it might be a little tricky to work around.
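To illustrate why the tool depends on that field, here is a minimal sketch of how the output of the squeue command above could be turned into a GPU count per user. It is not the actual slurm_gpustat code: the function names `parse_gpu_request` and `gpus_per_user` are hypothetical, and the row layout (four whitespace-separated columns, with values such as `gpu:2` or `gpu:tesla:1` in the first one) is an assumption about what squeue prints.

```python
from collections import defaultdict


def parse_gpu_request(tres):
    """Return the GPU count for one tres-per-node value.

    Assumed formats: 'gpu', 'gpu:2', 'gpu:tesla:2'. Anything that does
    not mention 'gpu' (for example 'N/A') counts as zero GPUs.
    """
    parts = tres.split(":")
    if "gpu" not in parts:
        return 0
    last = parts[-1]
    # A bare 'gpu' or 'gpu:tesla' carries no number; treat it as one GPU.
    return int(last) if last.isdigit() else 1


def gpus_per_user(squeue_output):
    """Sum the per-node GPU requests for each user.

    Expects rows in the column order tres-per-node, nodelist, username,
    jobid. Rows without exactly four columns (for example pending jobs
    with an empty nodelist) are skipped.
    """
    totals = defaultdict(int)
    for row in squeue_output.splitlines():
        tokens = row.split()
        if len(tokens) != 4:
            continue
        tres, _nodelist, user, _jobid = tokens
        totals[user] += parse_gpu_request(tres)
    return dict(totals)
```

On a working cluster the input string could come from `subprocess.run(["squeue", "-O", "tres-per-node,nodelist,username,jobid", "--noheader"], capture_output=True, text=True).stdout`. On the SLURM versions discussed in this issue, that call is the one that fails with the "Invalid job format specification" error.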

Would you mind posting the version of SLURM you are using, so I can update the README to warn others about this problem? For reference, the version I test things on is slurm 18.08.7.

Thanks!

jonatasgrosman commented 3 years ago

Hello @albanie, I'm having the same problem here. My SLURM version is 17.02.11-Bull.1.1

ThibaultGROUEIX commented 3 years ago

Hi @albanie, sorry for the late reply. I didn't see your answer until I ran into this problem again, looked for a fix, and found my own issue ^^ My version is slurm 17.02.7. Best regards

emiliojorge commented 3 years ago

I am getting the same message on 17.11.2. Could it be that GPU resources are not tracked because there is nothing along the lines of AccountingStorageTRES=gres/gpu,gres/gpu:tesla in slurm.conf? See the docs. Edit: I realized it is probably just due to the old SLURM version. I will try to update SLURM at some point. Edit 2: It seems to work on SLURM 19.

albanie commented 3 years ago

Thanks both - I will update the README to reflect that there are issues on older versions.
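For anyone who wants to check up front whether their cluster is likely to be affected, here is a rough sketch that parses a version string of the kind quoted in this thread (for example the output of `squeue --version`). The helper names are hypothetical and not part of slurm_gpustat. The (18, 8) threshold is an assumption taken only from the reports above (17.02.x and 17.11.2 fail, 18.08.7 and 19 work); the exact release that introduced tres-per-node is not confirmed here.

```python
import re

# Assumption: lowest (major, minor) release reported as working in this thread.
MIN_WORKING_VERSION = (18, 8)


def parse_slurm_version(text):
    """Extract (major, minor) from strings such as 'slurm 17.02.11-Bull.1.1'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def probably_supports_tres_per_node(text):
    """Guess whether 'squeue -O tres-per-node' will be accepted.

    This is a heuristic based on user reports, not a documented guarantee.
    """
    return parse_slurm_version(text) >= MIN_WORKING_VERSION
```

With the versions reported in this issue, the check returns False for 17.02.7, 17.02.11-Bull.1.1 and 17.11.2, and True for 18.08.7.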

albanie commented 3 years ago

I will close this for now, because I don't have a way to debug (I sadly don't have access to older SLURM versions for development) - but feel free to re-open if it's useful to discuss further.

yuhui-zh15 commented 3 years ago

Thanks to the authors for sharing this amazing tool! A simple change is enough to support SLURM 17. See https://github.com/yuhui-zh15/slurm_gpustat/commit/b4814b31b4ee036a0548b7212307741d4b8b71a6. You can try it with pip install git+https://github.com/yuhui-zh15/slurm_gpustat.git.