Closed talesa closed 3 years ago
This PR makes it almost work for me in ARC, but I had another problem afterwards. Sometimes we get tokens as:
tokens ['gpu', 'arcus-htc-***, ***, ***]
In which case we can't get num_gpus. I'm not sure what the solution is here. Should num_gpus be assumed to be 1?
I've never used ARC so I don't know how the output is formatted there or how many GPUs per node are there.
@talesa - thanks a lot for this! @frankier, would you be able to paste input/output samples (so I can check that the PR doesn't break stuff for you) when merging?
So the situation is that I'm +1 on merging this PR because it takes this script from not working to almost working for me. What makes it fully work for me is also adding the following, which just throws away rows without a number of GPUs specified. However, this probably won't produce an accurate result, since it throws away data. If you merge the PR, I can go ahead and put this in a more detailed new issue if/when I can replicate it.
*** slurm_gpustat.py.1 2020-12-14 06:40:58.643007000 +0000
--- slurm_gpustat.py 2020-11-30 14:03:42.053292000 +0000
***************
*** 527,536 ****
--- 527,538 ----
  # ignore pending jobs
  if len(tokens) < 4 or not tokens[0].startswith("gpu"):
      continue
  gpu_count_str, node_str, user, jobid = tokens
  gpu_count_tokens = gpu_count_str.split(":")
+ if len(gpu_count_tokens) == 1:
+     continue
  num_gpus = int(gpu_count_tokens[-1])
  if len(gpu_count_tokens) == 2:
      gpu_type = None
  elif len(gpu_count_tokens) == 3:
      gpu_type = gpu_count_tokens[1]
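For discussion, here is a minimal sketch of how the two options (skip the row, or assume a default count) could sit behind one helper. The function name parse_gpu_field and the default_num_gpus parameter are hypothetical and not part of slurm_gpustat.py. Whether 1 is a sensible default on ARC is an open question, so the sketch skips the row unless a default is passed explicitly.

```python
from typing import Optional, Tuple


def parse_gpu_field(
    field: str, default_num_gpus: Optional[int] = None
) -> Optional[Tuple[Optional[str], int]]:
    """Parse a field such as "gpu", "gpu:2" or "gpu:p40:2".

    Returns (gpu_type, num_gpus), or None when the row should be skipped.
    Hypothetical helper, not the script's actual code.
    """
    parts = field.split(":")
    if parts[-1].isdigit():
        # The last component is the GPU count.
        num_gpus = int(parts[-1])
        middle = parts[1:-1]
    else:
        # No count present (e.g. a bare "gpu"): skip the row unless the
        # caller explicitly opted in to a default. This is an assumption.
        if default_num_gpus is None:
            return None
        num_gpus = default_num_gpus
        middle = parts[1:]
    gpu_type = middle[0] if middle else None
    return gpu_type, num_gpus
```

With this, parse_gpu_field("gpu") returns None (same behaviour as the continue in the diff above), while parse_gpu_field("gpu", default_num_gpus=1) counts the row as one GPU of unknown type.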
The default width of 30 chars was too short to print some of our node names, so I increased the limit.
Some of our GPUs had weird names, so I changed the parsing to use a regexp. I added examples of some weird outputs I got on some nodes to the following page, where contributors can keep iterating on the regexp to cover other weird strings they encounter: https://regex101.com/r/RHYM8Z/3
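To illustrate the general idea of regexp-based parsing, here is a rough sketch. The pattern below is not the one from the regex101 link or from this PR. It is a made-up example, and the sample strings in the comments are assumed formats rather than real output from our nodes. The names GRES_PATTERN and parse_gres are hypothetical.

```python
import re
from typing import Optional, Tuple

# Illustrative pattern only: "gpu", an optional type, a count, and an
# optional parenthesised suffix, e.g. "gpu:v100:4(S:0-1)" or "gpu:4".
GRES_PATTERN = re.compile(
    r"^gpu:(?:(?P<type>[^:()]+):)?(?P<count>\d+)(?:\(.*\))?$"
)


def parse_gres(gres: str) -> Optional[Tuple[Optional[str], int]]:
    """Return (gpu_type, num_gpus), or None if the string does not match."""
    match = GRES_PATTERN.match(gres)
    if match is None:
        return None
    return match.group("type"), int(match.group("count"))
```

Strings that do not match return None instead of raising, so new odd formats can be collected and added to the regexp test cases rather than crashing the script.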