albanie / slurm_gpustat

A simple command line tool to show GPU usage on a SLURM cluster

Added --partition argument and adjusted node & GPU name parsing to work on our cluster #6

Closed by talesa 3 years ago

talesa commented 3 years ago

The default 30-character limit was too short to print some of our node names, so I increased it.

Some of our GPUs had unusual names, so I changed the parsing to use a regexp. I added examples of the odd outputs I got on some nodes to the following page, where contributors can keep iterating on the regexp to cover other unusual strings they encounter: https://regex101.com/r/RHYM8Z/3
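As a rough illustration of the approach (the actual pattern used by the PR is the one on the regex101 link above; this is a simplified sketch, and `parse_gres` is a hypothetical helper name), a regexp can pull the optional GPU type and the count out of a SLURM gres string:

```python
import re

# Simplified sketch of regex-based gres parsing. Handles strings like
# "gpu:4" (no type) and "gpu:v100:4" (with type); the real pattern in the
# PR covers more of the unusual vendor-specific names.
GRES_RE = re.compile(r"gpu:(?:(?P<type>[^:(]+):)?(?P<count>\d+)")

def parse_gres(gres):
    """Return (gpu_type, num_gpus); gpu_type is None when the string omits it."""
    match = GRES_RE.search(gres)
    if match is None:
        return None, 0
    return match.group("type"), int(match.group("count"))
```

For example, `parse_gres("gpu:v100:2")` yields `("v100", 2)`, while `parse_gres("gpu:2")` yields `(None, 2)`.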

frankier commented 3 years ago

This PR makes it almost work for me in ARC, but I had another problem afterwards. Sometimes we get tokens as:

tokens ['gpu', 'arcus-htc-***, ***, ***]

In which case we can't get num_gpus. I'm not sure what the solution is here. Should num_gpus be assumed to be 1?

talesa commented 3 years ago

I've never used ARC, so I don't know how the output is formatted there or how many GPUs per node there are.

albanie commented 3 years ago

@talesa - thanks a lot for this! @frankier, would you be able to paste input/output samples (so I can check that the PR doesn't break stuff for you) when merging?

frankier commented 3 years ago

So the situation is that I'm +1 on merging this PR, because it takes this script from not working to almost working for me. What makes it fully work is also adding the following, which simply throws away rows without a GPU count specified. This probably won't produce an accurate result, though, since it discards data. If you merge the PR, I can put this in a more detailed new issue if/when I can replicate it.

*** slurm_gpustat.py.1  2020-12-14 06:40:58.643007000 +0000
--- slurm_gpustat.py    2020-11-30 14:03:42.053292000 +0000
***************
*** 527,536 ****
--- 527,538 ----
          # ignore pending jobs
          if len(tokens) < 4 or not tokens[0].startswith("gpu"):
              continue
          gpu_count_str, node_str, user, jobid = tokens
          gpu_count_tokens = gpu_count_str.split(":")
+         if len(gpu_count_tokens) == 1:
+             continue
          num_gpus = int(gpu_count_tokens[-1])
          if len(gpu_count_tokens) == 2:
              gpu_type = None
          elif len(gpu_count_tokens) == 3:
              gpu_type = gpu_count_tokens[1]
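Pulled out of the surrounding loop, the guarded parsing in the diff above amounts to something like this (a sketch for discussion, assuming `gpu_count_str` has the shapes shown, e.g. "gpu:2" or "gpu:v100:2"; `parse_gpu_count` is a hypothetical helper name, not one from the script):

```python
def parse_gpu_count(gpu_count_str):
    """Parse a SLURM GPU field like "gpu:2" or "gpu:v100:2" into (gpu_type, num_gpus).

    Returns None for malformed rows (e.g. a bare "gpu" with no count),
    mirroring the `continue` guard added in the diff.
    """
    gpu_count_tokens = gpu_count_str.split(":")
    if len(gpu_count_tokens) == 1:
        # No ":"-separated count present (the ARC case) -- skip this row.
        return None
    num_gpus = int(gpu_count_tokens[-1])
    # Three tokens means a type sits in the middle, e.g. "gpu:v100:2".
    gpu_type = gpu_count_tokens[1] if len(gpu_count_tokens) == 3 else None
    return gpu_type, num_gpus
```

Skipping such rows keeps the script running, but as noted it undercounts whatever those rows represent, so assuming one GPU per skipped row might be closer to the truth on some clusters.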