OleHolmNielsen / Slurm_tools

My tools for the Slurm HPC workload manager
GNU General Public License v3.0

listing available gpus #3

Open dougbevan opened 5 years ago

dougbevan commented 5 years ago

Very useful software! How can we list the available vs. used GRES for GPUs?

For instance, if I do:

pestat -G

This is partially good, as I can see the GRES being used. But it doesn't show the GRES available.

For CPUs, you get to see used/total (in my case 0/48). How can I get a similar output for GPUs?

OleHolmNielsen commented 5 years ago

I believe the sinfo command will give you the desired information about GRES in nodes. For example, use this command:

sinfo -o "%P %G %D %N"

dougbevan commented 5 years ago

That does give the total GPUs. It would be amazing to have output like this in pestat though, which gives a number of useful metrics all in one place and offers a great "quick glance" view for our users.

With pestat -G we get a great output for cpus like:

Use/Tot 0 48

It would be useful to also see something like:

GRES GPUs Use / Tot 2 8

OleHolmNielsen commented 5 years ago

I understand now, so I've added a new column GRES/node which is printed if you select the -G flag. Can you try out the new script and tell me if this does what you want?

dougbevan commented 5 years ago

This is excellent. I tried it on one of our single-node systems, and I see the available GPUs and the GRES/job. Thanks for the addition -- this will be quite useful.

OleHolmNielsen commented 5 years ago

I'm glad this works for you! Please report any issues back to me.

cheekykite commented 3 years ago

Hello.

Thanks for providing a good tool.

"GRES/job" is not showing up in our clustered environment. Could you share your opinion on what might be wrong?

master:pestat]#
master:pestat]# ./pestat  -G
GRES (Generic Resource) is printed after each jobid
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  GRES/   Joblist
                            State Use/Tot              (MB)     (MB)  node    JobId User GRES/job ...
      n1        titanxp*     idle   0   6    0.07     60000    62869  gpu:TitanXP:2
      n2        titanxp*     idle   0   6    0.01     60000    62952  gpu:TitanXP:2
      n3        titanxp*     idle   0   6    0.01     60000    62860  gpu:TitanXP:2
      n4        titanxp*     idle   0   6    0.01     60000    62891  gpu:TitanXP:2
      n5        titanxp*     idle   0   6    0.01     60000    62971  gpu:TitanXP:2
      n6        titanxp*     idle   0   6    0.09     60000    62945  gpu:TitanXP:2
      n7        titanxp*     idle   0   6    0.01     60000    63096  gpu:TitanXP:2
      n8        titanxp*     idle   0   6    0.02     60000    63084  gpu:TitanXP:2
      n9        titanxp*      mix   4   6    2.39*    60000    49649  gpu:TitanXP:2 2367 sonic  2360 sonic
     n10        titanxp*     idle   0   6    0.01     60000    63082  gpu:TitanXP:2
master:pestat]#
master:pestat]#
master:pestat]# sinfo --version
slurm 18.08.8
master:pestat]#
master:pestat]# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)
master:pestat]#
master:pestat]#
OleHolmNielsen commented 3 years ago

You're running an old and obsolete version of Slurm. Later versions have significantly improved GPU support, so maybe that's why you don't get the expected information.

The pestat command obtains information from Slurm with:

sinfo -h -N $partition $hostlist $statelist -o "%N %P %C %O %m %e %t %Z %G"

where the %G option prints the generic resources (GRES) associated with the nodes. Please check "man sinfo" in your Slurm version to see whether %G exists.

OleHolmNielsen commented 3 years ago

Can you please test the latest version of pestat? The GRES/job is now being printed correctly.

clue2 commented 2 years ago

It might be helpful to change the formatting from -o to -O to make use of the extra formatting options (such as GresUsed)

sinfo -h -N $partition $hostlist $statelist -o "%N %P %C %O %m %e %t %Z %G"

becomes:

sinfo -h -N $partition $hostlist $statelist -O "Nodes,Partition,CPUsState,CPUsLoad,Memory,FreeMem,StateCompact,Threads,Gres"

You could then add in GresUsed (ideally cleaning it up a bit) to achieve a more helpful overview of how many GPUs are in use/available on a node.
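The "cleaning it up a bit" part could be sketched like this: pull just the used count out of a GresUsed string. The string format gpu:TitanXP:1(IDX:0) is an assumption based on the screenshots here; on a real cluster the input would come from something like sinfo -h -N -O "NodeList,GresUsed" rather than echo.

```shell
# Sketch only: extract the used-GPU count from a GresUsed string.
# Assumed formats: "gpu:TitanXP:1(IDX:0)" or simply "gpu:0".
echo "gpu:TitanXP:1(IDX:0)" |
awk -F'[:(]' '{
    # Print the first purely numeric field, skipping the type/name fields
    # and the trailing "(IDX:...)" part.
    for (i = 1; i <= NF; i++)
        if ($i ~ /^[0-9]+$/) { print $i; break }
}'
```

This prints 1 for the sample string above; the same awk fragment could be embedded in the pestat pipeline.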


yzs981130 commented 2 years ago

change the formatting from -o to -O

https://github.com/OleHolmNielsen/Slurm_tools/blob/21ef8a6852b08b94ff004e9ab6ad6065376fce21/pestat/pestat#L340

I believe pestat already uses -O to retrieve information.

I ran into the same problem and also wanted a node-level GresUsed in the output of pestat. Therefore, I added it in my personal fork: https://github.com/yzs981130/Slurm_tools/commit/7e711af416d28bd98c1e9af71a23d686b3a10044. Hope it can help you!

cc @OleHolmNielsen What do you think about the node-level GresUsed? Since it is my first time using awk, I could send a draft PR if you think it is also needed.

clue2 commented 2 years ago

You're right, it does use -O now - I hadn't actually checked the code & was just going by the comments above. Thanks!

In that case, just changing Gres to GresUsed does a good enough job.

In your fork the formatting has become a bit off for me.

OleHolmNielsen commented 2 years ago

Thank you for your suggestion. The GRES output shows how many GPUs are physically in the node.

With "pestat -G" the GRES used by each job on the node is printed. One could count manually how many GPUs are used.

I agree that the "sinfo -O GRESUSED" gives a useful summary of how many GPUs are in use.

However, I think that printing both GRES and GRESUSED data makes the output very long and difficult to read.

Maybe the output could be simplified to a "Num_GPU" column showing just the "Use/Tot" numbers. However, some fairly complicated parsing of the GRES and GRESUSED fields would be needed.
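The Use/Tot parsing could look roughly like this. The helper name parse_count and the sample strings are hypothetical, assuming GRES like "gpu:TitanXP:2" and GRESUSED like "gpu:TitanXP:1(IDX:0)"; on a real cluster both fields would come from something like sinfo -h -N -O "NodeList,Gres,GresUsed".

```shell
# Sketch only: derive a "Use/Tot" pair from assumed GRES/GRESUSED strings.
parse_count() {
    # Extract the first purely numeric field, ignoring type, name,
    # and any trailing "(IDX:...)" part.
    echo "$1" | awk -F'[:(]' '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+$/) { print $i; exit }
    }'
}

tot=$(parse_count "gpu:TitanXP:2")          # total GPUs in the node
use=$(parse_count "gpu:TitanXP:1(IDX:0)")   # GPUs currently allocated
echo "GPU Use/Tot $use/$tot"
```

For the sample strings this prints "GPU Use/Tot 1/2"; real GRES strings vary per site, so the field handling would need hardening.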

There could be non-GPU types of GRES, see https://slurm.schedmd.com/gres.conf.html

Do you have suggestions for making the output of pestat more useful and simple to read?

OleHolmNielsen commented 2 years ago

Note added: Sites have to define their own GRES types in slurm.conf using the GresTypes parameter. It can become complex for "pestat" to decode all possible GRES types and extract numbers for "Use/Tot".
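One way to sidestep the variety of GresTypes would be to filter for gpu-type entries only. The comma-separated sample string below is made up for illustration (a node advertising gpu, mps, and bandwidth GRES); only the gpu entries contribute to the count.

```shell
# Sketch only: sum the gpu counts in a comma-separated GRES string,
# ignoring any other site-defined GresTypes (mps, bandwidth, ...).
echo "gpu:TitanXP:2,mps:200,bandwidth:lustre:no_consume:4G" |
awk -F, '{
    total = 0
    for (i = 1; i <= NF; i++)
        if ($i ~ /^gpu:/) {          # keep only gpu-type GRES
            n = split($i, f, ":")
            total += f[n] + 0        # last field holds the count
        }
    print "gpu total:", total
}'
```

This prints "gpu total: 2" for the sample line; exotic GRES syntax (no_consume flags, per-socket counts) would still need special-casing.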