Open dougbevan opened 5 years ago
I believe the sinfo command will give you the desired information about GRES in nodes. For example, use this command:
sinfo -o "%P %G %D %N"
That does give the total GPUs. It would be amazing to have output like this in pestat though, which gives a number of useful metrics all in one output and gives our users a great "quick glance".
With pestat -G we get a great output for cpus like:
Use/Tot 0 48
It would be useful to also see something like:
GRES GPUs Use / Tot 2 8
I understand now, so I've added a new column GRES/node which is printed if you select the -G flag. Can you try out the new script and tell me if this does what you want?
This is excellent. I tried it on one of our single node systems, and I see the available gpu and the GRES/job. Thanks for the addition -- this will be quite useful.
I'm glad this works for you! Please report any issues back to me.
Hello.
Thanks for providing a good tool.
"GRES/job" is not showing up in a clustered environment. Can I get an opinion?
master:pestat]#
master:pestat]# ./pestat -G
GRES (Generic Resource) is printed after each jobid
Hostname  Partition    Node  Num_CPU  CPUload  Memsize  Freemem  GRES/          Joblist
                      State  Use/Tot              (MB)     (MB)  node           JobId User GRES/job ...
n1        titanxp*     idle    0   6     0.07    60000    62869  gpu:TitanXP:2
n2        titanxp*     idle    0   6     0.01    60000    62952  gpu:TitanXP:2
n3        titanxp*     idle    0   6     0.01    60000    62860  gpu:TitanXP:2
n4        titanxp*     idle    0   6     0.01    60000    62891  gpu:TitanXP:2
n5        titanxp*     idle    0   6     0.01    60000    62971  gpu:TitanXP:2
n6        titanxp*     idle    0   6     0.09    60000    62945  gpu:TitanXP:2
n7        titanxp*     idle    0   6     0.01    60000    63096  gpu:TitanXP:2
n8        titanxp*     idle    0   6     0.02    60000    63084  gpu:TitanXP:2
n9        titanxp*      mix    4   6    2.39*    60000    49649  gpu:TitanXP:2  2367 sonic 2360 sonic
n10       titanxp*     idle    0   6     0.01    60000    63082  gpu:TitanXP:2
master:pestat]#
master:pestat]#
master:pestat]# sinfo --version
slurm 18.08.8
master:pestat]#
master:pestat]# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)
master:pestat]#
master:pestat]#
You're running an old and obsolete version of Slurm. Later versions have significantly improved GPU support, so maybe that's why you don't get the expected information.
The pestat command obtains information from Slurm with:
sinfo -h -N $partition $hostlist $statelist -o "%N %P %C %O %m %e %t %Z %G"
where the %G option prints:
%G Generic resources (gres) associated with the nodes.
Please check "man sinfo" in your Slurm version to see if %G exists.
Can you please test the latest version of pestat? The GRES/job is now being printed correctly.
It might be helpful to change the formatting from -o to -O to make use of the extra formatting options (such as GresUsed)
sinfo -h -N $partition $hostlist $statelist -o "%N %P %C %O %m %e %t %Z %G"
becomes:
sinfo -h -N $partition $hostlist $statelist -O "NodeList,Partition,CPUsState,CPUsLoad,Memory,FreeMem,StateCompact,Threads,Gres"
You could then add GresUsed (ideally cleaning it up a bit) to give a more helpful overview of how many GPUs are in use or available on a node.
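As a rough illustration of the "cleaning it up a bit" step, here is a minimal sketch. It assumes GresUsed values look like gpu:TitanXP:1(IDX:0), i.e. a GRES string followed by an index list in parentheses; check the output of your own Slurm version. The helper name clean_gres_used is hypothetical and not part of pestat.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of pestat): read lines on stdin and
# strip the "(IDX:...)" index suffix that GresUsed is assumed to append.
clean_gres_used() {
    sed -E 's/\(IDX:[^)]*\)//g'
}

# Possible usage on a cluster (not executed here). The ":30"/":40" suffixes
# widen the columns so that long GRES strings are not truncated:
#   sinfo -h -N -O "NodeList:20,Gres:30,GresUsed:40" | clean_gres_used
```

The filter only shortens the text; it leaves the GRES type and count untouched, so the result can still be compared against the Gres column.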
change the formatting from -o to -O
I believe pestat already uses -O to retrieve information.
I have the same need as you: a node-level GresUsed in the output of pestat. Therefore, I added it in my personal fork: https://github.com/yzs981130/Slurm_tools/commit/7e711af416d28bd98c1e9af71a23d686b3a10044. Hope it can help you!
cc @OleHolmNielsen What do you think about the node-level GresUsed? Since it is my first time using awk, I could send a draft PR if you think it is also needed.
You're right, it does use -O now - I hadn't actually checked the code & was just going by the comments above. Thanks!
In that case, just changing Gres to GresUsed does a good enough job.
In your fork the formatting has become a bit off for me:
Thank you for your suggestion. The GRES output shows how many GPUs are physically in the node.
With "pestat -G" the GRES used by each job on the node is printed. One could count manually how many GPUs are used.
I agree that the "sinfo -O GRESUSED" gives a useful summary of how many GPUs are in use.
However, I think that printing both GRES and GRESUSED data makes the output very long and difficult to read.
Maybe one could think of simplifying by having a "Num_GPU" column with simply the "Use/Tot" numbers. Some complicated parsing of GRES and GRESUSED would be needed.
There could be non-GPU types of GRES, see https://slurm.schedmd.com/gres.conf.html
Do you have suggestions for making the output of pestat more useful and simple to read?
Note added: Sites have to define their own GRES types in slurm.conf using the GresTypes parameter. It can become complex for "pestat" to decode all possible GRES types and extract numbers for "Use/Tot".
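One possible way to get the "Use/Tot" numbers for GPUs only is sketched below. It is a minimal illustration, not pestat code: the function names gpu_count and gpu_use_tot are hypothetical, and it assumes GRES strings of the form gpu:TYPE:COUNT or gpu:COUNT, optionally followed by a parenthesised suffix such as (IDX:0-1). Entries of any other GRES type are ignored, which sidesteps the problem of decoding site-defined types.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of pestat): sum the GPU counts in a
# Slurm GRES string such as "gpu:TitanXP:2" or "gpu:a100:4,mps:100".
gpu_count() {
    printf '%s\n' "$1" | awk '{
        gsub(/\([^)]*\)/, "")            # drop "(IDX:...)", "(null)" and similar
        n = split($0, entries, ",")      # one entry per GRES type
        total = 0
        for (i = 1; i <= n; i++) {
            m = split(entries[i], f, ":")
            # count only "gpu" entries whose last field is a plain number
            if (f[1] == "gpu" && m >= 2 && f[m] ~ /^[0-9]+$/)
                total += f[m]
        }
        print total
    }'
}

# Print "used total" for one node, given its Gres and GresUsed strings.
gpu_use_tot() {
    printf '%s %s\n' "$(gpu_count "$2")" "$(gpu_count "$1")"
}
```

For example, gpu_use_tot 'gpu:TitanXP:2' 'gpu:TitanXP:1(IDX:0)' prints "1 2", which matches the "Use/Tot" layout already used for CPUs. The parenthesised parts are removed before splitting on commas because an index list may itself contain commas.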
This is very useful software. How can we list the available vs. used GRES for GPUs?
For instance, if I do:
pestat -G
This is partially good, as I can see the GRES being used. But it doesn't show the GRES available.
For CPUs, you get to see used/total (in my case 0/48). How can I get a similar output for GPUs?