CLIP-HPC / SlurmCommander

Slurm TUI

GPU Heterogeneity Issues #16

Closed · reedacus25 closed this issue 1 year ago

reedacus25 commented 1 year ago

I have a host that has multiple GPUs of different models.

{
  "gres": "gpu:p100:6(S:0),gpu:rtx:2(S:0)",
  "gres_drained": "N/A",
  "gres_used": "gpu:p100:0(IDX:N/A),gpu:rtx:0(IDX:N/A)"
}

scom (v1.0.4/21cee5ddc47eaad02dbdc37809f38085e194e6bf, and previously 1.0.0) reports only 2 GPUs for this system.
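
For illustration, here is a rough sketch (in Go, with a hypothetical helper, not scom's actual code) of how I would expect a heterogeneous GRES string like the one above to be summed, giving 8 rather than 2:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sumGres sums GPU counts across all models in a GRES string such as
// "gpu:p100:6(S:0),gpu:rtx:2(S:0)". Hypothetical helper, for illustration only.
func sumGres(gres string) int {
	total := 0
	for _, entry := range strings.Split(gres, ",") {
		// Strip the socket/index suffix, e.g. "(S:0)".
		if i := strings.Index(entry, "("); i != -1 {
			entry = entry[:i]
		}
		// Entries look like "gpu:p100:6" or "gpu:6"; the count is the last field.
		fields := strings.Split(entry, ":")
		if n, err := strconv.Atoi(fields[len(fields)-1]); err == nil {
			total += n
		}
	}
	return total
}

func main() {
	fmt.Println(sumGres("gpu:p100:6(S:0),gpu:rtx:2(S:0)")) // prints 8
}
```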

If I go to the cluster tab, select the node, and pull up its statistics, the stats reported for that node are as follows:

Selected node:
Arch           : x86_64
Features       : $features
TRES           : cpu=88,mem=750G,billing=88,gres/gpu=8,gres/gpu:p100=6,gres/gpu:rtx=2
TRES Used      :
GRES           : gpu:p100:6(S:0),gpu:rtx:2(S:0)
GRES Used      : gpu:p100:0(IDX:N/A),gpu:rtx:0(IDX:N/A)
Partitions     : $partitions

Hopefully that's helpful. I appreciate the great tool!

pja237 commented 1 year ago

Hey @reedacus25

Thank you very much for the bug report; it's exactly what I need to pinpoint where the issue is 👍 All I'll ask of you now is a little bit of patience (and apologies for the delayed response), since I'm in the middle of a move. If @timeu isn't able to find time to tackle this before then, I'll start work on it as soon as the move is finished and I'm settled in (hopefully by next week).

Leaving this open...

pja237 commented 1 year ago

Hey @reedacus25, just to let you know: I'm back online after the move and should hopefully be able to clean up this issue soon.

pja237 commented 1 year ago

@reedacus25 Can you check the artifact from this build and let me know if it's all working as expected now? I've done some synthetic testing locally, but unfortunately I don't have a heterogeneous GPU cluster at my disposal to try it live.

https://github.com/CLIP-HPC/SlurmCommander/actions/runs/4134059883
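
Roughly, the idea is to track GRES counts per GPU model instead of a single number, so that used and total can be paired per type. A simplified sketch of that approach (not the exact code from the PR; the names here are made up):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gresPerModel parses strings like "gpu:p100:6(S:0),gpu:rtx:2(S:0)" or
// "gpu:p100:0(IDX:N/A),gpu:rtx:0(IDX:N/A)" into a map keyed by GPU model.
// Illustrative only; not the actual SlurmCommander implementation.
func gresPerModel(gres string) map[string]int {
	counts := map[string]int{}
	for _, entry := range strings.Split(gres, ",") {
		// Drop suffixes such as "(S:0)" or "(IDX:N/A)".
		if i := strings.Index(entry, "("); i != -1 {
			entry = entry[:i]
		}
		fields := strings.Split(entry, ":") // e.g. ["gpu", "p100", "6"]
		if len(fields) < 3 {
			continue
		}
		if n, err := strconv.Atoi(fields[2]); err == nil {
			counts[fields[1]] += n
		}
	}
	return counts
}

func main() {
	total := gresPerModel("gpu:p100:6(S:0),gpu:rtx:2(S:0)")
	used := gresPerModel("gpu:p100:0(IDX:N/A),gpu:rtx:0(IDX:N/A)")
	for model, t := range total {
		fmt.Printf("GPU %s: used/total: %d/%d\n", model, used[model], t)
	}
}
```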

reedacus25 commented 1 year ago

@pja237 From a cursory glance, it looks to be working as expected: the cluster tab shows the correct summed GPU count, and both sets of GPUs show up in the utilization bars at the top.

CPU used/total: 0/88                        GPU p100: used/total: 0/6
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0%    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0%
MEM used/total: 0/768000                    GPU rtx: used/total: 0/2
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0%    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0%

pja237 commented 1 year ago

Hey, glad to hear it. I'll do the release later today. I've updated the PR with some more testing, fixed a small issue, and took the liberty of adding you to the contributors list: https://github.com/CLIP-HPC/SlurmCommander/blob/ee4dd407d99f45408af50715c42e67cefb64618f/internal/model/view.go#L50-L58 Hope that's OK. Would you like me to update it with your name before I merge and release, or is this fine with you?