Open lars-t-hansen opened 1 week ago
This is not good but points at least in some directions: https://www.techygpu.com/2024/08/17/how-to-limit-gpu-power-draw/
Intel GPUs, if anyone cares: https://community.intel.com/t5/Intel-Graphics-Performance/How-to-get-GPU-power-in-Watt/m-p/1610960
So I guess (nvidia):
ps
reading

A typical card reading (nvidia-smi -q -d POWER) looks like so:
GPU 00000000:86:00.0
    GPU Power Readings
        Power Draw              : 19.10 W
        Current Power Limit     : 250.00 W
        Requested Power Limit   : 250.00 W
        Default Power Limit     : 250.00 W
        Min Power Limit         : 100.00 W
        Max Power Limit         : 280.00 W
    Power Samples
        Duration                : 118.29 sec
        Number of Samples       : 119
        Max                     : 19.56 W
        Min                     : 18.24 W
        Avg                     : 19.01 W
    GPU Memory Power Readings
        Power Draw              : N/A
    Module Power Readings
        Power Draw              : N/A
        Current Power Limit     : N/A
        Requested Power Limit   : N/A
        Default Power Limit     : N/A
        Min Power Limit         : N/A
        Max Power Limit         : N/A
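For what it's worth, a reading shaped like the one above is easy to pick apart. Here is a rough Python sketch, not a proposal for the real implementation. The function names `parse_power_readings` and `parse_value` are made up for illustration, and the parser assumes the simple two-level shape shown above (section header lines, then `key : value` lines); it would not cope with the deeper nesting of the full `nvidia-smi -q` output.

```python
import re

# A GPU header line looks like "GPU 00000000:86:00.0" (PCI domain:bus:device.function).
GPU_HEADER = re.compile(
    r"GPU ([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
)

def parse_value(value):
    """Turn '19.10 W' into 19.1, 'N/A' into None; leave anything else as a string.
    Units are dropped here, which is an assumption that may not suit every field."""
    if value == "N/A":
        return None
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)(?:\s+\S+)?", value)
    return float(m.group(1)) if m else value

def parse_power_readings(text):
    """Return {bus_address: {section_name: {field_name: value}}}."""
    gpus = {}
    gpu = section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = GPU_HEADER.fullmatch(line)
        if m:
            gpu = gpus.setdefault(m.group(1), {})
            section = None
        elif gpu is None:
            continue  # anything before the first GPU header is ignored
        elif " : " in line:
            key, value = line.split(" : ", 1)
            if section is not None:
                section[key.strip()] = parse_value(value.strip())
        else:
            # A line without " : " is taken to be a section header.
            section = gpu.setdefault(line, {})
    return gpus

SAMPLE = """\
GPU 00000000:86:00.0
    GPU Power Readings
        Power Draw              : 19.10 W
        Max Power Limit         : 280.00 W
    Power Samples
        Number of Samples       : 119
    Module Power Readings
        Power Draw              : N/A
"""

readings = parse_power_readings(SAMPLE)
print(readings["00000000:86:00.0"]["GPU Power Readings"]["Power Draw"])
```

The result is keyed by bus address because that is the only identifier present in this output.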
It's more than a little annoying that the GPU is identified by its PCI bus address instead of the GPU number used elsewhere, but I guess we'll just build a map.
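One way the map could be built, as a sketch only: as far as I know `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader` prints one `index, bus address` pair per card, but the field names should be checked against `nvidia-smi --help-query-gpu` on the target driver. The function names below are hypothetical, and the two-card input in the usage line is invented for illustration.

```python
import csv
import io
import subprocess

def parse_index_map(csv_text):
    """Map PCI bus address -> GPU index from 'index, bus_id' CSV lines."""
    mapping = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 2:
            continue  # skip blank or unexpected lines
        index, bus_id = row
        # Normalize hex digits so lookups do not depend on case.
        mapping[bus_id.strip().upper()] = int(index)
    return mapping

def query_index_map():
    """Run nvidia-smi and build the map. Assumes nvidia-smi is on PATH and
    that the query fields named here exist on the installed driver."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_index_map(out)

# Invented two-card example input:
print(parse_index_map("0, 00000000:3B:00.0\n1, 00000000:86:00.0\n"))
```

Keeping the parsing separate from the subprocess call means the parsing can be tested on a machine without a GPU.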
Just nvidia-smi -q yields a ton of information about all sorts of things; if we encode everything it's going to be a little expensive. On the other hand, for the benchmarking we want to do we probably want more rather than less.
The thing to do here is probably to:
If desired, we could also add power limits to the ps output, if they are easily adjustable at runtime by the operator. Alternatively, we could just run sysinfo more often (say, hourly) to capture this setting.
I suppose we could just grab the output from nvidia-smi -q, quote it somehow (base64), and ship the whole thing off and let the client sort it out. It's 40K uncompressed but just 3K compressed (binary), so it's not like it's a lot of data. It's probably more work to compress it on the client than to parse it, but I'm not sure if anyone cares; most of the work is in exfiltration anyway. The risk is that it's not enough data, e.g. because the mapping from card addresses to card indices needs to be exposed somehow.
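To make the compress-and-quote idea concrete, here is a minimal Python sketch. zlib is my assumption; nothing above says which compressor produced the 3K figure. `pack` and `unpack` are hypothetical names. Note that base64 grows the compressed bytes by about a third, so 3K of binary becomes roughly 4K of text.

```python
import base64
import zlib

def pack(text: str) -> str:
    """Compress text with zlib (assumed compressor) and quote it as base64 ASCII."""
    compressed = zlib.compress(text.encode("utf-8"), 9)
    return base64.b64encode(compressed).decode("ascii")

def unpack(blob: str) -> str:
    """Inverse of pack: base64-decode, then decompress."""
    return zlib.decompress(base64.b64decode(blob)).decode("utf-8")

# Stand-in for real nvidia-smi -q output, which is similarly repetitive.
sample = "        Power Draw              : N/A\n" * 200
blob = pack(sample)
print(len(sample), len(blob))
```

The receiving side would call `unpack` and then parse the text however it likes, which keeps the sender trivial.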
At least these:
It may be that some of these (power level) change often enough that sysinfo is not the right venue for them, but it's a start.