NordicHPC / sonar

Tool to profile usage of HPC resources by regularly probing processes using ps.
GNU General Public License v3.0

Collect more sysinfo data #200

Open lars-t-hansen opened 1 week ago

lars-t-hansen commented 1 week ago

At least these:

It may be that some of these (power level) change often enough that sysinfo is not the right venue for them, but it's a start.

lars-t-hansen commented 4 days ago

This is not good but points at least in some directions: https://www.techygpu.com/2024/08/17/how-to-limit-gpu-power-draw/

lars-t-hansen commented 4 days ago

Intel GPUs, if anyone cares: https://community.intel.com/t5/Intel-Graphics-Performance/How-to-get-GPU-power-in-Watt/m-p/1610960

lars-t-hansen commented 4 days ago

nvidia-smi: https://linuxconfig.org/how-to-set-nvidia-power-limit-on-ubuntu

lars-t-hansen commented 4 days ago

So I guess (nvidia):

A typical card reading (nvidia-smi -q -d POWER) looks like so:

GPU 00000000:86:00.0
    GPU Power Readings
        Power Draw                        : 19.10 W
        Current Power Limit               : 250.00 W
        Requested Power Limit             : 250.00 W
        Default Power Limit               : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 280.00 W
    Power Samples
        Duration                          : 118.29 sec
        Number of Samples                 : 119
        Max                               : 19.56 W
        Min                               : 18.24 W
        Avg                               : 19.01 W
    GPU Memory Power Readings
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A

It's more than a little annoying that the GPU is identified by some address instead of the GPU number used elsewhere, but I guess we'll just build a map.
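A minimal sketch of that parsing and mapping step, in Python for illustration only (it is not taken from sonar's code). It assumes the text layout shown in the sample above, and it assumes the address-to-index map can be built from the CSV produced by `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader`. The function names `parse_power_readings` and `parse_index_map` are hypothetical.

```python
import re

# Abbreviated copy of the sample reading above, used only for demonstration.
SAMPLE = """\
GPU 00000000:86:00.0
    GPU Power Readings
        Power Draw                        : 19.10 W
        Current Power Limit               : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 280.00 W
    Module Power Readings
        Power Draw                        : N/A
"""

# An unindented "GPU <address>" line starts the block for one card.
GPU_LINE = re.compile(r"GPU\s+([0-9A-Fa-f:.]+)\s*$")


def _value(raw):
    """Turn '19.10 W' into 19.1 and 'N/A' into None; keep anything else as text."""
    raw = raw.strip()
    if raw == "N/A":
        return None
    if raw.endswith(" W"):
        return float(raw[:-2])
    return raw


def parse_power_readings(text):
    """Return {bus_address: {section_name: {field_name: value}}}."""
    gpus = {}
    gpu = None
    section = None
    for line in text.splitlines():
        if not line.strip():
            continue
        m = GPU_LINE.match(line)
        if m:
            gpu = gpus.setdefault(m.group(1).upper(), {})
            section = None
        elif gpu is None:
            continue  # preamble before the first card (timestamp, driver version)
        elif ":" in line:
            key, _, raw = line.partition(":")
            if section is not None:
                section[key.strip()] = _value(raw)
        else:
            # Lines without a colon name a section, e.g. "GPU Power Readings".
            section = gpu.setdefault(line.strip(), {})
    return gpus


def parse_index_map(csv_text):
    """Map bus address -> GPU index from lines like '0, 00000000:86:00.0'."""
    mapping = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, _, bus_id = line.partition(",")
        mapping[bus_id.strip().upper()] = int(index)
    return mapping
```

Joining the two results on the bus address gives per-index power limits. Addresses are upper-cased on both sides because the two commands are not guaranteed here to agree on hex case; that normalization is a precaution, not something observed in the thread.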

lars-t-hansen commented 4 days ago

Just nvidia-smi -q yields a ton of information about all sorts of things; if we encode everything it's going to be a little expensive. On the other hand, for the benchmarking we want to do we probably want more rather than less.

lars-t-hansen commented 4 days ago

The thing to do here is probably to:

If desired, we could also add power limits to the ps output, if they are easily adjustable at runtime by the operator. Alternatively, we just run sysinfo more often to capture this setting (say, hourly).

lars-t-hansen commented 4 days ago

I suppose we could just grab the output from nvidia-smi -q, quote it somehow (base64), and ship the whole thing off and let the client sort it out. It's 40K uncompressed but just 3K compressed (binary), so it's not like it's a lot of data. It's probably more work to compress it on the client than to parse it, but I'm not sure if anyone cares - most of the work is in exfiltration anyway. The risk is that it's not enough data, e.g. because the mapping from card addresses to card indices needs to be exposed somehow.
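A rough sketch of the compress-then-quote idea, again in Python for illustration. The thread does not name a compressor, so zlib is an assumption here, and `pack` and `unpack` are hypothetical helper names. Note that base64 adds roughly a third on top of the compressed size, so the shipped text is somewhat larger than the binary figure quoted above.

```python
import base64
import zlib


def pack(text):
    """Compress the raw nvidia-smi text and quote it as ASCII-safe base64."""
    compressed = zlib.compress(text.encode("utf-8"), 9)
    return base64.b64encode(compressed).decode("ascii")


def unpack(payload):
    """Reverse of pack(): what the receiving side would run before parsing."""
    return zlib.decompress(base64.b64decode(payload)).decode("utf-8")
```

The receiving side calls `unpack` and then parses the text as it sees fit. If the address-to-index mapping is needed, it would have to travel alongside this payload as a separate field, since the `-q` text alone may not carry it.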