ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

Fixing VRAM reporting with --json set #120

Closed jerome3o closed 1 year ago

jerome3o commented 1 year ago

This is a fix for getting accurate VRAM readings when the --json flag is set.

Before the change, when running rocm-smi --json --showmemuse I get:

{
  "card0": {
    "GPU memory use (%)": "0",
    "Memory Activity": "N/A"
  },
  "card1": {
    "GPU memory use (%)": "0",
    "Memory Activity": "N/A"
  }
}

After this fix I get:

{
  "card0": {
    "GPU memory use (%)": "0.10168946267106549",
    "GPU memory use": "17453056",
    "GPU memory available": "17163091968",
    "Memory Activity": "N/A"
  },
  "card1": {
    "GPU memory use (%)": "0.10171332783479961",
    "GPU memory use": "17457152",
    "GPU memory available": "17163091968",
    "Memory Activity": "N/A"
  }
}

I have only tested locally with 2 RX6800s and: AMD ROCm System Management Interface | ROCM-SMI version: 1.4.1 | Kernel version: 5.18.13

If interested, I needed this to set up a prometheus exporter in python: https://github.com/jerome3o/rocm-prom-metrics

Let me know if there is anything I can do to help get this merged :)

dmitrii-galantsev commented 1 year ago

--showmemuse works by reading a sysfs file. See /sys/class/drm/card0/device/mem_busy_percent (or card1, and so on) on your system with an AMD GPU :)
I don't think we can/should change that functionality completely.
Can you use rocm-smi --json --showmeminfo vram and calculate the usage % from there? That would be roughly equivalent to your PR.
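For illustration, the calculation suggested above could be sketched in Python roughly as follows. The two JSON key names are assumptions about what rocm-smi --json --showmeminfo vram emits and may differ between versions, so check them against your own output; the function names are hypothetical.

```python
import json
import subprocess

# ASSUMPTION: these key names are a guess at the fields emitted by
# `rocm-smi --json --showmeminfo vram`; verify against your own output.
TOTAL_KEY = "VRAM Total Memory (B)"
USED_KEY = "VRAM Total Used Memory (B)"


def vram_use_percent(raw_json):
    """Map each card name to its VRAM usage as a percentage of total VRAM."""
    usage = {}
    for card, fields in json.loads(raw_json).items():
        # Skip entries that do not carry both memory fields.
        if not isinstance(fields, dict):
            continue
        if TOTAL_KEY not in fields or USED_KEY not in fields:
            continue
        total = int(fields[TOTAL_KEY])
        used = int(fields[USED_KEY])
        usage[card] = 100.0 * used / total if total else 0.0
    return usage


def query_vram_use_percent():
    """Run rocm-smi (hypothetical helper) and compute usage per card."""
    result = subprocess.run(
        ["rocm-smi", "--json", "--showmeminfo", "vram"],
        check=True,
        capture_output=True,
        text=True,
    )
    return vram_use_percent(result.stdout)
```

Keeping the parsing separate from the subprocess call means the percentage logic can be checked against saved JSON without a GPU present.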

Ref: https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/f8882d74d8749e2ad788184d624167cc326d4c2c/src/rocm_smi.cc#LL2733C40-L2733C40
https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/f8882d74d8749e2ad788184d624167cc326d4c2c/src/rocm_smi_device.cc#LL119C37-L119C37
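
As a rough sketch of the sysfs read described in this comment, the mem_busy_percent file can be read directly. This is not the library's implementation; the function name and the drm_root parameter are hypothetical, and the file is assumed to hold a single integer.

```python
from pathlib import Path


def read_mem_busy_percent(card="card0", drm_root="/sys/class/drm"):
    """Return mem_busy_percent for a card as an int, or None if unavailable."""
    path = Path(drm_root) / card / "device" / "mem_busy_percent"
    try:
        # ASSUMPTION: the file contains one integer followed by a newline.
        return int(path.read_text().strip())
    except (FileNotFoundError, PermissionError, ValueError):
        # The file may be absent, unreadable, or hold unexpected content.
        return None
```

Note that this value reports memory activity, not the fraction of VRAM allocated, which is why it can read 0 while memory is in use.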

dmitrii-galantsev commented 1 year ago

p.s. the prometheus exporter is neat!