Open parthraut opened 3 weeks ago
@parthraut Can you give me the output of amd-smi version?
I believe that https://github.com/ROCm/amdsmi/issues/22 describes the same problem. We had an issue with updating our internal tables within the same instance of amd-smi. If that is the case, the 6.1.X branch has the fix.
$ amd-smi version
AMDSMI Tool: 24.5.1+c5106a9 | AMDSMI Library version: 24.5.2.0 | ROCm version: 6.1.2
To chime in, this happens when the GPU is idle -- the energy counter does not change at all.
On the other hand, when I'm running compute on the GPU, the energy counter does increase (in both the Python interface and amd-smi), but at an extremely slow rate. I suspect the unit of the counter_resolution field could be wrong.
(The issue reported in #22 was mostly resolved by the update to ROCm 6.1.2. In particular, the average_socket_power field now updates correctly, so we can measure energy by sampling it repeatedly and integrating over time. But we also want to get the energy counter working.)
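For reference, the sampling workaround mentioned above can be sketched roughly as follows. This is only an illustration: `integrate_energy_joules` is a hypothetical helper, and the (timestamp, power) samples are assumed to have been collected elsewhere, for example by polling average_socket_power at a fixed interval.

```python
from typing import Iterable, Tuple


def integrate_energy_joules(samples: Iterable[Tuple[float, float]]) -> float:
    """Approximate energy in joules from (timestamp_s, power_w) samples.

    Uses the trapezoidal rule, so the result is only as accurate as the
    sampling interval is short relative to changes in power draw.
    """
    total = 0.0
    prev = None
    for t, p in samples:
        if prev is not None:
            t0, p0 = prev
            if t < t0:
                raise ValueError("timestamps must be non-decreasing")
            # Area of the trapezoid between two consecutive samples.
            total += 0.5 * (p0 + p) * (t - t0)
        prev = (t, p)
    return total
```

With a constant 35 W sampled at t = 0 s and t = 1 s, this yields 35 J, which is the increase the energy counter would be expected to show over the same interval.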
Problem Description
Issue with amdsmi.amdsmi_get_energy_count() Method
Description
When using the amdsmi.amdsmi_get_energy_count() method, the change in total energy consumption reported in Joules is much less than what it should be. This is evident when using the AMDSMI CLI tool to query the total energy consumption.
Observed Behavior
When running amd-smi metric -pE, the output is as follows:
GPU: 0
    POWER:
        SOCKET_POWER: 35 W
        GFX_VOLTAGE: N/A mV
        SOC_VOLTAGE: N/A mV
        MEM_VOLTAGE: N/A mV
        POWER_MANAGEMENT: ENABLED
        THROTTLE_STATUS: UNTHROTTLED
    ENERGY:
        TOTAL_ENERGY_CONSUMPTION: 16.43 J
...
After waiting for one second and retrying:
GPU: 0
    POWER:
        SOCKET_POWER: 35 W
        GFX_VOLTAGE: N/A mV
        SOC_VOLTAGE: N/A mV
        MEM_VOLTAGE: N/A mV
        POWER_MANAGEMENT: ENABLED
        THROTTLE_STATUS: UNTHROTTLED
    ENERGY:
        TOTAL_ENERGY_CONSUMPTION: 16.43 J
...
Expected Behavior
This does not make sense. Since E = P * t, at a steady 35 W the total energy consumption should have increased by ~35 J after one second, but it does not change at all.
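The comparison can be made concrete with a small sanity check. This is a sketch under stated assumptions, not the library's documented conversion: it assumes the raw counter is a tick count and that counter_resolution is expressed in microjoules per tick. The helper names `counter_to_joules` and `compare_with_expected` are hypothetical.

```python
def counter_to_joules(ticks: int, resolution_uj: float) -> float:
    """Convert raw counter ticks to joules.

    Assumption: resolution_uj is microjoules per tick. If the real unit
    differs, the result will be off by a constant factor.
    """
    return ticks * resolution_uj * 1e-6


def compare_with_expected(ticks_before: int, ticks_after: int,
                          resolution_uj: float,
                          power_w: float, elapsed_s: float):
    """Return (measured_J, expected_J, measured/expected).

    expected_J follows E = P * t for a roughly constant power draw.
    """
    measured = counter_to_joules(ticks_after - ticks_before, resolution_uj)
    expected = power_w * elapsed_s
    ratio = measured / expected if expected else float("nan")
    return measured, expected, ratio
```

A ratio that is stable but far from 1 while the GPU is under load would point toward a unit or scaling problem, whereas a ratio of 0 matches the stuck counter shown above.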
Operating System
NAME="Rocky Linux", VERSION="9.1 (Blue Onyx)"
CPU
AMD EPYC 7V13 64-Core Processor
GPU
AMD Instinct MI100
ROCm Version
ROCm 6.1.0
ROCm Component
amdsmi
Steps to Reproduce
This shell script can help replicate the issue. It runs amd-smi metric and waits 5 seconds:
With my output being:
Initial energy consumed by GPU 0: 19.748 J
Final energy consumed by GPU 0: 19.748 J
Energy consumed by GPU 0 in the last five seconds: 0 J
Expected energy consumption in last 5 seconds: 160 J
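The same check can be approximated in Python. This is not the original script, only a minimal sketch: it assumes the text output of amd-smi metric -pE contains a TOTAL_ENERGY_CONSUMPTION line as shown earlier, and it reads only the first such value (GPU 0 on a single-GPU query). The function names are hypothetical.

```python
import re
import subprocess
import time

# Matches e.g. "TOTAL_ENERGY_CONSUMPTION: 16.43 J" in the CLI output.
_ENERGY_RE = re.compile(r"TOTAL_ENERGY_CONSUMPTION:\s*([0-9]+(?:\.[0-9]+)?)\s*J")


def parse_total_energy_joules(text: str) -> float:
    """Extract the first TOTAL_ENERGY_CONSUMPTION value (in J) from text."""
    match = _ENERGY_RE.search(text)
    if match is None:
        raise ValueError("TOTAL_ENERGY_CONSUMPTION not found in output")
    return float(match.group(1))


def read_total_energy_joules() -> float:
    """Run `amd-smi metric -pE` and parse the reported total energy."""
    result = subprocess.run(
        ["amd-smi", "metric", "-pE"],
        capture_output=True, text=True, check=True,
    )
    return parse_total_energy_joules(result.stdout)


def measure_energy_delta(wait_s: float = 5.0) -> float:
    """Return the change in reported energy (J) across wait_s seconds."""
    before = read_total_energy_joules()
    time.sleep(wait_s)
    after = read_total_energy_joules()
    return after - before
```

Calling `measure_energy_delta(5.0)` on an affected system would be expected to return 0, in line with the output above.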
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
We are using the AMD HPC cluster.