ROCm / amdsmi

AMD SMI
https://rocm.docs.amd.com/projects/amdsmi/en/latest
MIT License

[Issue]: Incorrect Energy Consumption Reported by amdsmi_get_energy_count() Method #38

Open parthraut opened 3 weeks ago

parthraut commented 3 weeks ago

Problem Description

Issue with amdsmi.amdsmi_get_energy_count() Method

Description

The change in total energy consumption (in joules) reported by amdsmi.amdsmi_get_energy_count() is much smaller than it should be. The same behavior shows up when querying the total energy consumption with the amd-smi CLI tool.

Observed Behavior

When running amd-smi metric -pE, the output is as follows:

GPU: 0
    POWER:
        SOCKET_POWER: 35 W
        GFX_VOLTAGE: N/A mV
        SOC_VOLTAGE: N/A mV
        MEM_VOLTAGE: N/A mV
        POWER_MANAGEMENT: ENABLED
        THROTTLE_STATUS: UNTHROTTLED
    ENERGY:
        TOTAL_ENERGY_CONSUMPTION: 16.43 J
...

After waiting for one second and retrying:

GPU: 0
    POWER:
        SOCKET_POWER: 35 W
        GFX_VOLTAGE: N/A mV
        SOC_VOLTAGE: N/A mV
        MEM_VOLTAGE: N/A mV
        POWER_MANAGEMENT: ENABLED
        THROTTLE_STATUS: UNTHROTTLED
    ENERGY:
        TOTAL_ENERGY_CONSUMPTION: 16.43 J
...

Expected Behavior

This does not make sense. Since E = P * t, a socket power of 35 W should raise the total energy consumption by ~35 J after one second, but the reported value does not change at all.
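To make the expected figure concrete, here is a minimal Python sketch of the check, using the numbers from the two readings above. The function names are illustrative and not part of amdsmi, and the calculation assumes socket power stays roughly constant over the interval.

```python
def expected_energy_delta_j(power_w: float, seconds: float) -> float:
    """Energy in joules drawn at a constant power over an interval (E = P * t)."""
    return power_w * seconds


def counter_shortfall_j(initial_j: float, final_j: float,
                        power_w: float, seconds: float) -> float:
    """Joules by which the reported counter delta falls short of E = P * t."""
    return expected_energy_delta_j(power_w, seconds) - (final_j - initial_j)


# Readings from the two amd-smi runs above: 35 W socket power, 1 s apart.
print(expected_energy_delta_j(35.0, 1.0))            # 35.0
print(counter_shortfall_j(16.43, 16.43, 35.0, 1.0))  # 35.0
```

A shortfall equal to the full expected delta means the counter did not move at all during the interval.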

Operating System

NAME="Rocky Linux", VERSION="9.1 (Blue Onyx)"

CPU

AMD EPYC 7V13 64-Core Processor

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.1.0

ROCm Component

amdsmi

Steps to Reproduce

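This shell script can help replicate the issue. It reads the energy counter and socket power of GPU 0 with amd-smi metric, waits 5 seconds, reads the energy counter again, and compares the difference with the expected value (socket power * 5 s):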

#!/bin/bash

# Function to get the total energy consumption of GPU 0
get_energy_consumption() {
    amd-smi metric -pE | awk '/GPU: 0/,/GPU: 1/ { if ($1 == "TOTAL_ENERGY_CONSUMPTION:") print $2 }'
}

# Function to get the socket power of GPU 0
get_socket_power() {
    amd-smi metric -pE | awk '/GPU: 0/,/GPU: 1/ { if ($1 == "SOCKET_POWER:") print $2 }'
}

# Get the initial energy consumption of GPU 0
initial_energy=$(get_energy_consumption)

# Get the socket power of GPU 0
socket_power=$(get_socket_power)

# Wait for five seconds
sleep 5

# Get the energy consumption of GPU 0 after five seconds
final_energy=$(get_energy_consumption)

# Calculate the difference in energy consumption
energy_difference=$(echo "$final_energy - $initial_energy" | bc)

# Calculate the expected energy consumption over five seconds
expected_energy_consumption=$(echo "$socket_power * 5" | bc)

# Print the initial, final, and difference in energy consumption, and expected energy consumption
echo "Initial energy consumed by GPU 0: $initial_energy J"
echo "Final energy consumed by GPU 0: $final_energy J"
echo "Energy consumed by GPU 0 in the last five seconds: $energy_difference J"
echo "Expected energy consumption in last 5 seconds: $expected_energy_consumption J"

With my output being:

Initial energy consumed by GPU 0: 19.748 J
Final energy consumed by GPU 0: 19.748 J
Energy consumed by GPU 0 in the last five seconds: 0 J
Expected energy consumption in last 5 seconds: 160 J

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

We are using the AMD HPC cluster.

marifamd commented 3 weeks ago

@parthraut Can you give me the output of amd-smi version?

I believe that https://github.com/ROCm/amdsmi/issues/22 describes the same problem. We had an issue with updating our internal tables within the same instance of amd-smi. If that is the case, the 6.1.X branch has the fix.

parthraut commented 3 weeks ago

$ amd-smi version
AMDSMI Tool: 24.5.1+c5106a9 | AMDSMI Library version: 24.5.2.0 | ROCm version: 6.1.2

jaywonchung commented 3 weeks ago

To chime in, this happens when the GPU is idle: the energy counter does not change at all. On the other hand, when I'm running compute on the GPU, the energy counter does increase (in both the Python interface and amd-smi), but at an extremely slow rate. I suspect the unit of the counter_resolution field could be wrong.
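One way to test the unit hypothesis is to work out what resolution the counter would need in order to agree with E = P * t, then compare that with the reported counter_resolution. The sketch below is only an illustration: it assumes the resolution is meant to be in microjoules per count (worth checking against the amdsmi documentation), the helper names are hypothetical, and the numbers in the example are made up rather than measured.

```python
def counter_delta_to_joules(delta_count: int, resolution_uj: float) -> float:
    """Convert a raw counter delta to joules, assuming microjoules per count."""
    return delta_count * resolution_uj / 1_000_000


def implied_resolution_uj(delta_count: int, avg_power_w: float, seconds: float) -> float:
    """Resolution (microjoules per count) that would make the counter match E = P * t."""
    if delta_count <= 0:
        raise ValueError("counter did not advance; resolution cannot be inferred")
    return avg_power_w * seconds * 1_000_000 / delta_count


# Illustrative values only: a counter that advanced by 1000 counts
# while the GPU drew an average of 100 W for 1 s.
implied = implied_resolution_uj(1000, 100.0, 1.0)
print(f"implied resolution: {implied} uJ/count")
```

If the implied resolution differs from the reported one by a clean factor such as 1000 or 1,000,000, a unit mismatch would be a plausible explanation; an irregular ratio would point elsewhere.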

(The issue reported in #22 was mostly resolved by the update to ROCm 6.1.2. In particular, the average_socket_power field now updates correctly, so we can measure energy by sampling it repeatedly and integrating over time. But we would still like to get the energy counter working.)