bernhardmgruber opened this issue 2 days ago
this can also be achieved with the nvml library[...]
@ahendriksen, do you happen to know whether we can use those APIs instead of the approach outlined in the code above?
We can't: according to the API docs, nvmlDeviceGetClockInfo "Retrieves the current clock speeds for the device". That is not what we want.
Yes, but would nvmlDeviceGetCurrentClocksThrottleReasons detect what we are trying to detect?
It could. There is one issue the NVML APIs do not help with, though: checking whether something happened over time. They return an instantaneous result.
If you run a benchmark for 30 seconds, it doesn't matter what the clock throttle reason or the clock frequency is at the end of the benchmark. What matters is the average clock frequency during those 30 seconds. If you can additionally get clock throttle reasons, that would be nice, but it's not necessary. The clock throttle reason is something you want to know when debugging the hardware, for instance to determine whether throttling happened due to thermal or power constraints. It doesn't help with debugging software.
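For illustration, here is roughly what an NVML snapshot gives you (a sketch, assuming device 0 and omitting error handling; both calls report the state at the instant they are made):

```cpp
#include <nvml.h>
#include <cstdio>

int main()
{
  // Sketch only: assumes a single GPU at index 0 and skips error handling.
  nvmlInit_v2();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex_v2(0, &dev);

  // Instantaneous SM clock in MHz at the moment of the call.
  unsigned int sm_mhz = 0;
  nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);

  // Instantaneous bitmask of reasons the clocks are throttled right now.
  unsigned long long reasons = 0;
  nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);

  std::printf("SM clock: %u MHz, power cap: %s, HW slowdown: %s\n",
              sm_mhz,
              (reasons & nvmlClocksThrottleReasonSwPowerCap) ? "yes" : "no",
              (reasons & nvmlClocksThrottleReasonHwSlowdown) ? "yes" : "no");

  nvmlShutdown();
}
```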
@ahendriksen I always love how well you can specify a problem! Thx.
So let's implement the approach in the code above and see whether NVML can give us some useful additional diagnostics.
@ahendriksen sorry, why can't nvmlDeviceGetClockInfo() be used? You can use the API to get the SM clock at short intervals while the kernel is running. This is equivalent to using nvidia-smi in the background to monitor clocks, power consumption, etc.
You can use the API to get the SM clock at short intervals while the kernel is running.
As you say, it is indeed not impossible. However, it's not a great solution for several reasons: 1) you are polling the GPU during the benchmark; 2) it's error-prone (you could miss one of the 100 ms intervals, skewing the result); 3) it requires spinning up a separate thread during the benchmark.
The proposed solution only requires running a kernel once before and once after the benchmark is done. It is used widely within Nvidia, and it works.
Is there a specific reason that we would want to exhaust all other possible options before using anything but the nvml API?
I'm just trying to understand pros & cons of these approaches.
you are polling the GPU during a benchmark
You can measure clocks before and after, as in the custom kernel approach.
it's error prone (you could miss one of the 100ms intervals, skewing the result)
The polling interval is quite short, on the order of microseconds. Doing before/after is even worse, both with the custom kernel and with nvml.
Is there a specific reason that we would want to exhaust all other possible options before using anything but the nvml API?
I have seen the nvml/nvidia-smi approach be effective in benchmarking GEMMs. I'm aware that the profilers nsight-compute/system prefer to patch the binary directly.
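For concreteness, the polling approach under discussion would look roughly like this (a sketch, not NVBench code; the device index and the 100 ms sampling interval are assumptions):

```cpp
#include <nvml.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
  // Sketch only: assumes device 0, a 100 ms sampling interval, no error handling.
  nvmlInit_v2();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex_v2(0, &dev);

  std::atomic<bool> running{true};
  std::vector<unsigned int> samples;

  // Background thread that samples the SM clock while the benchmark runs.
  std::thread sampler([&] {
    while (running.load())
    {
      unsigned int sm_mhz = 0;
      nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
      samples.push_back(sm_mhz);
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  });

  // ... run the benchmark here ...

  running.store(false);
  sampler.join();

  double sum = 0;
  for (const unsigned int s : samples) { sum += s; }
  std::printf("average sampled SM clock: %.1f MHz (%zu samples)\n",
              samples.empty() ? 0.0 : sum / samples.size(), samples.size());
  nvmlShutdown();
}
```

The averaging is only as good as the sampling: missed intervals skew the result, which is the concern raised above.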
The kernel is measuring elapsed clock ticks. NVML is measuring clock frequency.
Sometimes, benchmark systems are unstable due to external factors and GPUs cannot keep up their clock frequency during a benchmark. This leads to wrong results.
NVBench should monitor the clock frequency during benchmarking and detect such conditions. One way is to query the global timer and SM clock before and after the benchmark, and compute the average frequency:
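(A minimal sketch of the idea; the helper names are placeholders, f() stands in for whatever launches the benchmarked kernel, and error checking is omitted.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the global timer (nanoseconds) and the SM clock counter (ticks).
__global__ void read_clocks(unsigned long long *out)
{
  unsigned long long time_ns;
  asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(time_ns));
  out[0] = time_ns;
  out[1] = clock64();
}

// Placeholder: stands in for the benchmark's kernel launch.
__global__ void dummy_kernel() {}
void f() { dummy_kernel<<<1024, 256>>>(); }

int main()
{
  unsigned long long *c; // [0]/[1]: time/ticks before, [2]/[3]: time/ticks after
  cudaMallocManaged(&c, 4 * sizeof(unsigned long long));

  read_clocks<<<1, 1>>>(c);     // before
  f();                          // launch the kernel to benchmark
  read_clocks<<<1, 1>>>(c + 2); // after
  cudaDeviceSynchronize();

  const double elapsed_s  = (c[2] - c[0]) * 1e-9;
  const double clock_rate = (c[3] - c[1]) / elapsed_s; // average SM frequency [Hz]
  std::printf("average SM clock: %.0f MHz\n", clock_rate * 1e-6);

  cudaFree(c);
}
```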
where f launches the kernel to benchmark. If the computed clock_rate is off from the expected value, we should issue a warning.