NVIDIA / nvbench

CUDA Kernel Benchmarking Library
Apache License 2.0

Check average clock frequency during benchmarks #193

Open bernhardmgruber opened 2 days ago

bernhardmgruber commented 2 days ago

Sometimes, benchmark systems are unstable due to external factors and GPUs cannot sustain their clock frequency during a benchmark. This leads to incorrect results.

NVBench should monitor the clock frequency during benchmarking and detect such conditions. One way is to query the global timer and the SM clock counter before and after the benchmark, and compute the average frequency:

// Record the global nanosecond timer and SM 0's cycle counter. One block is
// launched per SM so that (in the common case) one block lands on SM 0; only
// that block writes the timestamps.
__global__ void get_timestamps_kernel(uint64_t* global_timestamp, uint64_t* sm0_timestamp) {
  uint32_t smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  if (smid == 0) {
    uint64_t gts, lts;
    // %globaltimer is a nanosecond-resolution wall clock shared by all SMs.
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(gts));
    // clock64() counts elapsed clock cycles on the current SM.
    lts = clock64();

    *global_timestamp = gts;
    *sm0_timestamp    = lts;
  }
}

ulong2 get_timestamps() {
  const int device_id = 0;
  int num_sms         = 0;
  CUDA_CHECK(cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device_id));

  // Mapped host memory is directly addressable from the kernel (requires
  // unified virtual addressing, available on all modern 64-bit platforms).
  uint64_t* timestamps;
  CUDA_CHECK(cudaHostAlloc(&timestamps, 2 * sizeof(uint64_t), cudaHostAllocMapped));
  // One single-thread block per SM, so that one block runs on SM 0.
  get_timestamps_kernel<<<num_sms, 1>>>(&timestamps[0], &timestamps[1]);
  CUDA_CHECK(cudaDeviceSynchronize());
  ulong2 ret{timestamps[0], timestamps[1]};
  CUDA_CHECK(cudaFreeHost(timestamps));
  return ret;
}

float measure_clock_frequency(std::function<void(void)> f) {
  ulong2 ts0 = get_timestamps();
  f();
  ulong2 ts1              = get_timestamps();
  uint64_t elapsed_ns     = ts1.x - ts0.x; // from %globaltimer
  uint64_t elapsed_clocks = ts1.y - ts0.y; // from clock64() on SM 0
  // clocks per nanosecond equals GHz; scale by 1000 to report MHz.
  float clock_rate        = float(elapsed_clocks) / float(elapsed_ns) * 1000.f;
  return clock_rate;
}

where f launches the kernel to benchmark. If the computed clock_rate deviates from the expected value, NVBench should issue a warning.
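
For illustration, a call site might look like this. This is a hypothetical sketch: my_kernel, expected_clock_mhz, and the 5% tolerance are placeholders, not part of the proposal above.

#include <cmath>
#include <cstdio>

// Hypothetical usage sketch: my_kernel and expected_clock_mhz are made-up
// placeholders; pick the device's rated SM clock and a suitable tolerance.
void run_with_clock_check() {
  const float expected_clock_mhz = 1410.f;
  const float measured_mhz       = measure_clock_frequency([] {
    my_kernel<<<256, 256>>>();
    CUDA_CHECK(cudaDeviceSynchronize());
  });
  if (std::fabs(measured_mhz - expected_clock_mhz) / expected_clock_mhz > 0.05f)
    std::fprintf(stderr, "Warning: average SM clock %.0f MHz, expected %.0f MHz\n",
                 measured_mhz, expected_clock_mhz);
}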

fbusato commented 21 hours ago

This can also be achieved with the NVML library.

bernhardmgruber commented 5 hours ago

> This can also be achieved with the NVML library.
>
> [...]

@ahendriksen, do you happen to know whether we can use those APIs instead of the approach outlined in the code above?

ahendriksen commented 5 hours ago

We can't. According to the API docs, nvmlDeviceGetClockInfo "Retrieves the current clock speeds for the device". That is not what we want.

bernhardmgruber commented 5 hours ago

> We can't. According to the API docs, nvmlDeviceGetClockInfo "Retrieves the current clock speeds for the device". That is not what we want.

Yes, but would nvmlDeviceGetCurrentClocksThrottleReasons detect what we are trying to detect?
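
For context, querying it would look roughly like this. An untested sketch based on the NVML headers; error handling is omitted for brevity.

#include <nvml.h>
#include <cstdio>

// Untested sketch: report why device 0 is currently being throttled.
void print_throttle_reasons() {
  nvmlInit_v2();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex_v2(0, &dev);

  unsigned long long reasons = 0;
  nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);

  if (reasons & nvmlClocksThrottleReasonSwPowerCap)
    std::puts("throttled: software power cap");
  if (reasons & nvmlClocksThrottleReasonHwSlowdown)
    std::puts("throttled: hardware slowdown (thermal or power brake)");
  if (reasons & nvmlClocksThrottleReasonSwThermalSlowdown)
    std::puts("throttled: software thermal slowdown");

  nvmlShutdown();
}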

ahendriksen commented 5 hours ago

It could. There is one issue the NVML APIs do not help with, though: checking whether something happened over time. They return an instantaneous result.

If you run a benchmark for 30 seconds, it doesn't matter what the clock throttle reason or the clock frequency is at the end of the benchmark. What matters is the average clock frequency during those 30 seconds. If you can additionally get the clock throttle reasons, that would be nice, but it's not necessary. The throttle reason is what you want to know when debugging the hardware, for instance to determine whether the throttling happened due to thermal or power constraints. It doesn't help with debugging software.

bernhardmgruber commented 5 hours ago

@ahendriksen I always love how well you can specify a problem! Thx.

So let's implement the approach in the code above and see whether NVML can give us some useful additional diagnostics.

fbusato commented 4 hours ago

@ahendriksen sorry, why can't nvmlDeviceGetClockInfo() be used? You can use the API to get the SM clock at short intervals while the kernel is running. This is equivalent to using nvidia-smi in the background to monitor clocks, power consumption, etc.
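
For illustration, the polling could look roughly like this. An untested sketch: the 10 ms interval, the plain average, and the benchmark_running flag are arbitrary choices for the example.

#include <nvml.h>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Untested sketch: sample the SM clock on a background thread while the
// benchmark runs, then return the average of the samples in MHz.
float poll_average_sm_clock_mhz(std::atomic<bool>& benchmark_running) {
  nvmlInit_v2();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex_v2(0, &dev);

  std::vector<unsigned int> samples;
  while (benchmark_running.load()) {
    unsigned int clock_mhz = 0;
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &clock_mhz);
    samples.push_back(clock_mhz);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  nvmlShutdown();

  uint64_t sum = 0;
  for (unsigned int s : samples) sum += s;
  return samples.empty() ? 0.f : float(sum) / float(samples.size());
}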

ahendriksen commented 4 hours ago

> You can use the API to get the SM clock at short intervals while the kernel is running.

As you say, it is indeed not impossible. However, it's not a great solution, for several reasons: 1) you are polling the GPU during a benchmark; 2) it's error-prone (you could miss one of the 100 ms intervals, skewing the result); 3) it requires spinning up a separate thread during the benchmark.

The proposed solution only requires running a kernel once before and once after the benchmark. It is used widely within NVIDIA, and it works.

Is there a specific reason that we would want to exhaust all other possible options before using anything but the nvml API?

fbusato commented 4 hours ago

I'm just trying to understand the pros and cons of these approaches.

> you are polling the GPU during a benchmark

You can measure the clocks before and after, just as in the custom-kernel approach.

> it's error-prone (you could miss one of the 100 ms intervals, skewing the result)

The polling interval can be quite short, on the order of microseconds. Sampling only before and after is even worse, both with the custom kernel and with NVML.

> Is there a specific reason that we would want to exhaust all other possible options before using anything but the nvml API?

I have seen the nvml/nvidia-smi approach be effective in benchmarking GEMMs. I'm aware that the Nsight Compute/Systems profilers prefer to patch the binary directly.

ahendriksen commented 4 hours ago

The kernel is measuring elapsed clock ticks; NVML is measuring clock frequency. Integrating ticks over the whole run yields the true average frequency, whereas sampled frequencies only tell you about the instants at which you happened to sample.