how to handle if the tuning results are different each time

python vector_add.py

Using: NVIDIA A100-SXM4-40GB
block_size_x=128, time=0.118ms
block_size_x=192, time=0.108ms
block_size_x=256, time=0.103ms
block_size_x=320, time=0.121ms
block_size_x=384, time=0.103ms
block_size_x=448, time=0.104ms
block_size_x=512, time=0.106ms
block_size_x=576, time=0.106ms
block_size_x=640, time=0.104ms
block_size_x=704, time=0.123ms
block_size_x=768, time=0.112ms
block_size_x=832, time=0.112ms
block_size_x=896, time=0.105ms
block_size_x=960, time=0.112ms
block_size_x=1024, time=0.107ms
best performing configuration: block_size_x=384, time=0.103ms

Using: NVIDIA A100-SXM4-40GB
block_size_x=128, time=0.117ms
block_size_x=192, time=0.101ms
block_size_x=256, time=0.110ms
block_size_x=320, time=0.105ms
block_size_x=384, time=0.102ms
block_size_x=448, time=0.102ms
block_size_x=512, time=0.116ms
block_size_x=576, time=0.106ms
block_size_x=640, time=0.101ms
block_size_x=704, time=0.117ms
block_size_x=768, time=0.112ms
block_size_x=832, time=0.106ms
block_size_x=896, time=0.103ms
block_size_x=960, time=0.102ms
block_size_x=1024, time=0.102ms
best performing configuration: block_size_x=192, time=0.101ms

Using: NVIDIA A100-SXM4-40GB
block_size_x=128, time=0.112ms
block_size_x=192, time=0.110ms
block_size_x=256, time=0.102ms
block_size_x=320, time=0.111ms
block_size_x=384, time=0.110ms
block_size_x=448, time=0.118ms
block_size_x=512, time=0.118ms
block_size_x=576, time=0.107ms
block_size_x=640, time=0.119ms
block_size_x=704, time=0.114ms
block_size_x=768, time=0.109ms
block_size_x=832, time=0.112ms
block_size_x=896, time=0.118ms
block_size_x=960, time=0.104ms
block_size_x=1024, time=0.103ms
best performing configuration: block_size_x=256, time=0.102ms

Hi @jinghere11!

Several factors can cause small variations in kernel execution time. First of all, changing the thread block size in a bandwidth-bound kernel such as vector addition has little impact on performance, which is why the execution times are all so close to begin with.
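
To make the "very close" point concrete, here is a small sketch that uses the timings from the first run above and lists every block size within 5% of the fastest one. The 5% tolerance and the helper name `near_best` are my own choices for illustration, not something Kernel Tuner provides.

```python
# Timings in ms, copied from the first run posted in the question.
run1 = {
    128: 0.118, 192: 0.108, 256: 0.103, 320: 0.121, 384: 0.103,
    448: 0.104, 512: 0.106, 576: 0.106, 640: 0.104, 704: 0.123,
    768: 0.112, 832: 0.112, 896: 0.105, 960: 0.112, 1024: 0.107,
}


def near_best(timings, tolerance=0.05):
    """Return the block sizes whose time is within `tolerance` of the fastest.

    `tolerance` is a relative margin (0.05 means 5%); it is an arbitrary
    threshold chosen for this illustration.
    """
    best = min(timings.values())
    limit = best * (1 + tolerance)
    return sorted(size for size, ms in timings.items() if ms <= limit)


candidates = near_best(run1)
print(f"{len(candidates)} of {len(run1)} configurations are within 5% of the best:")
print(candidates)
```

With so many configurations effectively tied, which one gets reported as "best" can easily change from run to run.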

GPUs generally use boost frequencies when there is enough power available and the chip isn't running too hot. Once the GPU has warmed up, it may throttle its clock frequencies to prevent overheating, which is another source of jitter. One way to get more stable measurements is to fix the clock to a frequency that you believe the GPU can sustain. You can do this in Kernel Tuner by adding one more tunable parameter named nvml_gr_clock, as explained here: https://kerneltuner.github.io/kernel_tuner/stable/observers.html#tuning-execution-parameters-with-nvml This is also a recommended best practice for stable benchmarking of CUDA kernels according to NVIDIA: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9956-best-practices-when-benchmarking-cuda-applications_V2.pdf
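
As a rough sketch of what that could look like for the vector_add example: the block sizes below match the ones in your output, and nvml_gr_clock gets a single value so every configuration runs at the same clock. Several things here are assumptions on my part, so please check them against the documentation linked above: that an NVMLObserver has to be passed for the NVML parameters to be applied, that it is importable from `kernel_tuner.observers.nvml` (older releases may use a different path), and the helper names `build_tune_params` and `tune_with_fixed_clock`, which are made up for this example. Setting clocks through NVML typically also needs elevated privileges.

```python
KERNEL_SOURCE = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""


def build_tune_params(gr_clock_mhz):
    """Block sizes from the question plus one fixed graphics clock (MHz)."""
    return {
        "block_size_x": list(range(128, 1025, 64)),
        # A single value pins the graphics clock for every configuration.
        "nvml_gr_clock": [gr_clock_mhz],
    }


def tune_with_fixed_clock(gr_clock_mhz, size=10_000_000):
    """Tune vector_add with the graphics clock fixed. Needs a GPU to run."""
    # Imported lazily so build_tune_params can be used on a machine without a GPU.
    import numpy as np
    from kernel_tuner import tune_kernel
    from kernel_tuner.observers.nvml import NVMLObserver  # path is an assumption

    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    n = np.int32(size)

    # Assumption: the observer is what lets Kernel Tuner apply NVML settings;
    # "core_freq" also records the clock that was actually observed.
    observer = NVMLObserver(["core_freq"])
    results, env = tune_kernel(
        "vector_add", KERNEL_SOURCE, size, [c, a, b, n],
        build_tune_params(gr_clock_mhz), observers=[observer],
    )
    return results


# 1095 is only a placeholder; pick a value your GPU reports as supported,
# for example from `nvidia-smi -q -d SUPPORTED_CLOCKS`.
print(build_tune_params(1095))
```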

Finally, it is good to know that by default Kernel Tuner benchmarks each kernel configuration 7 times. If you still see a lot of variation in the kernel execution times on your system, you could also increase this number using the iterations= option of tune_kernel.
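
To give a feel for why more iterations help, here is a small stdlib-only simulation. It is a sketch and not a measurement: it assumes the reported time is the mean of the iterations and that the jitter is Gaussian with a standard deviation of 0.005 ms, both of which are assumptions I picked for illustration. The count of 32 is likewise just an example value.

```python
import random
import statistics


def spread_of_mean(iterations, trials=2000, true_ms=0.103, jitter_ms=0.005, seed=0):
    """Standard deviation of the averaged time over many simulated benchmarks.

    Each simulated benchmark averages `iterations` noisy samples around
    `true_ms`. The noise model and its parameters are assumptions.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        samples = [rng.gauss(true_ms, jitter_ms) for _ in range(iterations)]
        means.append(statistics.mean(samples))
    return statistics.stdev(means)


spread_7 = spread_of_mean(7)    # the default mentioned above
spread_32 = spread_of_mean(32)  # an example of a larger setting
print(f"spread with  7 iterations: {spread_7:.5f} ms")
print(f"spread with 32 iterations: {spread_32:.5f} ms")

# In the tuning script itself this corresponds to something like:
#   tune_kernel("vector_add", kernel_string, size, args, tune_params, iterations=32)
```

The trade-off is tuning time: every configuration is run that many more times, so it is worth raising the number only as far as needed to make the ranking stable.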

Let me know if this solves your problem!

Thanks for your detailed explanation! Your response not only solved my problem but was also highly educational.

KernelTuner / kernel_tuner

how to handle if the tuning results are different each time #282