Updating the sizes for this benchmark since we use different heuristics leading to kernel recompilation on A100 and H100.
Before: sizes = [4, 8, 16, 32, 64, 128]
Current: sizes = [5, 7, 9, 11]
With the new sizes, the dynamic measurement has a lower standard deviation because kernels are reused across all cases. The average measurement is ~1.8 ms, compared with ~80 ms for the earlier input sizes, which had a maximum measurement of ~400 ms.
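
As a rough illustration of why a single recompilation inflates both the average and the standard deviation, here is a minimal sketch. The `summarize` helper and the timing lists are hypothetical and made up for illustration; they are not the benchmark's code or its measured results.

```python
import statistics


def summarize(times_ms):
    """Return mean, sample standard deviation, and maximum of timings (ms)."""
    return {
        "mean": statistics.mean(times_ms),
        "stdev": statistics.stdev(times_ms),
        "max": max(times_ms),
    }


# Made-up timings, for illustration only (not measured results).
with_recompile = [2.0, 2.0, 400.0, 2.0, 2.0]  # one run pays a compile cost
kernels_reused = [2.0, 2.0, 2.0, 2.0, 2.0]    # every run reuses the kernel

print(summarize(with_recompile))
print(summarize(kernels_reused))
```

One slow run is enough to pull the mean far from the typical value, which is why sizes that avoid recompilation give a steadier measurement.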