NVIDIA / nvbench

CUDA Kernel Benchmarking Library
Apache License 2.0

Throughput statistics are not calculated when reads/writes are declared after `state.exec()` #175

Open alliepiper opened 2 months ago

alliepiper commented 2 months ago

Throughput statistics are currently computed in `measure_cold`, which is invoked during `state.exec`. As a result, no throughput statistics are generated when reads/writes are declared after `state.exec`, because the declarations arrive after the computation has already run.
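To make the ordering problem concrete, here is a minimal sketch that uses a hypothetical `mock_state` in place of `nvbench::state`. The struct, the summary keys (`avg_cuda_time`, `bytes_per_second`), the fixed placeholder time, and the two helper functions are assumptions for illustration only; they are not nvbench's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for nvbench::state; NOT the library's real class.
struct mock_state
{
  std::size_t global_bytes = 0;            // bytes declared as read/written
  std::map<std::string, double> summaries; // summary key -> value (keys are made up)

  void add_global_memory_reads(std::size_t bytes) { global_bytes += bytes; }

  // Mimics the behaviour described in this issue: the measurement runs inside
  // exec(), and throughput is derived right away from what was declared so far.
  template <typename Kernel>
  void exec(Kernel &&kernel)
  {
    kernel();
    const double avg_cuda_time_s = 1e-3; // placeholder for a measured time
    assert(avg_cuda_time_s > 0.0);
    summaries["avg_cuda_time"] = avg_cuda_time_s;
    if (global_bytes > 0)
    {
      summaries["bytes_per_second"] =
        static_cast<double>(global_bytes) / avg_cuda_time_s;
    }
  }
};

// Reads declared before exec(): the throughput summary is recorded.
inline bool throughput_recorded_when_declared_first()
{
  mock_state state;
  state.add_global_memory_reads(1024);
  state.exec([] {});
  return state.summaries.count("bytes_per_second") == 1;
}

// Reads declared after exec(): the eager computation has already run,
// so the declaration is never turned into a throughput summary.
inline bool throughput_recorded_when_declared_last()
{
  mock_state state;
  state.exec([] {});
  state.add_global_memory_reads(1024);
  return state.summaries.count("bytes_per_second") == 1;
}
```

In this mock, the first helper returns `true` and the second returns `false`, which mirrors the missing statistics reported here.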

The statistics are added here.

The only information needed from `measure_cold` is the average CUDA time. That value is already added to a summary here, so it should be possible to add a post-processing step that retrieves it and computes the throughput statistics after the KernelGenerator has finished.
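A rough sketch of what such a post-processing step might look like, again using a hypothetical `mock_state` rather than real nvbench types. The summary key `avg_cuda_time`, the placeholder time, and the functions `compute_bytes_per_second` and `run_benchmark` are all assumed names; the point is only that throughput is derived after the generator returns, so the order of declarations no longer matters.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for nvbench::state; NOT the library's real class.
struct mock_state
{
  std::size_t global_bytes = 0;            // bytes declared as read/written
  std::map<std::string, double> summaries; // summary key -> value (keys are made up)

  void add_global_memory_reads(std::size_t bytes) { global_bytes += bytes; }

  // exec() only records the average time; it no longer computes throughput.
  template <typename Kernel>
  void exec(Kernel &&kernel)
  {
    kernel();
    summaries["avg_cuda_time"] = 2e-3; // placeholder for a measured time
  }
};

// Hypothetical post-processing step: look up the stored average time and
// derive throughput from whatever has been declared by now.
inline std::optional<double> compute_bytes_per_second(const mock_state &state)
{
  const auto it = state.summaries.find("avg_cuda_time");
  if (it == state.summaries.end() || state.global_bytes == 0)
  {
    return std::nullopt; // nothing measured or nothing declared
  }
  assert(it->second > 0.0);
  return static_cast<double>(state.global_bytes) / it->second;
}

// Runs the user-provided generator, then post-processes once it has finished.
template <typename Generator>
std::optional<double> run_benchmark(Generator &&generator)
{
  mock_state state;
  generator(state);
  return compute_bytes_per_second(state);
}
```

With this arrangement, a generator that calls `exec` first and `add_global_memory_reads` afterwards still yields a throughput value, while a generator that declares no reads yields none.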