Throughput statistics are not calculated when reads/writes are declared after `state.exec()`

The current implementation computes throughput statistics in `measure_cold`, which is invoked during `state.exec()`. As a result, throughput statistics are not generated when reads/writes are declared after `state.exec()`.

The statistics are added here.

The only information needed from `measure_cold` is the average CUDA time. Because that value is already added to a summary here, it should be possible to add a post-processing step that retrieves it and computes the throughput statistics after the KernelGenerator has finished.
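A minimal sketch of the proposed post-processing idea, written against a mock rather than nvbench itself. Everything here is hypothetical and for illustration only: `mock_state`, `add_global_memory_bytes`, `compute_bandwidth`, `benchmark_body`, and the summary tags `mean_time_s` and `global_mem_bandwidth_bytes_per_s` are assumptions, not nvbench's actual classes or tag names. The point is only that if the measurement records nothing but the mean time, the throughput can be derived later, once all reads/writes have been declared.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a benchmark state; NOT nvbench's actual class.
struct mock_state
{
  std::size_t global_bytes = 0;            // total declared reads + writes
  std::map<std::string, double> summaries; // summary tag -> value

  // Stand-in for declaring global memory reads/writes.
  void add_global_memory_bytes(std::size_t bytes) { global_bytes += bytes; }

  // Stand-in for exec(): the measurement records only the mean time
  // and does not touch throughput at all.
  void exec(double measured_mean_seconds)
  {
    summaries["mean_time_s"] = measured_mean_seconds;
  }
};

// Post-processing step, intended to run after the benchmark body returns.
// It reads the mean time back from the summaries and derives the bandwidth,
// so the order of exec() and the byte declarations no longer matters.
// Returns false if there is nothing to compute.
inline bool compute_bandwidth(mock_state &state)
{
  const auto it = state.summaries.find("mean_time_s");
  if (it == state.summaries.end() || state.global_bytes == 0)
  {
    return false;
  }
  assert(it->second > 0.0);
  state.summaries["global_mem_bandwidth_bytes_per_s"] =
    static_cast<double>(state.global_bytes) / it->second;
  return true;
}

// Hypothetical benchmark body that declares its traffic AFTER exec(),
// which is the ordering described in this issue.
inline void benchmark_body(mock_state &state)
{
  state.exec(0.5);                     // pretend the kernel averaged 0.5 s
  state.add_global_memory_bytes(1024); // declared too late for exec() to see
}
```

In this sketch, a driver would call `benchmark_body(state)` and then `compute_bandwidth(state)`; the bandwidth summary is filled in even though the bytes were declared after `exec()`.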

NVIDIA/nvbench #175