The current implementation computes the throughput statistics in `measure_cold`, which is invoked during `state.exec`. This has the undesirable effect that throughput statistics are not generated when reads/writes are declared after `state.exec`. The statistics are added here.

The only piece of information that is needed from `measure_cold` is the average CUDA time. However, this information is added to a summary here, so it should be possible to add a post-processing step to retrieve this information and compute throughput statistics after the `KernelGenerator` is finished.
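A rough sketch of what such a post-processing step might look like is given below. It does not use the library's real types: `State`, `Summary`, the tag `cuda/time/mean`, the `throughput/...` tags, and every function name are hypothetical stand-ins chosen for illustration. The only point it shows is that the throughput can be derived later from the stored mean CUDA time plus whatever counts have been declared by then.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-ins for the benchmark state and its summaries.
// These are NOT the library's real types.
struct Summary
{
  std::map<std::string, double> values;
};

struct State
{
  std::int64_t element_count       = 0; // items declared per launch
  std::int64_t global_memory_bytes = 0; // declared reads + writes per launch
  std::map<std::string, Summary> summaries;
};

// Assumed tag under which the cold measurement stores the mean CUDA time (seconds).
const std::string mean_cuda_time_tag = "cuda/time/mean";

// Look up the mean CUDA time recorded during measurement, if any.
std::optional<double> find_mean_cuda_time(const State &state)
{
  const auto summary = state.summaries.find(mean_cuda_time_tag);
  if (summary == state.summaries.end())
  {
    return std::nullopt;
  }
  const auto value = summary->second.values.find("value");
  if (value == summary->second.values.end())
  {
    return std::nullopt;
  }
  return value->second;
}

double rate_per_second(std::int64_t count, double seconds)
{
  assert(seconds > 0.0); // callers must filter out empty measurements
  return static_cast<double>(count) / seconds;
}

// Intended to run after the kernel generator has returned, so that counts
// declared after the exec call are already visible on the state.
void add_throughput_summaries(State &state)
{
  const auto mean_time = find_mean_cuda_time(state);
  if (!mean_time || *mean_time <= 0.0)
  {
    return; // nothing was measured, so there is nothing to derive
  }
  if (state.element_count > 0)
  {
    state.summaries["throughput/items"].values["value"] =
      rate_per_second(state.element_count, *mean_time);
  }
  if (state.global_memory_bytes > 0)
  {
    state.summaries["throughput/bytes"].values["value"] =
      rate_per_second(state.global_memory_bytes, *mean_time);
  }
}
```

Under these assumptions, the measurement code would only need to record the mean time, and the order in which the benchmark declares its reads/writes relative to the exec call would no longer matter.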