NVIDIA / nvbench

CUDA Kernel Benchmarking Library

Investigate refined benchmark sample stopping criterion #150

Closed jrhemstad closed 7 months ago

jrhemstad commented 8 months ago

Current Approach

Currently, benchmarking can be divided into phases. Any phase may also end early if the overall timeout is reached.

First Phase

During the first phase, NVBench ensures that min-samples have been collected and that the benchmark has run for at least min-time. The stopping criterion is not checked during the first phase, since NVBench has to collect enough samples first.
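
Roughly, the first-phase check amounts to the following (an illustrative sketch, not nvbench's actual code; the names simply mirror the min-samples / min-time settings mentioned above):

```cpp
#include <cstddef>

// Illustrative sketch only -- not nvbench's actual implementation.
// Phase one keeps sampling until both floors are satisfied.
bool phase_one_done(std::size_t num_samples, // samples collected so far
                    double total_time_s,     // accumulated benchmark time in seconds
                    std::size_t min_samples, // the min-samples floor
                    double min_time_s)       // the min-time floor in seconds
{
  return num_samples >= min_samples && total_time_s >= min_time_s;
}
```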

Second Phase

During the second phase, NVBench analyzes the relative standard deviation of the samples. If the relative standard deviation is less than a noise threshold, benchmarking is stopped. I'm not aware of any principle that governs the evolution of the standard deviation. My guess is that this was initially intended as a way to reduce the effect of outliers: as the number of samples grows, the outliers' effect on the overall variance shrinks. But if a given distribution is wide or consists of multiple modes, the variance is not going to get close to zero.
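
For reference, the quantity being checked is just the relative standard deviation of the measured times. A minimal sketch follows; nvbench's actual bookkeeping may differ:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch only. "Noise" here is the relative standard
// deviation of the measured sample times.
double relative_std_dev(const std::vector<double>& times)
{
  const std::size_t n = times.size();
  if (n < 2) { return 0.0; }

  double mean = 0.0;
  for (double t : times) { mean += t; }
  mean /= static_cast<double>(n);

  double var = 0.0;
  for (double t : times) { var += (t - mean) * (t - mean); }
  var /= static_cast<double>(n - 1); // unbiased sample variance

  return std::sqrt(var) / mean;
}

// Phase two stops the benchmark once the noise drops below the threshold.
bool phase_two_done(const std::vector<double>& times, double max_noise)
{
  return relative_std_dev(times) < max_noise;
}
```

This also shows why multi-modal data defeats the criterion: with half the samples near t and half near 2t, the mean is 1.5t and the standard deviation is 0.5t, so the relative standard deviation plateaus around 33% no matter how many samples are added.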

Third Phase

The third phase is triggered once there are at least 64 samples. During this phase, NVBench checks the relative standard deviation of a noise window (up to 512 noise records) collected during the second phase. The resulting relative standard deviation of the noise window is compared against a 5% threshold; if it is below the threshold, benchmarking is stopped. Otherwise, we switch back to the second phase for the next 16 samples. The motivation is as follows: if the relative standard deviation of the samples doesn't change much as we collect more of them, the criterion of the second phase is never going to be satisfied, so we can stop the benchmark.
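
A rough sketch of the third-phase bookkeeping (reusing relative_std_dev from the sketch above; the constants 64, 512, 16, and 5% are the values quoted above, everything else is illustrative):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

double relative_std_dev(const std::vector<double>& times); // from the sketch above

// Illustrative sketch of the phase-three check, not nvbench's actual code.
struct noise_convergence_check
{
  std::deque<double> noise_window; // per-sample relative std dev records

  void record(double noise)
  {
    noise_window.push_back(noise);
    if (noise_window.size() > 512) { noise_window.pop_front(); } // keep up to 512 records
  }

  // True when the noise itself has stopped moving, i.e. the phase-two
  // criterion is unlikely to ever be satisfied.
  bool converged(std::size_t num_samples) const
  {
    if (num_samples < 64) { return false; }      // phase three not triggered yet
    if (num_samples % 16 != 0) { return false; } // re-checked every 16 samples
    std::vector<double> window(noise_window.begin(), noise_window.end());
    return relative_std_dev(window) < 0.05;      // 5% threshold
  }
};
```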

Issues of Current Approach

Min Time

The minimal benchmarking time leads to an excessive number of samples, which complicates later analysis. Below is an example showing that the relative standard deviation had barely changed after 2500 samples, yet NVBench went on to collect 22500 more.

[Figure: 01_min_time_for_small_problems]

I'd like to specify 0.0 as the minimal benchmarking time (disabling that part of the first-phase check), but this leads to the issue below.

Unreliability of Variance

Below is an example of a benchmark that satisfied the maximal-noise criterion as soon as the minimal time / minimal number of samples had been collected.

[Figure: 02_no_min_time]

Looking at the entropy graph, we can see that it kept growing. This means the benchmark continued producing unexpected samples: as we gathered more of them, the resulting distribution took a very different shape.

I take this as an indicator that looking at the variance at the beginning of the second phase is not sufficient.
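
For context, the entropy in the plot can be read as the Shannon entropy of a histogram of the sample times. A minimal sketch follows; the fixed bin count is my assumption, since the plot above doesn't specify how the entropy was computed:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch: Shannon entropy of a fixed-bin histogram of the
// sample times. Growing entropy means the samples keep spreading over
// previously unpopulated bins, i.e. the distribution keeps changing shape.
double sample_entropy(const std::vector<double>& times, std::size_t num_bins = 64)
{
  if (times.empty()) { return 0.0; }

  const auto [lo, hi] = std::minmax_element(times.begin(), times.end());
  const double width  = (*hi - *lo) / static_cast<double>(num_bins);
  if (width == 0.0) { return 0.0; } // all samples identical

  std::vector<std::size_t> counts(num_bins, 0);
  for (double t : times)
  {
    const auto bin = static_cast<std::size_t>((t - *lo) / width);
    ++counts[std::min(bin, num_bins - 1)];
  }

  double entropy = 0.0;
  for (std::size_t c : counts)
  {
    if (c == 0) { continue; }
    const double p = static_cast<double>(c) / static_cast<double>(times.size());
    entropy -= p * std::log2(p);
  }
  return entropy;
}
```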

As I noted above, the relative standard deviation of the elapsed time and its variance are not guaranteed to evolve along any particular trend. The examples below illustrate this. Also, note how the variance of the relative standard deviation (third phase) doesn't converge to a small value even though the data is noisy. These examples indicate that the variance of the relative standard deviation is not sufficient to stop a noisy benchmark.

[Figures: 03, 04, 05]

Overall, the current criterion relies on terms that are hard to reason about in the context of non-normal distributions. We should investigate alternative stopping criteria to see whether the issues above can be addressed.