bheisler / criterion.rs

Statistics-driven benchmarking library for Rust
Apache License 2.0

Support for subtractive timing model? #376

Open bluenote10 opened 4 years ago

bluenote10 commented 4 years ago

I've always been wondering whether it wouldn't make more sense to benchmark very fast operations with a subtractive timing model against a task-specific reference noop. The logic of the timing loop would be: time iters calls of the noop, time iters calls of the routine, and report the difference of the two measurements.

In terms of the timing model this would lead to a cancellation of the overhead terms:

elapsed = + (Instant::now + iters * (routine + mem::drop(O) + Range::next))
          - (Instant::now + iters * (noop + mem::drop(O) + Range::next))
        = iters * (routine - noop)

Here is an experimental implementation:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::time::{Duration, Instant};

fn bench_function_with_noop<O, N, R>(iters: u64, mut noop: N, mut routine: R) -> Duration
where
    N: FnMut() -> O,
    R: FnMut() -> O,
{
    let start = Instant::now();
    for _i in 0..iters {
        black_box(noop());
    }
    let t_noop = start.elapsed();

    let start = Instant::now();
    for _i in 0..iters {
        black_box(routine());
    }
    let t_routine = start.elapsed();

    // Durations can only be positive, so we have to live with the bias
    // of truncating the difference to zero for now...
    t_routine.checked_sub(t_noop).unwrap_or(Duration::from_nanos(0))
}

// Example usage:

#[inline]
fn noop<T>(x: T) -> T {
    x
}

fn custom(c: &mut Criterion) {
    c.bench_function("test", |b| b.iter_custom(|iters| {
        bench_function_with_noop(
            iters,
            || noop(black_box(1.0_f64)),
            || black_box(1.0_f64).abs(),
        )
    }));
}

criterion_group!(benches, custom);
criterion_main!(benches);

Does such an approach make sense? If so, could we properly support it by allowing for negative durations in the interface, or even offering versions of iter and iter_batched with noops?


Overall I have the impression that the subtractive timing model gives quite reasonable results. For instance, the following measures integer additions both with the standard model and the subtractive model:

c.bench_function("add_one_op (standard)", |b| b.iter(
    || black_box(1) + black_box(1)
));
c.bench_function("add_ten_ops (standard)", |b| b.iter(
    || black_box(1) +
        black_box(1) +
        black_box(1) +
        black_box(1) +
        black_box(1) +
        black_box(1) +
        black_box(1) +
        black_box(1) +
        black_box(1) +
        black_box(1) +
        black_box(1)
));

c.bench_function("add_one_op (subtractive)", |b| b.iter_custom(|iters| {
    bench_function_with_noop(
        iters,
        || black_box(1),
        || black_box(1) + black_box(1),
    )
}));
c.bench_function("add_ten_ops (subtractive)", |b| b.iter_custom(|iters| {
    bench_function_with_noop(
        iters,
        || black_box(1),
        || black_box(1) +
            black_box(1) +
            black_box(1) +
            black_box(1) +
            black_box(1) +
            black_box(1) +
            black_box(1) +
            black_box(1) +
            black_box(1) +
            black_box(1) +
            black_box(1),
    )
}));

Output:

add_one_op (standard)   time:   [793.89 ps 794.15 ps 794.42 ps]                                   
                        change: [-0.0526% +0.0228% +0.0977%] (p = 0.54 > 0.05)
                        No change in performance detected.

add_ten_ops (standard)  time:   [3.1608 ns 3.1613 ns 3.1619 ns]                                    
                        change: [-0.1555% -0.0772% -0.0004%] (p = 0.05 < 0.05)
                        Change within noise threshold.

add_one_op (subtractive)                                                                             
                        time:   [235.98 ps 237.43 ps 238.81 ps]
                        change: [-1.9516% -0.6050% +0.8046%] (p = 0.40 > 0.05)
                        No change in performance detected.

add_ten_ops (subtractive)                                                                             
                        time:   [2.6036 ns 2.6057 ns 2.6078 ns]
                        change: [-0.0012% +0.1454% +0.2886%] (p = 0.05 > 0.05)
                        No change in performance detected.

Addition operations may not scale exactly linearly due to register usage, but it is perhaps still a reasonable approximation.

gereeter commented 4 years ago

The fact that you are seeing such large differences is surprising, at least to me. Criterion already uses what is essentially a subtractive model, running different numbers of iterations and taking a linear regression to eliminate measurement overhead. There is still the "add one and compare to iteration count" overhead, but at that scale modern machines are not remotely serial; I would expect the branch to be perfectly predicted and the "add one and compare" to be done in parallel with the addition that you are testing. If anything I'd expect the add_ten_ops to be less than ten times the runtime of add_one_op because an unrolled loop is easier to parallelize. In general, benchmarking something so small seems fraught since there are so many more variables than "how fast is this".

That said, given the differences you're seeing, it isn't unreasonable to ask for something to get rid of the loop overhead. I wonder if the standard timing loop can be improved to handle the overhead a bit more stably. Something like

for i in 0..max_iters {
    if black_box(i) < iters { // Make sure the loop isn't split
        black_box(routine());
    }
}

would include the same max_iters "add one and compare and compare" overhead in every benchmark, which the linear regression should then remove. This is better than just subtracting noops every time because it only uses a single timing, halving the variance due to measurement overhead.

EDIT: I just tried out something like the above and it performs horribly. Subtraction is probably the best option, sadly.

bluenote10 commented 4 years ago

Criterion already uses what is essentially a subtractive model

Something like [...] would include the same max_iters "add one and compare and compare" overhead in every benchmark, which the linear regression should then remove.

Would it? My understanding was that the linear regression is a fit of elapsed time vs iterations. In terms of the timing model the coefficients are:

elapsed = m * iters + c

m ~ routine + mem::drop(O) + Range::next
c ~ Instant::now

So sampling over multiple iterations can only get rid of the Instant::now y-intercept, but there is no way to get rid of the loop overhead because it gets mixed into the slope m, which is eventually used to report the runtime per iteration.

Am I misinterpreting the fit?


My main motivation for the subtractive model was actually for the batched iteration, where the loop overhead is even bigger, and the results looked like an obvious overestimation. For instance:

c.bench_function("noop batched", move |b| {
    b.iter_batched(|| &0, |&x| black_box(x), BatchSize::SmallInput)
});
noop batched            time:   [1.1672 ns 1.1724 ns 1.1771 ns]

With the subtractive model I can use batched iteration and fully get rid of the bias.
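
Just to sketch what that could look like (this is only a rough illustration, not a proposed API: the Vec of pre-generated inputs and the black_box-only reference pass are arbitrary choices, and for simplicity it ignores batching, so it only works while iters stays small enough to materialize all inputs at once):

c.bench_function("noop batched (subtractive)", |b| {
    b.iter_custom(|iters| {
        // Generate all inputs up front, outside of the timed regions,
        // so that neither setup nor drop costs are measured.
        let inputs: Vec<&i32> = (0..iters).map(|_| &0).collect();

        // Reference pass: per-element loop overhead only.
        let start = Instant::now();
        for inp in &inputs {
            black_box(inp);
        }
        let t_noop = start.elapsed();

        // Measured pass: loop overhead plus the routine (|&x| black_box(x)).
        let start = Instant::now();
        for &inp in &inputs {
            black_box(*inp);
        }
        let t_routine = start.elapsed();

        // As above, the difference is truncated to zero if negative.
        t_routine.checked_sub(t_noop).unwrap_or(Duration::from_nanos(0))
    })
});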

In my opinion it would just be a nice feature to allow trading zero bias for increased variance and vice versa.

gereeter commented 4 years ago

Am I misinterpreting the fit?

Not at all for the standard timing loop as it is now. I was proposing increasing the time of every measurement to include more loop overhead, so that the loop overhead included in each measurement was the same constant regardless of iters:

elapsed = Range::next * max_iters + (routine + mem::drop) * iters + Instant::now

where max_iters is constant. However, when I tried this, adding all that loop overhead slowed things down immensely (as would be expected when benchmarking something comparable to the loop overhead already), so the idea is a non-starter. I was hoping to avoid a bias-variance tradeoff, but that was my only idea on how to do it.

My main motivation for the subtractive model was actually for the batched iteration, where the loop overhead is even bigger

That makes sense. I agree that the subtractive model is a useful tool to have. And despite me being surprised, you showed clearly (and it still holds true on my machine) that even when the loop overhead is tiny, it matters.

bheisler commented 4 years ago

Hey, thanks for the suggestion.

I can see how this would be useful, but I'm wary of doubling the number of Bencher functions to provide with_noop versions of everything. That's a lot of complexity to inflict on users who now have to decide whether they want to benchmark things in the easy way or the accurate way (it is of course not so simple), and what an appropriate noop would be for their case. One of the goals for Criterion.rs is that you don't need to be a statistician to use it.

On the other hand, some folks can make an informed decision about bias/variance tradeoffs and they should have the option to do so.

I try to make the common thing easy and the complex thing possible - that's why I added the iter_custom escape hatch to begin with. What about making this an external crate? You could create an extension trait that defines subtractive versions of the existing timing loops by calling iter_custom and implement it for Bencher. In that case they would just appear to be extra timing loop functions on the Bencher struct as long as the user imported your extension trait. If this turns out to be generally useful we can always evaluate whether to bring it into Criterion.rs itself later.
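
To sketch what I mean (names are placeholders, only the plain iter case is shown, and this assumes the default WallTime measurement; a real crate would want batched variants too):

use std::time::{Duration, Instant};

use criterion::{black_box, Bencher};

// Hypothetical extension trait adding a subtractive timing loop to Bencher.
pub trait SubtractiveExt {
    fn iter_with_noop<O, N, R>(&mut self, noop: N, routine: R)
    where
        N: FnMut() -> O,
        R: FnMut() -> O;
}

impl SubtractiveExt for Bencher<'_> {
    fn iter_with_noop<O, N, R>(&mut self, mut noop: N, mut routine: R)
    where
        N: FnMut() -> O,
        R: FnMut() -> O,
    {
        self.iter_custom(|iters| {
            // Time iters iterations of the reference noop.
            let start = Instant::now();
            for _ in 0..iters {
                black_box(noop());
            }
            let t_noop = start.elapsed();

            // Time iters iterations of the routine under test.
            let start = Instant::now();
            for _ in 0..iters {
                black_box(routine());
            }
            let t_routine = start.elapsed();

            // Durations can't go negative, so truncate to zero.
            t_routine.checked_sub(t_noop).unwrap_or(Duration::from_nanos(0))
        });
    }
}

With the trait imported, the example from the top of this issue would just become b.iter_with_noop(|| noop(black_box(1.0_f64)), || black_box(1.0_f64).abs()).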

gereeter commented 4 years ago

In lieu of with_noop functions, perhaps iter_batched could do something subtractive by default to reduce the per-iteration overhead, since that is where this problem becomes significant. It doesn't really need a specific noop function, since it is already doing measurements per batch, excluding drop costs; something like

let overhead_start = self.measurement.start();
for inp in inputs.iter_mut() {
    black_box(inp);
}
let overhead_end = self.measurement.end(overhead_start);
self.value = self.measurement.subtract(&self.value, overhead_end);

before the actual timing loop would deal with the problem quite nicely. I think prioritizing low bias over low variance makes perfect sense in a function that already has warnings about measurement overhead and exists for the purpose of reducing major forms of bias (setup and drop functions). I know that the Measurement trait doesn't have a subtract function, but a (hacky and, admittedly, technically numerically unstable) solution would be to tally up self.value and self.overhead_value separately and subtract the two at the end in bench, where to_f64 is called.