This section uses the following motivating example to show how you should directly put your tiniest operations (such as a single i + 10 addition) into Criterion and it will benchmark it down to the single instruction level:
fn compare_small(c: &mut Criterion) {
use criterion::black_box;
let mut group = c.benchmark_group("small");
group.bench_with_input("unlooped", 10, |b, i| b.iter(|| i + 10));
group.bench_with_input("looped", 10, |b, i| b.iter(|| {
for _ in 0..10000 {
black_box(i + 10);
}
}));
group.finish();
}
Allow me to demonstrate this is nonsense, and you're just benchmarking a pre-computed input being movd:
fn compare_nonsense(c: &mut Criterion) {
use criterion::black_box;
let mut group = c.benchmark_group("small");
group.bench_with_input("unlooped", &10, |b, i| b.iter(|| i + 10));
group.bench_with_input("looped", &10, |b, i| b.iter(|| {
for _ in 0..10000 {
black_box(i + 10);
}
}));
group.bench_with_input("unlooped_nonsense", &10, |b, i| b.iter(|| (*i as f64).sin() as i32));
group.bench_with_input("looped_nonsense", &10, |b, i| b.iter(|| {
for _ in 0..10000 {
black_box((*i as f64).sin() as i32);
}
}));
group.finish();
}
Needless to say, f64::sin does not take 300 picoseconds, it's optimized out to compute it once and re-use that computation.
It is sometimes suggested that benchmarks of small (nanosecond-scale) functions should iterate the function to be benchmarked many times internally to reduce the impact of measurement overhead. This is not required with Criterion.rs, and it is not recommended.
I've had a different experience, I would consider this bad general advice, especially when you need varying input to your function with something like iter_batched_ref which then also adds Vec::extend into your measurement, as well as a bunch of memory bandwidth.
This section uses the following motivating example to show how you should directly put your tiniest operations (such as a single
i + 10
addition) into Criterion and it will benchmark it down to the single instruction level:Allow me to demonstrate this is nonsense, and you're just benchmarking a pre-computed input being
mov
d:Needless to say,
f64::sin
does not take 300 picoseconds, it's optimized out to compute it once and re-use that computation.I've had a different experience, I would consider this bad general advice, especially when you need varying input to your function with something like
iter_batched_ref
which then also addsVec::extend
into your measurement, as well as a bunch of memory bandwidth.