bheisler / criterion.rs

Statistics-driven benchmarking library for Rust

iter_batched and iter_batched_ref add significant overhead on Windows #290

Open awygle opened 5 years ago

awygle commented 5 years ago

On Windows, iter_batched and iter_batched_ref seem to add a significant amount of overhead compared to iter. The following benchmark file:

#[macro_use]
extern crate criterion;

use criterion::{black_box, Criterion};

fn criterion_benchmark(c: &mut Criterion) {
    // Constant-folded by the optimizer; effectively measures loop overhead.
    c.bench_function("opt", |b| b.iter(|| 1234567 + 2345678));
    // Baseline: black_box prevents constant folding.
    c.bench_function("base", |b| b.iter(|| black_box(1234567) + black_box(2345678)));
    c.bench_function("batched", |b| {
        b.iter_batched(
            || (1234567, 2345678),
            |x| x.0 + x.1,
            criterion::BatchSize::SmallInput,
        )
    });
    c.bench_function("batched_ref", |b| {
        b.iter_batched_ref(
            || (1234567, 2345678),
            |x| x.0 + x.1,
            criterion::BatchSize::SmallInput,
        )
    });
}

criterion_group!(bench, criterion_benchmark);
criterion_main!(bench);

Gives the following results on Windows:

    Finished release [optimized] target(s) in 0.13s
     Running target\release\deps\repro-78fff178ef4d5c7c.exe

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target\release\deps\main-369a6cbf410c0c64.exe
Gnuplot not found, disabling plotting
opt                     time:   [297.97 ps 301.34 ps 305.48 ps]
                        change: [-2.2424% -1.2431% -0.2830%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

base                    time:   [892.70 ps 898.79 ps 905.44 ps]
                        change: [-1.0787% -0.1031% +0.8923%] (p = 0.85 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

batched                 time:   [2.1964 ns 2.2214 ns 2.2448 ns]
                        change: [-6.9395% -3.4360% +0.5293%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  10 (10.00%) low severe
  1 (1.00%) high mild

batched_ref             time:   [1.7276 ns 1.7483 ns 1.7707 ns]
                        change: [-3.5868% -0.0206% +3.7878%] (p = 0.99 > 0.05)
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  9 (9.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

Gnuplot not found, disabling plotting

But the following results on a Linux VM running on the same host PC:

    Finished release [optimized] target(s) in 0.12s
     Running target/release/deps/repro-d030cc95fdd0eae4

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target/release/deps/main-1d26b1ec3fe084cc
Gnuplot not found, disabling plotting
opt                     time:   [342.10 ps 345.92 ps 350.14 ps]
                        change: [-2.8019% -0.4614% +2.1938%] (p = 0.72 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

base                    time:   [919.58 ps 929.65 ps 940.84 ps]
                        change: [-3.8458% -1.9175% +0.1742%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

batched                 time:   [693.36 ps 713.71 ps 735.45 ps]
                        change: [-11.537% -4.4910% +3.0404%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 21 outliers among 100 measurements (21.00%)
  16 (16.00%) low mild
  5 (5.00%) high mild

batched_ref             time:   [777.45 ps 797.09 ps 819.33 ps]
                        change: [-16.002% -10.896% -5.5810%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 25 outliers among 100 measurements (25.00%)
  15 (15.00%) low mild
  8 (8.00%) high mild
  2 (2.00%) high severe

Gnuplot not found, disabling plotting

As you can see, iter_batched runs around 3 times faster on Linux than on Windows, and iter_batched_ref around twice as fast. Relative to base, iter_batched is around 2.5 times slower on Windows but around 1.3 times faster on Linux, while iter_batched_ref is about 2 times slower on Windows but around 1.2 times faster on Linux.
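For concreteness, here is a quick sketch of where those ratios come from, using the midpoint estimates copied from the logs above (in picoseconds):

fn main() {
    // Midpoint estimates from the output above, in picoseconds.
    let base_win = 898.79;
    let batched_win = 2221.4;      // 2.2214 ns
    let batched_ref_win = 1748.3;  // 1.7483 ns
    let base_linux = 929.65;
    let batched_linux = 713.71;
    let batched_ref_linux = 797.09;

    // Cross-OS: how much faster each batched benchmark runs on Linux.
    println!("batched:     {:.1}x", batched_win / batched_linux);         // ~3.1x
    println!("batched_ref: {:.1}x", batched_ref_win / batched_ref_linux); // ~2.2x

    // Within-OS, relative to `base`.
    println!("batched vs base (Windows): {:.1}x slower", batched_win / base_win);             // ~2.5x
    println!("batched vs base (Linux):   {:.1}x faster", base_linux / batched_linux);         // ~1.3x
    println!("batched_ref vs base (Windows): {:.1}x slower", batched_ref_win / base_win);     // ~1.9x
    println!("batched_ref vs base (Linux):   {:.1}x faster", base_linux / batched_ref_linux); // ~1.2x
}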

I tried this on current stable and current nightly, and with both the -msvc and -gnu toolchains on Windows, all with similar results. All measurements were taken with the latest version of criterion.

awygle commented 5 years ago

After seeing https://github.com/bheisler/criterion.rs/pull/283 I tested again with the git version of criterion. Unfortunately this didn't correct the issue - if anything, it made base slightly slower.

bheisler commented 5 years ago

Hey, thanks for trying Criterion.rs, and thanks for the bug report.

I have to admit, I'm completely mystified by that. I do all of my development on Windows (although generally under WSL rather than in Windows proper) and I've never noticed this. And yeah, now that you mention it, I do see substantial differences between WSL and Windows even when running the same benchmarks with the gnu linker.

Unfortunately, I'm not really an expert on the subtle performance differences between operating systems, so I'm not the right person to figure this out in detail. The fact that I'm seeing it in WSL, and not in a Linux VM, suggests to me that it might have something to do with the Windows vs. Linux versions of rustc rather than with the operating systems themselves, but I can't really back that up.
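One way to narrow this down might be a minimal sketch like the following, which probes raw timer cost on each platform using only std (no Criterion.rs). Note the assumption here: iter_batched does extra timer and setup bookkeeping per batch, so if this number differs a lot between the Windows and Linux builds, that bookkeeping is a plausible suspect, but that causal link is a guess, not something established in this thread.

use std::time::Instant;

fn main() {
    // Rough probe of per-call timer cost, independent of Criterion.rs.
    const N: u32 = 10_000_000;
    let start = Instant::now();
    for _ in 0..N {
        // black_box keeps the optimizer from discarding the timer calls.
        std::hint::black_box(Instant::now());
    }
    let per_call = start.elapsed().as_nanos() as f64 / N as f64;
    println!("~{:.2} ns per Instant::now() call", per_call);
}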

The only advice I can offer is that a couple of nanoseconds of difference is actually pretty low overhead for most benchmarks. The example benchmark you've shown is close to a worst case for measurement overhead; for more substantial benchmarks the overhead will drop as a percentage of the overall time, and should drop in absolute terms as well, so in practice it rarely matters. If your benchmarks are not substantially larger than the measurement overhead, then I'm afraid I'm going to have to disappoint you. Perhaps check back in the future: I do hope to add support for hardware performance counters at some point, as those are better tools for highly precise measurements of very small functions.
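To illustrate what "substantially larger" means in practice, here is a minimal sketch (the sort_1000 name and the 1000-element input are illustrative, not from this thread): with microseconds of real work per iteration, a nanosecond or two of iter_batched bookkeeping disappears into the noise.

#[macro_use]
extern crate criterion;

use criterion::{BatchSize, Criterion};

fn sort_benchmark(c: &mut Criterion) {
    // Sorting 1000 elements takes on the order of microseconds, so the
    // per-batch harness overhead is a negligible fraction of each measurement.
    let data: Vec<u64> = (0..1000).rev().collect();
    c.bench_function("sort_1000", |b| {
        b.iter_batched(
            || data.clone(),   // setup: runs outside the timed section
            |mut v| v.sort(),  // routine: the only code being timed
            BatchSize::SmallInput,
        )
    });
}

criterion_group!(benches, sort_benchmark);
criterion_main!(benches);

For routines that really are only a few instructions, falling back to b.iter with black_box, as in the `base` case above, keeps the per-iteration overhead lower than either batched variant.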