Huge variance between successive bench runs without code changes.

Hello 👋

I'm benchmarking a simple key-value store and have noticed huge variance between bench runs with no code changes. Below is the output of two successive bench runs that illustrates the variance:

    Finished bench [optimized] target(s) in 8.67s
     Running benches/db_benchmark.rs (target/release/deps/db_benchmark-bc743580edd94e51)
small_kv/put            time:   [9.8638 µs 11.195 µs 12.815 µs]
                        thrpt:  [74.420 MiB/s 85.187 MiB/s 96.684 MiB/s]
                 change:
                        time:   [+12.268% +26.990% +43.986%] (p = 0.00 < 0.05)
                        thrpt:  [-30.549% -21.254% -10.927%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
small_kv/get            time:   [195.29 µs 202.51 µs 210.08 µs]
                        thrpt:  [4.5396 MiB/s 4.7093 MiB/s 4.8834 MiB/s]
                 change:
                        time:   [+4234.8% +4648.6% +5188.6%] (p = 0.00 < 0.05)
                        thrpt:  [-98.109% -97.894% -97.693%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

First Run

    Finished bench [optimized] target(s) in 0.33s
     Running benches/db_benchmark.rs (target/release/deps/db_benchmark-bc743580edd94e51)
small_kv/put            time:   [6.6374 µs 6.9325 µs 7.3164 µs]
                        thrpt:  [130.35 MiB/s 137.57 MiB/s 143.68 MiB/s]
                 change:
                        time:   [-41.179% -33.312% -24.624%] (p = 0.00 < 0.05)
                        thrpt:  [+32.669% +49.952% +70.007%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
small_kv/get            time:   [99.470 µs 101.21 µs 103.21 µs]
                        thrpt:  [9.2401 MiB/s 9.4230 MiB/s 9.5876 MiB/s]
                 change:
                        time:   [-55.185% -52.739% -50.321%] (p = 0.00 < 0.05)
                        thrpt:  [+101.29% +111.59% +123.14%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Second Run

I have noticed that the results are correlated with the number of iterations. With many iterations (e.g. 800K), the reported throughput is considerably higher than with few iterations (e.g. 50K).
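For reference, the `thrpt` figures in the output appear to be derived directly from the `time` figures and the 1000 bytes per iteration declared with `Throughput::Bytes` in the bench code, so any shift in per-iteration time shows up inversely in throughput. A minimal sketch of that arithmetic, assuming MiB means 1024 * 1024 bytes; `mib_per_sec` is a hypothetical helper name, not part of Criterion:

```rust
/// Throughput in MiB/s when one iteration processes `bytes` bytes
/// and takes `micros` microseconds.
/// Hypothetical helper for checking the reported numbers by hand.
fn mib_per_sec(bytes: u64, micros: f64) -> f64 {
    let seconds = micros * 1e-6;
    let bytes_per_sec = bytes as f64 / seconds;
    // Convert bytes/s to MiB/s (1 MiB = 1024 * 1024 bytes).
    bytes_per_sec / (1024.0 * 1024.0)
}
```

Calling `mib_per_sec(1000, 11.195)` with the median `put` time of the first run gives a value that agrees with the reported 85.187 MiB/s to within rounding, and `mib_per_sec(1000, 202.51)` does the same for the reported 4.7093 MiB/s of `get`.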

Is there a way to control the number of iterations in order to get comparable results?
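One way I could check whether the iteration count alone explains the difference, independently of Criterion, is to time a fixed number of operations by hand. The sketch below uses only the standard library. `time_fixed` and `nanos_per_op` are hypothetical helper names, and a `HashMap` would stand in for the real store, so it only illustrates the measurement, not GhalaDB's behaviour:

```rust
use std::time::{Duration, Instant};

/// Runs `op` exactly `iters` times, passing the iteration index,
/// and returns the total elapsed wall-clock time.
fn time_fixed<F: FnMut(u64)>(iters: u64, mut op: F) -> Duration {
    let start = Instant::now();
    for i in 0..iters {
        op(i);
    }
    start.elapsed()
}

/// Mean time per operation in nanoseconds for a batch of `iters` operations.
fn nanos_per_op(total: Duration, iters: u64) -> f64 {
    total.as_nanos() as f64 / iters as f64
}
```

Running `time_fixed` once with 50_000 and once with 800_000 inserts of 1000-byte values into a fresh store, then comparing `nanos_per_op` for the two, would show whether the per-operation cost itself changes as the store grows.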

Here is my benching code in case I'm doing something wrong:

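// Imports assumed from the calls below; `GhalaDB`, `gen_bytes` and `gen_string`
// come from the crate under test and its bench helpers (not shown).
use criterion::Criterion;
use tempdir::TempDir; // assumed: `TempDir::new(prefix)` matches the `tempdir` crate

pub fn small_kv_benchmark(c: &mut Criterion) {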
    let mut rng = rand::thread_rng();
    let tmp_dir =
        TempDir::new(&gen_string(&mut rng, 16)).expect("failed to create temp dir");
    let mut db = GhalaDB::new(tmp_dir.path(), None).unwrap();

    let mut data = (0usize..)
        .map(|_| (gen_bytes(&mut rng, 36usize), gen_bytes(&mut rng, 1000usize)));

    let mut group = c.benchmark_group("small_kv");
    group.throughput(criterion::Throughput::Bytes(1000u64));
    group.bench_function("put", |b| {
        b.iter_batched(
            || data.next().unwrap(),
            |(k, v)| db.put(k, v),
            criterion::BatchSize::SmallInput,
        )
    });
    let tmp_dir =
        TempDir::new(&gen_string(&mut rng, 16)).expect("failed to create temp dir");
    let mut db = GhalaDB::new(tmp_dir.path(), None).unwrap();
    let mut keys = (0usize..1_000_000)
        .map(|_| {
            let (k, v) =
                (gen_bytes(&mut rng, 36usize), gen_bytes(&mut rng, 1000usize));
            db.put(k.clone(), v).ok();
            k
        })
        .collect::<Vec<_>>();
    keys.sort_unstable();
    let mut keys = keys.into_iter();
    group.bench_function("get", |b| {
        b.iter_batched(
            || keys.next().unwrap_or_else(|| gen_bytes(&mut rng, 36usize)),
            |k| db.get(&k),
            criterion::BatchSize::SmallInput,
        )
    });
    group.finish();
}

bheisler / criterion.rs

Huge variance between successive bench runs without code changes. #739