bheisler / criterion.rs

Statistics-driven benchmarking library for Rust
Apache License 2.0

Allow for Post-benchmarking Summarization #132

Open sfleischman105 opened 6 years ago

sfleischman105 commented 6 years ago

Criterion has been a delight to use so far in my chess engine project, pleco. I'm currently in the process of transitioning to Criterion from the standard library's benchmarking harness.

One thing I think Criterion is missing is a way to display results concisely. The standard library benchmarks simply output a single number per benchmarked function, which made it very easy to see the exact time for each benchmark. Criterion prints large blocks of text, and while the information is useful, it's hard to read through quickly.
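
For context, the standard harness prints roughly one line per benchmark, along these lines (an invented example, not actual pleco output):

test bench_board_construction ... bench:       5,146 ns/iter (+/- 142)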

Perhaps an (optional) post-benchmarking summary could be of use. After every benchmark has been run, Criterion could print a summary of each benchmark result, similar to the standard library.

bheisler commented 6 years ago

Hey, thanks for the suggestion. I'm glad you like Criterion.

The existing output format is already intended to be easy to read, but I agree it could do better on that front.

BurntSushi commented 6 years ago

As a work-around to this, I've been using something like this:

$ cargo bench -- --verbose | tee bench.log | rg 'time:|thrpt:'

That pretty closely captures what the standard library's benchmark harness shows you. It uses two lines per benchmark instead of one, but it has much less noise and still saves the more informative output to disk for closer scrutiny.

bheisler commented 6 years ago

I'm curious - which parts of the output do you see as noise and why? The outlier information is not often useful, I suppose. I rely on the percentage-changed and regressed/unchanged/improved display pretty heavily, but you've filtered those out in your example.

BurntSushi commented 6 years ago

@bheisler Great question! So, firstly, I want to say that "noise" was probably a poor word choice on my part. I actually very much like all of the information printed by criterion. It's extremely useful context, and it's why I run with --verbose to get the extra bits about standard deviation, which is also really useful.

However, sometimes I have a lot of benchmarks. For the project I'm working on right now, I count 69 of them, and I suspect that number will grow. Benchmarks can multiply rapidly because of combinatorial factors in the thing you're trying to measure (input size, algorithm, and so on). So basically, when I do a benchmark run, what I'd like is a dense bird's-eye view of what happened, ideally with each benchmark occupying about one line. If I could pick precisely the information I'd want on each line, I think I'd choose something like this:
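
A made-up illustration of that one-line idea, combining the name, time, standard deviation, and percent-change/improved display mentioned above (the names and numbers here are invented):

bench_group/size_4096    5.823 us (+/- 0.12)    -3.2% (improved)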

That way, I could very quickly scan all of the benchmarks to get a general idea of what happened. If there was no percent change, then I probably don't need to investigate too much more. But if there is, then I can go look at the report criterion currently generates and/or the graphs, which are really really useful.

Hopefully that clarifies things! I'm happy to answer more questions.

gnzlbg commented 6 years ago

I would also like to be able to emit a "summarized" output like cargo bench does, or maybe even something in between that and cargo-benchcmp (e.g. when comparing against a baseline). Ideally with some CLI options to control what gets reported (e.g. I often only care about how fast the fastest invocation was, rather than the mean).

anderspapitto commented 5 years ago

Would also like this. One note is that I've been using something akin to @BurntSushi's

$ cargo bench -- --verbose | tee bench.log | rg 'time:|thrpt:'

but if the benchmark name is too long, it's no longer on the same line as time:.
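
One possible tweak (untested, and assuming the wrapped name lands on the line just before time:) is to ask ripgrep for one line of leading context, at the cost of some extra context lines for short names:

$ cargo bench -- --verbose | tee bench.log | rg -B 1 'time:|thrpt:'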

In terms of motivation: one use for me is to look for a pattern across a set of parameterized benchmarks, i.e. something like "at what vector size does this operation fall out of cache and slow down?", and that's easiest when there's just one vertical column of numbers to compare, without delving into plots and so on.

BurntSushi commented 5 years ago

I forgot to update this issue, but I've mostly addressed my desire here with a tool that reads criterion's output: https://github.com/BurntSushi/critcmp
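
For anyone finding this later, the usual workflow with critcmp leans on criterion's named baselines, roughly like this (the baseline names are arbitrary; this follows the pattern in critcmp's README):

$ cargo bench -- --save-baseline before
$ # ...make your change...
$ cargo bench -- --save-baseline after
$ critcmp before after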

YangKeao commented 5 years ago

@bheisler I have noticed that criterion provides a Report trait. Should we add a method for pushing additional reports onto the reports list? Maybe simply:

impl Criterion {
    /// Register an extra reporter; this assumes Criterion holds a Reports
    /// value that wraps the list of boxed Report implementations.
    fn add_report(&mut self, report: Box<dyn Report>) {
        self.reports.push(report);
    }
}

impl Reports {
    /// Append a reporter to the underlying Vec<Box<dyn Report>>.
    fn push(&mut self, report: Box<dyn Report>) {
        self.reports.push(report);
    }
}

My only doubt is whether the Report trait is stable. But I don't think that concern outweighs the benefit this method would provide.
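
Purely as a sketch of how that hook might be used (MySummaryReport is a hypothetical user-defined type, and this assumes the add_report method above existed and the Report trait were public):

// hypothetical: register a user-defined reporter implementing the Report trait
let mut criterion = Criterion::default();
criterion.add_report(Box::new(MySummaryReport::new()));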

bheisler commented 5 years ago

The Report trait is not stable, no. The main thing holding this up is the design work necessary to define a trait that makes sense for reports.

tgross35 commented 2 years ago

Just adding in a shell script I use that may be helpful to others. It just wraps the output of cargo bench, but names the file based on git describe output (e.g., v2.1.2_2022-08-23_1647.bench) and adds CPU info, which is helpful if you need to send or archive benchmarks that may have been run on different machines.

#!/bin/sh

# need path-safe UTC datetime
dtime=$(date +"%Y-%m-%d_%H%M" --utc)
describe=$(git describe --always --tags)
fname="benches/results/${describe}_${dtime}.bench"

# Build a command that records toolchain and CPU information, then runs the benchmarks
cmd="echo Benchmark from $dtime on commit $describe;"
cmd=${cmd}"rustc --version;"
cmd=${cmd}"printf '\n';"
cmd=${cmd}"echo CPU information:;"
cmd=${cmd}"lscpu | grep -E 'Architecture|Model name|Socket|Thread|CPU\(s\)|MHz';"
cmd=${cmd}"printf '\n\n\n';"
cmd=${cmd}"cargo bench $*;"

eval "$cmd" | tee "$fname"