benchmark-action / github-action-benchmark

GitHub Action for continuous benchmarking to keep performance
https://benchmark-action.github.io/github-action-benchmark/dev/bench/

[Rust] Will this work with Criterion? #8

Open jasonwilliams opened 4 years ago

jasonwilliams commented 4 years ago

A lot of Rust projects use Criterion, which overrides cargo bench and may output something different. Would this project work with that?

What does the output need to look like?

For example, I'm using https://github.com/jasonwilliams/criterion-compare-action, which is a wrapper around https://github.com/bheisler/criterion.rs

This is the output from criterion on https://github.com/jasonwilliams/boa:

Create Realm            time:   [389.43 us 391.46 us 393.64 us]
                        change: [+2.9053% +3.3116% +3.7569%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

Symbol Creation         time:   [476.63 us 478.43 us 480.62 us]
                        change: [-8.0982% -7.3473% -6.5933%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

     Running target\release\deps\fib-b7cfe35d42f9c87b.exe
Benchmarking fibonacci (Execution): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 21.1s or reduce sample count to 30
fibonacci (Execution)   time:   [4.0417 ms 4.0513 ms 4.0612 ms]
                        change: [-3.1531% -2.4175% -1.7071%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

     Running target\release\deps\parser-ab35171b3898e94f.exe
Expression (Parser)     time:   [15.656 us 15.673 us 15.693 us]
                        change: [-0.7893% -0.5205% -0.2658%] (p = 0.00 < 0.05)
                        Change within noise threshold.

     Running target\release\deps\string-bdac66e3ac549711.exe
Hello World (Execution) time:   [391.07 us 394.24 us 397.40 us]
                        change: [-3.0040% -2.2291% -1.5351%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) high mild
  12 (12.00%) high severe

Hello World (Lexer)     time:   [1.9896 us 1.9961 us 2.0032 us]
                        change: [-3.2960% -2.6175% -2.0389%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

Hello World (Parser)    time:   [3.8064 us 3.8102 us 3.8141 us]
                        change: [-1.0105% -0.2551% +0.3856%] (p = 0.49 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

https://bheisler.github.io/criterion.rs/book/user_guide/command_line_output.html

rhysd commented 4 years ago

I'm not currently a user of Criterion (since I could not run it on my local macOS). However, judging from your output example, we can extract the benchmark results and put the remaining output into the extra field. Your contribution would be more than welcome: https://github.com/rhysd/github-action-benchmark/blob/master/CONTRIBUTING.md

jasonwilliams commented 4 years ago

Thanks @rhysd, I had a go here, though I'm not sure how to fill in the range: https://regex101.com/r/8mrOSE/5

I suppose the range would be high-value minus low-value, taken from the [low-value median-value high-value] triple.
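As a rough illustration (not this action's actual parser), a single result line could be matched with a regex and the range computed as high minus low; the grouping and field names below are illustrative only:

// A sketch of parsing one Criterion CLI result line into name/value/range.
// The regex and output shape here are illustrative, not the action's parser.
const line = 'Create Realm            time:   [389.43 us 391.46 us 393.64 us]';
const m = line.match(/^(.+?)\s+time:\s+\[([0-9.]+) (\S+) ([0-9.]+) \S+ ([0-9.]+) \S+\]/);
if (m) {
    const [, name, low, unit, median, high] = m;
    console.log({
        name: name.trim(),
        value: parseFloat(median),                 // the middle estimate
        range: parseFloat(high) - parseFloat(low), // high-value minus low-value
        unit,
    });
}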

CC @bheisler

bheisler commented 4 years ago

Hey, Criterion.rs maintainer here. The command-line output isn't really intended for programmatic access so the format isn't guaranteed to stay stable. I don't currently plan to change it, so this is more of a heads-up that you may need to fix it if it changes in the future.

Criterion.rs generates a file that is designed for programs rather than humans (see here for more on that) but it contains the raw measurements, not the statistics that Criterion.rs calculates from them, so if you wanted to use that you would need to decide on how you want to summarize those numbers. If you go that direction, I would recommend also reading up on the measurement and analysis process to clarify what some of those numbers mean.

Unrelated: @rhysd - if it's not too much trouble, would you mind explaining your difficulties with Criterion.rs on macOS, or perhaps opening an issue on the Criterion.rs repository? I don't have a Mac to test with but it ought to work.

rhysd commented 4 years ago

@bheisler Thank you for the comment. I did not know that it supports a machine-readable format. That format would be preferable.

if it's not too much trouble, would you mind explaining your difficulties with Criterion.rs on macOS

I'm sorry, I can't remember clearly, but when I tried the Criterion example at the time, the benchmark program crashed with a SEGV. Since that was about 6 months ago and I was on an older macOS (10.12), it might work fine now.

bheisler commented 4 years ago

OK, thanks. If you decide to try it again and run into the same problem, please feel free to file an issue.

rhysd commented 4 years ago

I'm now trying to take over @jasonwilliams's work. I'll let you know if I run into any issues. Thank you for your help.

rhysd commented 4 years ago

I tried to implement a simple calculation based on the analysis process document:

const { linear } = require('regression');
const fs = require('fs');

// In Criterion.rs's raw.csv, column 5 is the measured time for one sample
// and column 7 is the number of iterations in that sample.
const lines = fs.readFileSync('target/criterion/Bench Fib 10/new/raw.csv', 'utf8').split('\n');
const measurements = lines
    .slice(1) // skip the CSV header row
    .filter(l => l !== '')
    .map(l => {
        const data = l.split(',');
        const measured = parseFloat(data[5]);
        const iter = parseInt(data[7], 10);
        return [iter, measured];
    });

// Track the smallest and largest per-iteration time across all samples.
let max = measurements[0][1] / measurements[0][0];
let min = max;
for (const m of measurements.slice(1)) {
    const v = m[1] / m[0];
    if (v > max) {
        max = v;
    }
    if (v < min) {
        min = v;
    }
}

console.log('min:', min);
// Slope of the iterations-vs-time regression line, per Criterion's analysis doc.
console.log('estimate:', linear(measurements).equation[0]);
console.log('max:', max);

and output was:

min: 160.11831801157643
estimate: 171.46
max: 196.80116265096436

Criterion.rs's output was

Bench Fib 10            time:   [169.75 ns 171.26 ns 173.01 ns]

so I think there is a difference in how the estimate is calculated, and the min/max seem wrong. (Criterion.rs's [low high] bounds are a bootstrapped confidence interval around the estimate, not raw extremes, which would explain why its interval is much tighter.)
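For illustration, a minimal bootstrap over the per-iteration times (not Criterion.rs's exact algorithm, which bootstraps statistics such as the regression slope) would look roughly like this, reusing the measurements array from the script above:

// A sketch of a bootstrap confidence interval over per-iteration times.
// Assumes the `measurements` ([iterations, time] pairs) built earlier.
function bootstrapCI(values, reps = 1000, alpha = 0.05) {
    const means = [];
    for (let i = 0; i < reps; i++) {
        let sum = 0;
        for (let j = 0; j < values.length; j++) {
            sum += values[Math.floor(Math.random() * values.length)];
        }
        means.push(sum / values.length); // mean of one resample
    }
    means.sort((a, b) => a - b);
    return [
        means[Math.floor(reps * (alpha / 2))],     // lower percentile bound
        means[Math.floor(reps * (1 - alpha / 2))], // upper percentile bound
    ];
}

const perIter = measurements.map(([iter, t]) => t / iter);
console.log(bootstrapCI(perIter)); // a much tighter interval than raw min/max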

Hmm, I'm not confident I can maintain this logic. After all, I feel using the CLI stdout might be better, though it may change without a major version bump of Criterion.rs.

The CSV format document says:

The results of Criterion.rs' analysis of these measurements are not currently available in machine-readable form. If you need access to this information, please raise an issue describing your use case.

so we might adopt some machine-readable summary output in the future.

jasonwilliams commented 4 years ago

so we might adopt some machine readable output of summary in the future.

Yeah, a good idea would be to have github-action-benchmark accept the CLI output for now, and raise a separate issue for migrating to the CSV-based calculation; it's a beefy task and should be worked on separately. The CLI output format won't change any time soon, so we have time to do this properly.

pksunkara commented 4 years ago

Or Criterion could add the calculated info to the machine-accessible file. @bheisler What do you think?

Or maybe it could support the normal cargo bench format with an option? We want to use this action in the https://github.com/clap-rs/clap repo.

bheisler commented 4 years ago

Hey folks. I've been thinking about adding better machine-readable output to this cargo-criterion subcommand I've been working on. Specifically, this would include the calculated statistics, not just the raw measurements.

I'm thinking it would be broadly similar to Cargo's existing --message-format json option. You'd run cargo criterion --message-format json and it would emit JSON messages. Cargo sends the messages over stdout and prints all of its actual output on stderr. I could probably have cargo-criterion do the same, but I'm also considering using a TCP socket on localhost, still undecided there. Naturally, you would have to cargo install cargo-criterion first.
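Consuming such a stream from Node might look roughly like this (the field names are purely hypothetical; as noted below, the message format was still undecided):

// A hypothetical consumer, assuming newline-delimited JSON on stdout in the
// style of `cargo --message-format json`. The `reason`, `id`, and `typical`
// fields are illustrative guesses, not a documented schema.
const { spawn } = require('child_process');
const readline = require('readline');

const child = spawn('cargo', ['criterion', '--message-format', 'json']);
const rl = readline.createInterface({ input: child.stdout });

rl.on('line', line => {
    const msg = JSON.parse(line);
    if (msg.reason === 'benchmark-complete') {
        // Only some "typical" estimate would be guaranteed; the rest optional.
        console.log(msg.id, msg.typical);
    }
});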

The messages themselves are still TBD, but I do want to keep my options open for the future, so I'm thinking that the only field guaranteed to be there would be some estimate of a typical value. All of the other stats (mean, variance, MAD, confidence bounds, etc.) would be officially documented as optional so that I still have the option to not report them. For example, I'm imagining an alternate sampling method for long-running benchmarks where we just run the benchmark, say, 100 times and record each individual time. Those measurements would produce a different set of statistics, and I wouldn't want to break downstream code by only sometimes reporting, say, the slope measurements.

I'm also thinking I might set it up to parse bencher's statistics from its stdout output - then I could incorporate bencher benchmarks into cargo-criterion's reports and JSON messages. But of course bencher reports far fewer statistics than Criterion.rs does, so there would have to be a lot of things missing from those JSON messages if I did that.

None of this is a promise, mind - I'm still designing a lot of this stuff and figuring out what makes sense. bencher support is particularly dubious, but it's a possibility.

Would that work for what you folks need? Do you have any comments or concerns with this idea so far? Is there anything in particular you want to have in the message format?

pksunkara commented 4 years ago

@bheisler This is already solved by the bencher-compatible output Criterion provides. It's the reason I asked you to add it originally.
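For reference, the usual recipe with the bencher-compatible output is to pipe it to a file and point this action's cargo tool at it; roughly (check both projects' READMEs for the exact invocation):

# Emit bencher-style lines that github-action-benchmark's `tool: 'cargo'` parser can read.
cargo bench -- --output-format bencher | tee output.txt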

brainstorm commented 1 year ago

@bheisler Support for cargo-criterion has been implemented/tested in #145.

epompeii commented 1 year ago

If anyone else is blocked by this issue or by https://github.com/benchmark-action/github-action-benchmark/pull/145, Bencher supports Criterion: https://github.com/bencherdev/bencher#supported-benchmark-harnesses