alamb commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. As we make changes to DataFusion, some changes impact performance and some do not. Right now we mostly rely on reviewers to judge when a change could make an impact on performance, and if so run the appropriate benchmarks.

This means that

We may miss some performance regressions (such as https://github.com/apache/arrow-datafusion/issues/5325)
Since the benchmarks are not run regularly it is hard to know how to interpret results, and some seem to have bitrotted over time
The wide variety of available benchmarks (e.g. https://github.com/apache/arrow-datafusion/issues/5502) makes it hard to know which ones to run and how to determine if performance has improved or regressed for particular changes

Describe the solution you'd like I would like

A system that runs DataFusion benchmarks regularly on main
Some automated way to see if a particular PR has improved or regressed performance
Bonus: a webpage that shows performance over time. Databend has a great example https://perf.databend.rs/

Suggestion

I believe conbench, https://conbench.ursa.dev/, which is partially integrated into the repo already, is intended for exactly this usecase. Using conbench would be nice as it appears to be actively maintained and has resources and is already hosted

The integration is https://github.com/apache/arrow-datafusion/tree/main/conbench and was added in https://github.com/apache/arrow-datafusion/pull/1791 by @dianaclarke

You can see its integration as it posts comments on PRs after merge such as https://github.com/apache/arrow-datafusion/pull/5476#issuecomment-1456748444

Describe alternatives you've considered We could use existing timeseries databases and visualizations like grafana to visualize the information

Additional context

andygrove commented 1 year ago

I had started working on this, not just for DataFusion, but for other OSS query engines as well.

See https://sqlbenchmarks.io/sqlbench-h/results/env/workstation/sf10/single_node/ for an example.

I was planning on automating this to run nightly, or on each merge to master, but have not found the time to do this yet.

ozankabak commented 1 year ago

I was planning on automating this to run nightly, or on each merge to master, but have not found the time to do this yet.

This would be fantastic!

epompeii commented 1 year ago

How are you considering running the benchmarks in CI @andygrove ? I've been working on a tool similar to conbench called Bencher (https://github.com/bencherdev/bencher), and I've been trying to explore the best ways to do continuous benchmarking.

alamb commented 1 year ago

How are you considering running the benchmarks in CI @andygrove ?

Using CI in general would be ideal.

Using the github runners is probably not a great idea as they are quite variable (they are on shared machines and there isn't any visibility into what else is going on on those machines / VMs)

alamb commented 1 year ago

I was thinking about this the other day -- and it seems to me this might be a perfect usecase for a timeseries database (which full disclosure we have at InfluxData). I was thinking that one way to record the history over time is use a widely supported format like LineProtocol -- https://docs.influxdata.com/influxdb/cloud-iox/reference/syntax/line-protocol/

Then we can visualize and display that data over time using existing tools like grafana. Also the "alert if things get slower" sounds a lot like the kinds of alerts used in timeseries databases.

I'll try and whip up some way to convert benchmark results into line protocol over the next few weeks

alamb commented 1 year ago

My high level plan is something like:

[ ] https://github.com/apache/arrow-datafusion/issues/6107
[ ] Write a script to run the benchmark and generate line protocol with a specific git repo / sha
[ ] write up instructions on how to visualize that data (maybe locally with grafana)

epompeii commented 1 year ago

@alamb this is effectively what I have built with Bencher. It tracks benchmarks over time, allows you to visualize the results (and share them as an auto-updating README image), and uses statistical thresholds to generate alerts, in the case of performance changes. There's a CLI and a GitHub Action to make it easier to run both locally and in CI.

Using CI in general would be ideal.

Definitely! As for compute, I've heard folks mention https://buildjet.com as an option. In the long run, I'm looking to add AWS Bare Metal runners to Bencher to help with this end of things. So please let me know if you are interested in going that direction.

alamb commented 1 year ago

So I can find the compute resources to run this, and I think we can use existing timeseries databases to track performance over time (e.g. IOx or grafana). The only missing piece is getting the data into a format that can be loaded into these systems

I think we could get a simple version of this feature going like:

Invoke bench.sh for the benchmarks of interest
Use https://github.com/apache/arrow-datafusion/issues/6107 to convert the existing bench output to lineprotocol (I wrote up more details of this transformation)
Check in the line protocol somewhere
Upload lineprotocol to influxdb or some other system that supports it to make some simple visualizations

alamb commented 1 year ago

It seems as if conbench is now more actively maintained that it once was and claims to have rust support: https://github.com/conbench/conbench

Perhaps it is worth another look

Smurphy000 commented 1 year ago

I am interested in spending some time with helping out on benchmarking. Currently checking out conbench, but if another solution is needed I can look into that as well.

I'm interested to know how many permutations we need to track continually (eg. machine size, simd or not, etc).

Looking for any additional thoughts / guidance besides what I have read so far in this issue.

alamb commented 1 year ago

Hi @Smurphy000 that would be amazing 🙏 . This is one of the issues I think is critical to the long term success of DataFusion but has been hard to attract attention for.

The key, in my mind, is to minimize the complexity and infrastructure requirements of this solution, as DataFusion doesn't (yet) have the kind of resources to keep a custom system operating.

Step 1: transform benchmark data for graphing

I first recommend checking out https://github.com/apache/arrow-datafusion/issues/6107 and seeing if you can write a python script / rust program that takes the json output of a benchmark run and makes a single line for each query run with the relevant parameters. That issue has example data and desired output. You might also have to extend the rust benchmark runner.

In terms of implementation, I suggest starting with one setup only (InfluxData can supply a machine / VM initially -- likely a 8core, 32GB of ram machine) and then we can expand the tested combinations as our needs do as well

Step 2: script to gather data for each commit

So in my mind, the ideal solution looks like:

A runner script that runs the benchmarks, and appends the results to some sort of text file (ideally https://github.com/apache/arrow-datafusion/issues/6107) that we can check in and that is easy to visualize
Written in one of the existing languages used in the DataFusion repo: python, bash, or rust

If you fancy a bit of bash scripting, maybe you could potentially start with bench.sh and extend it to check out the desired SHAs (I could do do this part too, if you were able to do https://github.com/apache/arrow-datafusion/issues/6107)

Here is one way a testing session might look

# setup, fetch all needed data files
./benchmarks/bench.sh data 

# Run tpch benchmarh on commit 4819e7a, 
# leave results in ./benchmarks/results/4819e7a
# Would add appropriate lines to  ./benchmarks/results/history.lp
./benchmarks/bench.sh run tpch --commit 4819e7a

Then we could write up instructions on how to visualize the data in history.lp (I would probably use influxdb/grafana as that is what I know)

Does that make sense?

epompeii commented 1 year ago

@alamb the port of InfluxDB over to Rust is super cool. Congrats! I'm considering using it long term, if/when Bencher needs a supplemental backend for results storage.

Before @Smurphy000 goes reinventing the wheel here though, I just wanted to point out that Bencher handles all of this out of the box. All you would need to do is add a flag to your existing runner to output JSON in the expected format, and you're set: https://bencher.dev/docs/explanation/adapters#-json Bencher would then handle Step 1 (example of continuous benchmark dogfooding with Bencher) And the Bencher CLI makes it trivial to handle Step 2: bencher run "./benchmarks/bench.sh data" Then you could also detect performance regressions in CI and even have it comment on PRs: https://bencher.dev/docs/explanation/thresholds/

As far as the orchestration of jobs from GitHub -> dedicated VM, this is something I am actively exploring making easier to do. I'm looking at creating an extension to Bencher for creating GitHub Actions Self-Hosted runners: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners I'm also open to other ideas if you all have any!

alamb commented 1 year ago

@epompeii -- I would love it if someone could explore the possibility -- I haven't had the time to look into whatever bencher supports. There are many different continious benchmarking frameworks / services and the reason I think people keep reinventing the wheel is that the mental effort to integrate with the frameworks (e.g. understanding the bencher json format , understanding what a slug is, or how to map important metadata like "num cpus" to this model) is often larger than creating something custom (or it least appears so from the outside)

In my opinion this is why we don't use conbench -- no one has invested enough time to meld that model with the datafusion numbers

I agree it is infuriatingly repetitive. However unless you (or someone else) directly integrates DataFusion with one of these frameworks in a way that works, I am going to expend my effort explaining what is easiest for me (aka tools I already know), unfortunately.

Smurphy000 commented 1 year ago

Currently I am looking into the few options we currently have (conbench, bencher, custom) and I can update here with my findings so some agreement on the best option can be made. I am starting with @alamb idea selfishly to learn a few new things, but I definitely want to take a look into existing frameworks!

epompeii commented 1 year ago

@alamb that totally makes sense. I'm the maintainer of Bencher, so if you or @Smurphy000 would like a walk through/to hop on a call, that may be a pretty quick way to map concepts. Long term, I'm hoping that the docs get good enough to make this mapping super easy and intuitive. That's still a work in progress though 😃

To answer your specific example about "num cpus" though, that would be metadata about the Testbed (https://bencher.dev/docs/explanation/benchmarking/#testbed). Currently the best way to have that available in Bencher itself is to include it in the name. You can compare benchmarks results by Testbed. (ex "Debian x86 4 CPU 8GB" vs "Debian x86 8 CPU 16GB")

And definitely explore all your options @Smurphy000 ! If you haven't seen it already, I've compiled a pretty comprehensive list of prior art in the space, that may be helpful here: https://bencher.dev/docs/reference/prior-art/ Let me know if you find anything that you think I should add!

alamb commented 8 months ago

@gruuya and I were talking about this project recently and he said he may have some ideas on how to proceed / push it forward.

gruuya commented 8 months ago

Ok, I think I'm up to speed with the history of this issue now.

To me it seems that there are 2 separate action items discussed here:

CI benchmark that will comment on PRs once merged to main, and possibly on open PRs after invoking some hook (points 1 and 2 in the issue description). This will be just a summary of the perf comparison between the base commit and the current/merge commit. This might be done using github shared runners, as presumably the variance wouldn't be too wild during the same run.
As an extension of 1. above, a way to persist this info, and ideally visualize it over time. This would involve a bit more effort, as it would require setting up (and maintaining) dedicated infra to control for the variance. I guess the idea was to be checking in the benchmark numbers into some(this?) github repo, as opposed to a cloud DB somewhere?

Also looks like both conbench and bencher are eligible to handle the above.

IMO it's best to try and tackle 1. for now, as this would present a big win, and would pave the way to 2 as well.

epompeii commented 8 months ago

@gruuya I would be more than happy to help with the integration of Bencher, which could actually help with both 1. and 2. I set up a similar approach to 1. for Diesel recently: https://github.com/diesel-rs/diesel/pull/3849 It used relative benchmarking, which looks something like this: https://bencher.dev/docs/how-to/track-benchmarks/#relative-benchmarking So you only compare the base commit and the current/merge commit on the same runner, like you want.

For 2. we would just benchmark on pushes to main and you would be able to visualize them with a Bencher Perf Page, this in an example for the Rustls project.

alamb commented 8 months ago

IMO it's best to try and tackle 1. for now, as this would present a big win, and would pave the way to 2 as well.

I agree this is a good idea / approach.

Note I tried to figure out conbench in the past and I didn't get very far -- though I didn't spend a huge amount of focused time on it.

gruuya commented 8 months ago

Ok, so I went ahead and took a stab at 1. https://github.com/apache/arrow-datafusion/pull/9461

Note that there are some issues with properly testing this out, as noted in the description (though i could be missing something GitHub action trick), and so the effort would require a follow up PR.

The idea is to just run the usual benchmarks that devs do locally, and then post a comment with the results to the PR, so nothing fancy.

I can also post some ideas I had about potential follow-up work later on if this makes sense.

alamb commented 8 months ago

Thanks @gruuya -- I'll check out https://github.com/apache/arrow-datafusion/pull/9461 shortly. I am back catching up with alerts

korowa commented 8 months ago

Just some thoughts based on #9800 -- at this moment benchmark results may have significant fluctuations (which is expected), but on the other side, an example from #9461 (https://github.com/splitgraph/arrow-datafusion/pull/1) shows that is fine to use them for tracking some significant performance regressions.

Maybe we should increase 5% "no change" threshold for these benchmarks? It won't be able to show majority of improvements, but it still will be able to highlight if changes are literally breaking in terms of performance.

UPD: another options (or additions to modifying threshold) might be increasing the number of iterations / sleeping between them

gruuya commented 8 months ago

Hey @korowa, thanks for your observations! Indeed my own experience suggests a slight bias against the PR results for some reason: typically a couple of queries reported with 1.05-1.3 slow-up (even if nothing is changed), so I'd ignore anything in that range atm (not sure it's worth increasing the 5% threshold though).

That said, I think the current setup is sufficient to catch larger regressions for now. I also don't think increasing the number of iterations / sleeping between them on it's own would be good enough, since we'd trade that against the increased longitudinal performance variance component of the shared GitHub runners, but I guess it might be worth a try.

The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.

Following that I see the list of improvements as:

(optional) simplify CI benchmark development: https://github.com/apache/arrow-datafusion/issues/9638
add more benchmarks (a selection of Clickbench + TPC-DS queries)
track benchmarks over time, through a similar job triggered by merge commits to main (fwiw, I now prefer Bencher to conbench, as it seems simpler to setup/maintain)
(optional) re-base benchmarks on criterion.rs; this provides a neat standardized way of running/analyzing/collecting these stats in as-of-recently a bit user-friendlier manner: https://www.tweag.io/blog/2022-03-03-criterion-rs/

alamb commented 8 months ago

The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.

I agree with this. How can we make it happen? If we found a monetary sponsor to host the runner, would you be willing to set it up?

Another potential thing to do is to increase the data size / time required per query. As @korowa notes when the queries take only a few 10s of ms to run, the variablity related to the runners is a significant portion of overall query time. I think this will still be a problem with a dedicated runner.

This would of course increase the time required to run benchmarks.

gruuya commented 8 months ago

would you be willing to set it?

Yup, I can do that.

Another potential thing to do is to increase the data size / time required per query.

Yeah, I think something like adding TPC-H SF 10 would help somewhat.

epompeii commented 7 months ago

track benchmarks over time, through a similar job triggered by merge commits to main (fwiw, I now prefer Bencher to conbench, as it seems simpler to setup/maintain)

@gruuya let me know if you run into anything or have any questions getting setup with Bencher. I would be more than happy to help answer any questions or with parts of the integration work here.

(optional) re-base benchmarks on criterion.rs

If you do move to Criterion, Bencher has a built-in adapter for Criterion which should make things pretty simple.

The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.

@alamb if you all want to build things out yourselves, the Rustls bench runner may be a good starting point. I recently wrote a case study of the Rustls continuous benchmarking setup if that is of interest.

Another possibility is that I am working on Bencher Cloud Bare Metal. It is a fleet of identical hardware and benchmarks are run on the bare metal servers. I can go into more detail if that is something you all want to explore.

apache / datafusion

Run DataFusion benchmarks regularly and track performance history over time #5504

Step 1: transform benchmark data for graphing

Step 2: script to gather data for each commit