Open alamb opened 1 year ago
I had started working on this, not just for DataFusion, but for other OSS query engines as well.
See https://sqlbenchmarks.io/sqlbench-h/results/env/workstation/sf10/single_node/ for an example.
I was planning on automating this to run nightly, or on each merge to master, but have not found the time to do this yet.
I was planning on automating this to run nightly, or on each merge to master, but have not found the time to do this yet.
This would be fantastic!
How are you considering running the benchmarks in CI @andygrove ?
I've been working on a tool similar to conbench
called Bencher (https://github.com/bencherdev/bencher), and I've been trying to explore the best ways to do continuous benchmarking.
How are you considering running the benchmarks in CI @andygrove ?
Using CI in general would be ideal.
Using the github runners is probably not a great idea as they are quite variable (they are on shared machines and there isn't any visibility into what else is going on on those machines / VMs)
I was thinking about this the other day -- and it seems to me this might be a perfect usecase for a timeseries database (which full disclosure we have at InfluxData). I was thinking that one way to record the history over time is use a widely supported format like LineProtocol -- https://docs.influxdata.com/influxdb/cloud-iox/reference/syntax/line-protocol/
Then we can visualize and display that data over time using existing tools like grafana. Also the "alert if things get slower" sounds a lot like the kinds of alerts used in timeseries databases.
I'll try and whip up some way to convert benchmark results into line protocol over the next few weeks
My high level plan is something like:
@alamb this is effectively what I have built with Bencher. It tracks benchmarks over time, allows you to visualize the results (and share them as an auto-updating README image), and uses statistical thresholds to generate alerts, in the case of performance changes. There's a CLI and a GitHub Action to make it easier to run both locally and in CI.
Using CI in general would be ideal.
Definitely! As for compute, I've heard folks mention https://buildjet.com as an option. In the long run, I'm looking to add AWS Bare Metal runners to Bencher to help with this end of things. So please let me know if you are interested in going that direction.
So I can find the compute resources to run this, and I think we can use existing timeseries databases to track performance over time (e.g. IOx or grafana). The only missing piece is getting the data into a format that can be loaded into these systems
I think we could get a simple version of this feature going like:
It seems as if conbench is now more actively maintained that it once was and claims to have rust support: https://github.com/conbench/conbench
Perhaps it is worth another look
I am interested in spending some time with helping out on benchmarking. Currently checking out conbench, but if another solution is needed I can look into that as well.
I'm interested to know how many permutations we need to track continually (eg. machine size, simd or not, etc).
Looking for any additional thoughts / guidance besides what I have read so far in this issue.
Hi @Smurphy000 that would be amazing 🙏 . This is one of the issues I think is critical to the long term success of DataFusion but has been hard to attract attention for.
The key, in my mind, is to minimize the complexity and infrastructure requirements of this solution, as DataFusion doesn't (yet) have the kind of resources to keep a custom system operating.
I first recommend checking out https://github.com/apache/arrow-datafusion/issues/6107 and seeing if you can write a python script / rust program that takes the json output of a benchmark run and makes a single line for each query run with the relevant parameters. That issue has example data and desired output. You might also have to extend the rust benchmark runner.
In terms of implementation, I suggest starting with one setup only (InfluxData can supply a machine / VM initially -- likely a 8core, 32GB of ram machine) and then we can expand the tested combinations as our needs do as well
So in my mind, the ideal solution looks like:
If you fancy a bit of bash scripting, maybe you could potentially start with bench.sh and extend it to check out the desired SHAs (I could do do this part too, if you were able to do https://github.com/apache/arrow-datafusion/issues/6107)
Here is one way a testing session might look
# setup, fetch all needed data files
./benchmarks/bench.sh data
# Run tpch benchmarh on commit 4819e7a,
# leave results in ./benchmarks/results/4819e7a
# Would add appropriate lines to ./benchmarks/results/history.lp
./benchmarks/bench.sh run tpch --commit 4819e7a
Then we could write up instructions on how to visualize the data in history.lp (I would probably use influxdb/grafana as that is what I know)
Does that make sense?
@alamb the port of InfluxDB over to Rust is super cool. Congrats! I'm considering using it long term, if/when Bencher needs a supplemental backend for results storage.
Before @Smurphy000 goes reinventing the wheel here though, I just wanted to point out that Bencher handles all of this out of the box. All you would need to do is add a flag to your existing runner to output JSON in the expected format, and you're set: https://bencher.dev/docs/explanation/adapters#-json
Bencher would then handle Step 1 (example of continuous benchmark dogfooding with Bencher)
And the Bencher CLI makes it trivial to handle Step 2: bencher run "./benchmarks/bench.sh data"
Then you could also detect performance regressions in CI and even have it comment on PRs: https://bencher.dev/docs/explanation/thresholds/
As far as the orchestration of jobs from GitHub -> dedicated VM, this is something I am actively exploring making easier to do. I'm looking at creating an extension to Bencher for creating GitHub Actions Self-Hosted runners: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners I'm also open to other ideas if you all have any!
@epompeii -- I would love it if someone could explore the possibility -- I haven't had the time to look into whatever bencher supports. There are many different continious benchmarking frameworks / services and the reason I think people keep reinventing the wheel is that the mental effort to integrate with the frameworks (e.g. understanding the bencher json format , understanding what a slug is, or how to map important metadata like "num cpus" to this model) is often larger than creating something custom (or it least appears so from the outside)
In my opinion this is why we don't use conbench -- no one has invested enough time to meld that model with the datafusion numbers
I agree it is infuriatingly repetitive. However unless you (or someone else) directly integrates DataFusion with one of these frameworks in a way that works, I am going to expend my effort explaining what is easiest for me (aka tools I already know), unfortunately.
Currently I am looking into the few options we currently have (conbench, bencher, custom) and I can update here with my findings so some agreement on the best option can be made. I am starting with @alamb idea selfishly to learn a few new things, but I definitely want to take a look into existing frameworks!
@alamb that totally makes sense. I'm the maintainer of Bencher, so if you or @Smurphy000 would like a walk through/to hop on a call, that may be a pretty quick way to map concepts. Long term, I'm hoping that the docs get good enough to make this mapping super easy and intuitive. That's still a work in progress though 😃
To answer your specific example about "num cpus" though, that would be metadata about the Testbed (https://bencher.dev/docs/explanation/benchmarking/#testbed). Currently the best way to have that available in Bencher itself is to include it in the name. You can compare benchmarks results by Testbed. (ex "Debian x86 4 CPU 8GB" vs "Debian x86 8 CPU 16GB")
And definitely explore all your options @Smurphy000 ! If you haven't seen it already, I've compiled a pretty comprehensive list of prior art in the space, that may be helpful here: https://bencher.dev/docs/reference/prior-art/ Let me know if you find anything that you think I should add!
@gruuya and I were talking about this project recently and he said he may have some ideas on how to proceed / push it forward.
Ok, I think I'm up to speed with the history of this issue now.
To me it seems that there are 2 separate action items discussed here:
Also looks like both conbench and bencher are eligible to handle the above.
IMO it's best to try and tackle 1. for now, as this would present a big win, and would pave the way to 2 as well.
@gruuya I would be more than happy to help with the integration of Bencher, which could actually help with both 1. and 2. I set up a similar approach to 1. for Diesel recently: https://github.com/diesel-rs/diesel/pull/3849 It used relative benchmarking, which looks something like this: https://bencher.dev/docs/how-to/track-benchmarks/#relative-benchmarking So you only compare the base commit and the current/merge commit on the same runner, like you want.
For 2. we would just benchmark on pushes to main
and you would be able to visualize them with a Bencher Perf Page, this in an example for the Rustls project.
IMO it's best to try and tackle 1. for now, as this would present a big win, and would pave the way to 2 as well.
I agree this is a good idea / approach.
Note I tried to figure out conbench in the past and I didn't get very far -- though I didn't spend a huge amount of focused time on it.
Ok, so I went ahead and took a stab at 1. https://github.com/apache/arrow-datafusion/pull/9461
Note that there are some issues with properly testing this out, as noted in the description (though i could be missing something GitHub action trick), and so the effort would require a follow up PR.
The idea is to just run the usual benchmarks that devs do locally, and then post a comment with the results to the PR, so nothing fancy.
I can also post some ideas I had about potential follow-up work later on if this makes sense.
Thanks @gruuya -- I'll check out https://github.com/apache/arrow-datafusion/pull/9461 shortly. I am back catching up with alerts
Just some thoughts based on #9800 -- at this moment benchmark results may have significant fluctuations (which is expected), but on the other side, an example from #9461 (https://github.com/splitgraph/arrow-datafusion/pull/1) shows that is fine to use them for tracking some significant performance regressions.
Maybe we should increase 5% "no change" threshold for these benchmarks? It won't be able to show majority of improvements, but it still will be able to highlight if changes are literally breaking in terms of performance.
UPD: another options (or additions to modifying threshold) might be increasing the number of iterations / sleeping between them
Hey @korowa, thanks for your observations! Indeed my own experience suggests a slight bias against the PR results for some reason: typically a couple of queries reported with 1.05-1.3 slow-up (even if nothing is changed), so I'd ignore anything in that range atm (not sure it's worth increasing the 5% threshold though).
That said, I think the current setup is sufficient to catch larger regressions for now. I also don't think increasing the number of iterations / sleeping between them on it's own would be good enough, since we'd trade that against the increased longitudinal performance variance component of the shared GitHub runners, but I guess it might be worth a try.
The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.
Following that I see the list of improvements as:
The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.
I agree with this. How can we make it happen? If we found a monetary sponsor to host the runner, would you be willing to set it up?
Another potential thing to do is to increase the data size / time required per query. As @korowa notes when the queries take only a few 10s of ms to run, the variablity related to the runners is a significant portion of overall query time. I think this will still be a problem with a dedicated runner.
This would of course increase the time required to run benchmarks.
would you be willing to set it?
Yup, I can do that.
Another potential thing to do is to increase the data size / time required per query.
Yeah, I think something like adding TPC-H SF 10 would help somewhat.
track benchmarks over time, through a similar job triggered by merge commits to main (fwiw, I now prefer Bencher to conbench, as it seems simpler to setup/maintain)
@gruuya let me know if you run into anything or have any questions getting setup with Bencher. I would be more than happy to help answer any questions or with parts of the integration work here.
(optional) re-base benchmarks on criterion.rs
If you do move to Criterion, Bencher has a built-in adapter for Criterion which should make things pretty simple.
The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.
@alamb if you all want to build things out yourselves, the Rustls bench runner may be a good starting point. I recently wrote a case study of the Rustls continuous benchmarking setup if that is of interest.
Another possibility is that I am working on Bencher Cloud Bare Metal. It is a fleet of identical hardware and benchmarks are run on the bare metal servers. I can go into more detail if that is something you all want to explore.
Is your feature request related to a problem or challenge? Please describe what you are trying to do. As we make changes to DataFusion, some changes impact performance and some do not. Right now we mostly rely on reviewers to judge when a change could make an impact on performance, and if so run the appropriate benchmarks.
This means that
Describe the solution you'd like I would like
Suggestion
I believe conbench, https://conbench.ursa.dev/, which is partially integrated into the repo already, is intended for exactly this usecase. Using conbench would be nice as it appears to be actively maintained and has resources and is already hosted
The integration is https://github.com/apache/arrow-datafusion/tree/main/conbench and was added in https://github.com/apache/arrow-datafusion/pull/1791 by @dianaclarke
You can see its integration as it posts comments on PRs after merge such as https://github.com/apache/arrow-datafusion/pull/5476#issuecomment-1456748444
Describe alternatives you've considered We could use existing timeseries databases and visualizations like grafana to visualize the information
Additional context