Improving Python benchmarking tooling

asmeurer commented 2 years ago

At the NumFOCUS summit a few weeks ago, several people had a conversation about the limitations of asv and the other Python benchmarking tooling. I'd like to use this issue to kick off the discussion about this so that we can gather a wishlist and a list of problems with the existing tooling, and figure out a plan on how to improve things. It's not clear yet whether that will mean improving asv, some other tool, or creating new tools.

Some people who I want to make sure are involved in this discussion (please feel free to CC others):

@jreback @oscarbenjamin @wkerzendorf @jarrodmillman

asmeurer commented 2 years ago

Here's my own wishlist/list of concerns with asv:

My biggest issue with asv is that it is very tied to one specific use case, namely, running a suite of benchmarks across a history of commits for a project, and analysing the history of runtimes. When you stay within asv's designed use-case, it works very well. However, it is very difficult to do anything that strays even a little bit from that use case. Examples of things that are difficult (or even impossible) to do currently with asv:

Run just a single benchmark against master. This requires some confusing command line flags.
Run benchmarks against a development piece of code. i.e., you change some code and want to see how it affects things. Typically if I want to do this I end up just copying the benchmark code and pasting it into an IPython %timeit.

This is hard because asv is designed to download the code from a remote repository (which is hard coded in a config file), and install everything into an isolated virtual environment. It doesn't have a concept of "running a benchmark locally". At best, you can manually edit the config file (which usually is a checked-in file), but it's very annoying.

In general, asv has a specific workflow on how it runs the benchmarks (download the code, create a virtualenv, etc.), and there's virtually no way to run benchmarks that doesn't fit into this paradigm.
It's not possible to run the setup() function per benchmark (https://github.com/airspeed-velocity/asv/issues/966). More generally, there's limited ways that you can configure how the workflow of setup(), run_benchmark(), teardown() happen.
The virtual environment installation is limited in its configurability. In another project I work on at Quansight, I have a very hacky script that overwrites how the virtualenv installation happens, because I have a dependency that must have a specific version installed depending on what commit is used.
It's hard to scale up a benchmarking suite. There's no concept of "fast benchmarks" which should run every time and "slow benchmarks" that should only run, e.g., on every release (similar to "fast" and "slow" tests). I would also say more generally that it's challenging to have any kind of "slow" benchmark (anything that takes more than a second to run) as part of an asv benchmarking suite.

More broadly speaking, I think it's important to understand why we might want to have a benchmark. There are many different use cases for a benchmark. They can

Identify performance regressions
Measure whether a change will improve performance
Compare the performance of similar tasks across different tools
Give empirical evidence of how an algorithm scales (its "big-O")
Provide explicit targets of known important use-cases for optimization

asv is designed around use-case 1., but it is very difficult or impossible to use it for the other use cases.

I will say that there are features of asv that I do like. I like that you can just set it running and it more or less just does what it should (although there are few speed bumps here, like the fact that the percentage it shows while it runs isn't accurate). And I like that it produces nice static graphs that you can easily share on the web.

Outside of asv, but related to this discussion, I think that benchmarking CI hardware is a major problem. Running benchmarks on traditional CI is something that doesn't work well. We have been doing it on SymPy and it's so flaky that I (at least) generally just ignore it. For example, here are the runs on a recent PR of mine that only touches documentation (i.e., it cannot possibly affect any benchmarks), which show multiple benchmarking differences. It's hard to tell whether these are due to the benchmarks themselves being flaky or the hardware having inconsistent timing. Also the general format of the CI output isn't very useful. It would be nice to have a more standardized system that produces nicer outputs.

Again, I don't know what the solution to these concerns should look like yet, whether we should try to expand asv's capabilities or to build new tools (or use existing tools that I'm not yet familiar with), or both. For now, I just want to make sure that all the needs of the community are expressed so we can decide what is most important, and then figure out how to achieve it.

wkerzendorf commented 2 years ago

I would like to add that it would be nice to include a 'pytest-benchmark' study to see if that could replace asv in certain cases

datapythonista commented 2 years ago

I agree with your points, and personally I think it'd help a lot if we had 3 somehow independent components instead of asv how it is now:

A minimal tool to run a single benchmark
A tool to run benchmark suites and generates the output in a standard format
A UI to visualize benchmark results over time

The main problem here is that asv is mostly abandoned and has zero maintainers or contributors right now

asmeurer commented 2 years ago

The main problem here is that asv is mostly abandoned and has zero maintainers or contributors right now

Just to be clear, one of the outcomes of the discussions at the summit was that we might be able to fix this and have some people work on this. But if we do so, it would be good to have a more guided effort. Most of the efforts that projects have put into benchmarking so far have been specific to their own project, and so have always been things that either can't be reused at all or are so tied a specific use-case that they are hard to generalize. That's why I want to start by gathering a list of the current needs so we can try to figure out what a good set of Python benchmarking tools should look like.

mattip commented 2 years ago

Is there a summary of the discussion that already started? What are some of the pain points from others in the NumFOCUS summit, and where was it felt energy should be invested?

dharhas commented 2 years ago

We had a discussion about this today at our nasa-roses monthly project meeting which had representation from Numpy/SciPy/Pandas/SciKit-Learn + @asmeurer. There is consensus that several projects want to work on this in a coordinated way. We decided that opening this issue was a good first step towards coordinating.

@jreback has a similar list of pain points and will be contributing them in this issue.

asmeurer commented 2 years ago

@wkerzendorf did anyone take notes at the NumFOCUS session? All the pain points that were discussed that I can remember I mentioned in my comment above. I think the main takeaway from the discussion was that we should try to collaborate across projects to improve things, rather than continuing to do project-specific workarounds (hence this issue).

jreback commented 2 years ago

from @jbrockmendel an asv wishlist

1) Implement an API; ATM there is only the command line interface, makes it extremely difficult to grok parts of the code in isolation 2) Support 'setup_class' in addition to 'setup', using pytest semantics for when it gets run. This will allow users to avoid unnecessarily re-running potentially expensive setup code. 3) Use pytest semantics for parametrization 4) Make it possible to change/add/refactor benchmarks without nuking the entire history https://github.com/airspeed-velocity/asv/issues/1218 5) Use something more performant than JSON files read/written to move information between processes. 6) Speed benchmark discovery (and avoid re-doing it per-process) xref https://github.com/airspeed-velocity/asv/issues/908 7) Fix the --profile option; I've never gotten it to work. 8) Use something like ccache or cython's caching to speed builds. 9) Refactor to make the environment management, benchmark running, and display/server modular

trexfeathers commented 2 years ago

You're doing a great job of highlighting the mantra that "benchmarking is hard"! It can mean so many different things, and in every case there are many pit-falls to ensure your results are 'true'.

If ASV overcame all these challenges it would be even more complex than it is now. I'd recommend separating concerns into:

The core bits that most use-cases need
- Benchmark runner
- Results databases
- Results visualisation
- ...
The bits that are specific to individual use-cases
- Commit management
- Dependency management
- Scalability measurement
- ...

ASV's modules already help with this, but there is no distinction at a user level. There are probably many ways to improve this (the most extreme version being a core package and plugin packages).

danielsoutar commented 2 years ago

The main problem here is that asv is mostly abandoned and has zero maintainers or contributors right now

Just to be clear, one of the outcomes of the discussions at the summit was that we might be able to fix this and have some people work on this. But if we do so, it would be good to have a more guided effort. Most of the efforts that projects have put into benchmarking so far have been specific to their own project, and so have always been things that either can't be reused at all or are so tied a specific use-case that they are hard to generalize. That's why I want to start by gathering a list of the current needs so we can try to figure out what a good set of Python benchmarking tools should look like.

I'm not a dev on any of the Python scientific libraries, but I really rated asv's visualisations when prototyping/hacking something to work with a C++ codebase I previously worked on a few years ago. I always thought it'd be a cracking tool for benchmarking in general and having to force things through Python was a shame. I'd love to contribute to this if there's a roadmap of some kind, particularly to standardisation of the JSON/inputs to the visual 'backend' of asv.

oscarbenjamin commented 2 years ago

think that benchmarking CI hardware is a major problem. Running benchmarks on traditional CI is something that doesn't work well. We have been doing it on SymPy and it's so flaky that I (at least) generally just ignore it. For example, here are the runs on a recent PR of mine that only touches documentation (i.e., it cannot possibly affect any benchmarks), which show multiple benchmarking differences. It's hard to tell whether these are due to the benchmarks themselves being flaky or the hardware having inconsistent timing.

I think that you are just misinterpreting the output there. The "PR vs master" section shows no changes. However the "master vs previous release" section (correctly) shows that some benchmarks are now running faster as a result of improvements that are not yet released. You can see the benchmark results for the PR that actually made those improvements here: https://github.com/sympy/sympy/pull/23821#issuecomment-1193445

There is some variability in timings but I actually think that running the benchmarks in CI works fairly well for SymPy works fairly well for much of the benchmark suite. The main problem with them for SymPy is that many operations are cached so sometimes the benchmark results report something like a 50% slowdown but it's actually a 50% slowdown on a cached result where the actual time to do something once is much slower than the time reported by asv.

oscarbenjamin commented 2 years ago

To me the biggest limitation of asv for SymPy is that I want to write benchmarks that can be shared across multiple projects in order to compare timings for different software e.g. if there is a benchmark for a particular operation then I want to be able to reuse that for SymPy, SAGE, Pari, Julia, Maxima etc. Basically I don't want the benchmarks themselves to be written as Python code because I want to be able to use them with/from software that doesn't involve Python at all.

This actually extends beyond benchmarking to unit tests as well. There is no real need for unit tests and benchmarks to be different things. They are all just examples of things that can be done with the software. If SymPy's extensive unit test suite was usable to report detailed timing information on each operation then that would be a huge source of benchmarking information but the unit tests are also just written in pytest-style test_ Python functions. Basically I want most of the "test cases" and "benchmarks" to be more like data rather than code and to be usable from many different ecosystems rather than just from Python so that different projects with related functionality don't have to maintain independent test suites and can be compared both for correctness and speed.

tylerjereddy commented 2 years ago

One problem we have with the darshan project is that the Python interface doesn't control the building of the C code, and I don't think there's an asv approach to kind of custom-build C-code first and then build the Python part that uses it via pip.

I can't decide if it is more reasonable to ask the C devs to allow the Python portion of the project to have its own separate build system that also builds/links the C code, or if it would be a fairly minor thing to provide a custom set of build commands to asv to perform on a per-commit basis.

One other thing is that darshan depends on data files/assets a lot for benchmarking because it is a log-parsing library, but needing to place the assets in pip-viable repo is not ideal vs. say having a way to pull in data assets for benchmarking in a more customizable way perhaps? I don't know if that could play in with some of the community work to lean on stuff like pooch for pulling in data assets in a smart way (probably also relates to some comments above about challenges for doing large/slow benchmarks).

asmeurer commented 2 years ago

I can't decide if it is more reasonable to ask the C devs to allow the Python portion of the project to have its own separate build system that also builds/links the C code, or if it would be a fairly minor thing to provide a custom set of build commands to asv to perform on a per-commit basis.

This sounds like a similar sort of problem to the one I described above where I needed to hack around asv's inability to install a different version of a dependency depending on the commit. I think the build stage needs to be much more customizable. Right now it makes some pretty hard assumptions about how the project is built/installed into a virtual environment and how that virtual environment is cached across runs.

I'd also like for virtualenv isolation to be completely separated as a higher level step from actually running the benchmarks, so that you can just "run" the benchmarks against the dev code with the current Python (similar to just running pytest vs. isolating the tests with something like tox).

ngoldbaum commented 2 years ago

A pattern I've noticed reading this over is that people are pretty happy writing benchmark suites with asv (except some nits around parameterized benchmarks) and with asv running in the background on CI generating benchmarks over time as a project evolves, but are unhappy using asv for other benchmarking tasks.

In the past I wrote a set of benchmarks for my unyt library based on the pyperf benchmark harness. I personally found pyperf much nicer to work with and simpler to work with than asv. pyperf's main focus is to run individual stable benchmarks. It has no support for setting up a suite of benchmarks, running benchmarks over a project history, or for setting up isolated testing environments.

I wonder how much work it would be to replace asv's multiprocessing-based bespoke benchmarking with shell calls to pyperf. That way a user could very easily drop down to just using pyperf on a single benchmark if they want to drill down on one thing at a time.

Like others in this thread, I also find asv's codebase hard to grok, with lots of long function implementations that are tied to individual asv command-line options. It might also make sense to write new tools for writing a benchmark suite that runs pyperf under the hood (e.g. pyperformance does this for the Python benchmarks, but is not a general tool) and a tool for running a benchmark suite over a project's history, but given the number of downstream users of asv it might be pragmatic to keep user-visible changes minimal and instead try to refactor asv to make it more approachable.

asmeurer commented 2 years ago

Thanks Nathan. Maybe pyperf is the answer to the "just run a single benchmark" use-case that is so hard with asv right now. I haven't had a chance to use pyperf before, but I like at least in principle some of the features (like the tuning feature). Just to take one of the use-cases that is currently not so easy with pure asv, how hard is it to take an existing benchmark (or set of benchmarks) from an asv-style benchmark suite and run them against some uncommitted development changes for a library? Does this usage already work out of the box or would it require some changes to the benchmarking suite, or some new code that wraps pyperf?

I'm also curious how much of the asv benchmarking running internals are worth keeping and how much is already implemented by pyperf (things like isolating benchmark runs, doing proper statistics, and so on).

By the way, I just now realized for the first time from reading your comment that pyperf and pyperformance are two separate projects.

ngoldbaum commented 2 years ago

how hard is it to take an existing benchmark (or set of benchmarks) from an asv-style benchmark suite and run them against some uncommitted development changes for a library?

Not trivial, but not all that bad really, at least for a simple benchmark. For example, here's a pyperf-based benchmark that uses one of the sympy benchmarks:

import pyperf
from polygon import PolygonAttributes

p = PolygonAttributes()

runner = pyperf.Runner()
runner.bench_func("sympy PolygonAttributes", p.time_create)

To run against a development build of sympy, you'd simply install pyperf and the sympy development version into the python environment you're working in and run pyperf using that environment.

There are a few issues, this API doesn't support setup or teardown functions. There also doesn't appear to be a way to run pyperf from the command line by referring to a python function, the CLI only accepts statements. I guess currently if you wanted to run some setup code you'd need to manually run it in the script first?

I'm also curious how much of the asv benchmarking running internals are worth keeping and how much is already implemented by pyperf (things like isolating benchmark runs, doing proper statistics, and so on).

I think this is mostly all implemented in pyperf but haven't done a detailed comparison.

asmeurer commented 2 years ago

Another thing that I think has only been indirectly referenced here is the ability to specify whether a benchmark can safely be rerun in the same process, or whether a benchmark needs to be rerun in a new process each time, which is also somewhat related to specifying whether setup() runs before each benchmark run https://github.com/airspeed-velocity/asv/issues/966. Running setup() once and keeping the benchmarks within the same process is obviously much faster, but some benchmarks are inaccurate when run multiple times in the same process because of caching or some other internal state that would change after the first run (SymPy benchmarks are a prime example of this because of the SymPy cache).

airspeed-velocity / asv

Improving Python benchmarking tooling #1219