TomAugspurger opened 8 years ago
When benchmarking local changes, I also find `asv dev` to be very useful. Not sure it needs to be mentioned in the README, though.
I think we should also have guidelines for benchmarks: `time_xxx` methods should take on the order of 100-300 ms if possible (obviously some workloads will need more), so that asv can repeat the method several times and output a stable minimum.

Another issue: which timer function should be used? asv's default timer may not be adequate: https://asv.readthedocs.io/en/latest/writing_benchmarks.html#timing
Should we measure CPU time or wallclock time? IMHO we should measure wallclock time: if dask or distributed schedules tasks inefficiently and doesn't make full use of the CPU, it's a problem that should appear in the benchmark results.
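For concreteness, a minimal sketch of what a benchmark following both guidelines could look like. The array size is a placeholder that would need tuning so the timed method lands in the 100-300 ms range, and asv's `timer` attribute is set to `timeit.default_timer` to measure wall-clock rather than CPU time:

```python
import timeit

import numpy as np
import dask.array as da


class TimeArraySum:
    # Wall-clock timer, so scheduling inefficiencies show up in the results.
    timer = timeit.default_timer

    def setup(self):
        # Placeholder size: tune so time_sum takes roughly 100-300 ms.
        self.x = da.from_array(np.random.random((4000, 4000)),
                               chunks=(1000, 1000))

    def time_sum(self):
        self.x.sum().compute()
```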
@TomAugspurger I'm interested in helping with this, partly as a way to become more familiar with the dask API. Is there anything in particular you would prefer me to target, to start?
@danielballan great, thanks! I'm guessing that @mrocklin, @jcrist, and Antoine have the most knowledge on which parts of dask would be best to benchmark.
My current thinking is that we'll have two kinds of benchmarks. The first are higher-level benchmarks that hit things like top-level methods on `dask.array`, `dask.bag`, and `dask.dataframe`. The second kind are benchmarks for "internal" methods in places like https://github.com/dask/dask/blob/master/dask/optimize.py.

I think the first kind will be easier to write benchmarks for as you learn the library (that's true for me, anyway; ATM I have no idea how to write a good benchmark for something in `dask.optimize`).
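As a rough illustration of the first kind, a high-level benchmark could exercise a top-level `dask.dataframe` method end to end. The class name, sizes, and method below are placeholders, not agreed-upon benchmarks:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd


class TimeDataFrameGroupBy:
    def setup(self):
        # Placeholder data; sizes would need tuning for stable timings.
        df = pd.DataFrame({"key": np.random.randint(0, 100, size=1000000),
                           "value": np.random.random(1000000)})
        self.ddf = dd.from_pandas(df, npartitions=10)

    def time_groupby_mean(self):
        # Hits the top-level groupby/aggregation path in dask.dataframe.
        self.ddf.groupby("key").value.mean().compute()
```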
I agree with @TomAugspurger's classification of high-level external benchmarks and internal ones.
I also agree that high-level external benchmarks are probably both the more useful and the more approachable. Actually, I'm curious if, as with all things, we can steal from Pandas a bit here. Are there benchmarks in Pandas that are appropriate to take?
There are some extreme things we can test as well, such as doing groupby-applies with small dask dataframes with 1000 partitions, or calling `delayed(sum)([delayed(inc)(i) for i in range(1000)]).compute(get=...)`. These should be good to stress the administrative side.
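A hedged sketch of how those two stress tests might look as benchmarks. The sizes are illustrative, `inc` is defined inline for self-containment, and the scheduler is left at the default rather than the `get=...` left open above:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask import delayed


def inc(i):
    return i + 1


class TimeSchedulerOverhead:
    def setup(self):
        # Many tiny partitions so the timings are dominated by dask's
        # administrative overhead rather than by pandas work.
        df = pd.DataFrame({"key": np.random.randint(0, 10, size=10000),
                           "value": np.random.random(10000)})
        self.ddf = dd.from_pandas(df, npartitions=1000)

    def time_groupby_apply_many_partitions(self):
        # Groupby-apply across 1000 small partitions.
        self.ddf.groupby("key").value.apply(
            lambda s: s.sum(), meta=("value", "float64")).compute()

    def time_many_delayed_tasks(self):
        # 1001 tiny delayed tasks; the work is almost entirely scheduling.
        delayed(sum)([delayed(inc)(i) for i in range(1000)]).compute()
```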
Other question: I see a couple of existing benchmarks parameterize on the `get` function (`multiprocessing.get`, `threaded.get`, etc.). Is this useful/desired? What are we trying to achieve here?
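For context, parameterizing over the `get` function in asv looks roughly like the sketch below. This uses the old `get=` keyword those benchmarks were written against (newer dask versions use `scheduler=` instead), and the names and sizes are illustrative:

```python
import numpy as np
import dask.array as da
import dask.threaded
import dask.multiprocessing

# The two schedulers being compared; old-style get functions.
GETTERS = {"threaded": dask.threaded.get,
           "multiprocessing": dask.multiprocessing.get}


class TimeArraySumAcrossSchedulers:
    # asv runs each time_* method once per parameter value.
    params = list(GETTERS)
    param_names = ["scheduler"]

    def setup(self, scheduler):
        self.x = da.from_array(np.random.random((2000, 2000)),
                               chunks=(500, 500))

    def time_sum(self, scheduler):
        self.x.sum().compute(get=GETTERS[scheduler])
```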
@pitrou for a bit, I was thinking these benchmarks could be helpful for users to see the overall performance characteristics of the various backends across different workloads. In hindsight it's probably best to keep this strictly for devs.
I'll send along a PR to remove those when I get a chance. Been swamped lately.
This is a sketch for some sections of documentation that should go in the README.
What to test?
Ideally, benchmarks measure how long our project (dask, distributed) spends doing something, not the underlying libraries they're built on. We want to limit the variance across runs to just code we control.
For example, I suspect `(self.data.a > 0).compute()` is not a great benchmark. My guess (without having profiled) is that the `.compute` part takes the majority of the time, most of which would be in pandas / NumPy. (I need to profile all these. I'm reading through dask now to find places where dask is doing a lot of work.)
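One way to keep the focus on dask's own work is to time graph construction and an internal optimization pass separately from the full compute. A hedged sketch: the workload is a placeholder, and `__dask_graph__` / `dask.optimization.cull` are the current names for interfaces that lived elsewhere when this was written:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.optimization import cull  # lived in dask/optimize.py in older versions


class TimeGraphOverhead:
    def setup(self):
        df = pd.DataFrame({"a": np.random.random(100000)})
        self.ddf = dd.from_pandas(df, npartitions=100)
        expr = self.ddf.a > 0
        # Materialize the graph once so time_cull measures only the pass itself.
        self.dsk = dict(expr.__dask_graph__())
        self.keys = expr.__dask_keys__()

    def time_graph_construction(self):
        # Builds and materializes the task graph; no pandas/NumPy work runs here.
        dict((self.ddf.a > 0).__dask_graph__())

    def time_cull(self):
        # One of dask's internal optimization passes.
        cull(self.dsk, self.keys)
```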
Benchmarking new Code

If you're writing an optimization, say, you can benchmark it by:

1. writing your benchmark in `benchmarks/`,
2. setting the `repo` field in `asv.conf.json` to the path of your dask / distributed repository on your local file system, and
3. running `asv continuous -f 1.1 upstream/master HEAD` (optionally with a regex `-b <regex>` to filter to just your benchmark).

Naming Conventions
Directory Structure
This repository contains benchmarks for several dask-related projects. Each project needs its own benchmark directory because `asv` is built around one configuration file (`asv.conf.json`) and benchmark suite per repository.