dask / community

For general discussion and community planning. Discussion issues welcome.

Sources for representative benchmarks #192

Open jcrist opened 2 years ago

jcrist commented 2 years ago

Over the years we've had good results profiling specific (and reproducible) benchmarks from the larger pydata community and using the results to make further improvements to dask. Good workloads have come from community blog posts, performance issues from the pangeo community, etc. It would be good to periodically collect large benchmarks like this and use them to motivate and track larger performance changes. Random datasets can be great for tests, but they aren't always representative of real-world problems.

Opening this here as a place to solicit/collect links to notebooks, blog posts, etc. that provide reproducible examples we can use to profile and improve dask over time.

mrocklin commented 2 years ago

👍

There are a few different levels of larger benchmark that we could consider, at different levels of effort:

  1. Something that runs well on a laptop
  2. Something that runs well on a distributed cluster
  3. Something that runs well on a distributed cluster, on a real-world external dataset

I think that all three have value and that we should embrace them all, but obviously the latter two require additional effort to run. If we work towards all three (which I support) then we should probably separate them a bit so that folks can run all single-machine workflows easily, for example.
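As a rough illustration of level 1, a laptop-scale benchmark could be wrapped in a small timing harness so results stay comparable across runs. This is only a sketch, not an agreed-upon format: the `run_benchmark` helper and the placeholder workload are hypothetical, and a real entry would substitute an actual dask computation (e.g. a `.compute()` call on a representative dataset).

```python
import statistics
import time

def run_benchmark(workload, repeat=3):
    """Time a zero-argument workload several times; report min/median wall time."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return {"min": min(timings), "median": statistics.median(timings)}

def example_workload():
    # Placeholder work. A real level-1 benchmark would build and compute a
    # dask graph here, e.g. df.groupby(...).agg(...).compute() on a
    # representative dataset that fits on a laptop.
    sum(i * i for i in range(100_000))

results = run_benchmark(example_workload)
print(f"min={results['min']:.4f}s  median={results['median']:.4f}s")
```

Recording min and median rather than a single run helps separate real regressions from machine noise, which matters once results from many contributors are compared.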

jsignell commented 2 years ago

I can imagine https://github.com/dask/dask-benchmarks being repurposed/expanded for this. But the main issue is probably finding good benchmarks, not where to store what we find :)