NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

Testing: Standardized workflow and datasets for speed and performance benchmarking #19

Closed. Opened by OlivierBinette; closed 2 months ago.

OlivierBinette commented 9 months ago

There's currently a good unit test setup for checking functionality.

However, when developing new features, how could one go about benchmarking the performance of alternatives?

This is partly backend-specific, but I think it is important to have a way to check performance, especially if there are functions relying on sklearn, numpy, or other Python packages.

A few questions to answer here would be:

I think the solution to this should be kept as simple as possible. It'd be great to have a class that I can instantiate to configure a performance comparison, run it, and then save a short markdown report that contains the results and my system configuration.

One thing I'd use this for, specifically, is to compare the performance of sklearn metrics to Ibis implementations. Sklearn metrics are very slow in my experience and I've struggled with them.
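
For concreteness, here is a minimal sketch of the kind of helper described above. It is not mismo API: the class name `PerfComparison`, its methods, and the report format are all hypothetical.

```python
import platform
import time
from pathlib import Path


class PerfComparison:
    """Hypothetical helper: time alternative implementations of one task
    and save a short markdown report with results and system info."""

    def __init__(self, name, n_runs=5):
        self.name = name
        self.n_runs = n_runs
        self.results = {}

    def run(self, label, fn, *args, **kwargs):
        # Best-of-n wall-clock timing; a real version might also record memory.
        self.results[label] = min(
            self._time_once(fn, *args, **kwargs) for _ in range(self.n_runs)
        )

    @staticmethod
    def _time_once(fn, *args, **kwargs):
        start = time.perf_counter()
        fn(*args, **kwargs)
        return time.perf_counter() - start

    def save_report(self, path):
        lines = [
            f"# {self.name}",
            "",
            f"Python {platform.python_version()} on {platform.platform()}",
            "",
            "| implementation | best time (s) |",
            "| --- | --- |",
            *(f"| {label} | {secs:.4f} |" for label, secs in self.results.items()),
        ]
        Path(path).write_text("\n".join(lines) + "\n")


# Usage: the two callables are placeholders for e.g. an sklearn metric
# and an Ibis/DuckDB implementation of the same comparison.
cmp = PerfComparison("toy comparison")
cmp.run("list comprehension", lambda: [i * i for i in range(1_000_000)])
cmp.run("generator + sum", lambda: sum(i * i for i in range(1_000_000)))
cmp.save_report("perf_report.md")
```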

NickCrews commented 9 months ago

I agree benchmarking is important.

I would focus on only DuckDB for now, though the framework should be designed so it can be extended to other backends. I say this because, when I am working on my laptop, DuckDB is the obvious choice: why would I use polars or pandas instead, when I can't see any advantage they have over it? I can see the benefit of Spark, but I don't know how much benefit there would be for the added complexity, so leave it out of the first iteration?
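
As a sketch of what "DuckDB only, but extensible" could look like in the test suite (assuming pytest; the fixture name and `BACKENDS` list are illustrative, not existing mismo code):

```python
import ibis
import pytest

# Only DuckDB for now; adding e.g. "datafusion" or "pyspark" later only
# requires extending this list and the fixture body.
BACKENDS = ["duckdb"]


@pytest.fixture(params=BACKENDS)
def con(request):
    if request.param == "duckdb":
        return ibis.duckdb.connect()  # in-memory DuckDB connection
    raise NotImplementedError(request.param)
```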

I think we also need to plan ahead for new versions of DuckDB coming out. New versions will have better performance, so we need to hold the DuckDB version constant when comparing runs. But we probably don't want to force all future benchmarks to use an ancient version of DuckDB forever; instead we should make it possible to plug a new DuckDB into an old mismo.

Perhaps https://github.com/airspeed-velocity/asv is a solution? I need to look at how it stores results. Ideally, in the spirit of simplicity, results would just be a set of JSON/YAML files in this repo? It would be great to avoid an external datastore.
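
For reference, asv benchmarks are plain Python: methods prefixed with `time_` are timed, `setup` runs before each measurement, and results are stored as JSON files in a configurable directory, so they could live in this repo. A minimal sketch (the file name, class, and workload below are made up, not mismo code):

```python
# benchmarks/bench_example.py -- hypothetical asv benchmark file

class TimeToyBlocking:
    # asv runs each time_* method once per combination of params.
    params = [1_000, 10_000]
    param_names = ["n_rows"]

    def setup(self, n_rows):
        # Build whatever input the benchmark needs, e.g. an Ibis/DuckDB table;
        # a plain list keeps this sketch self-contained.
        self.keys = list(range(n_rows))

    def time_pairwise(self, n_rows):
        # Placeholder workload; a real benchmark would call into mismo.
        [(a, b) for a in self.keys[:200] for b in self.keys[:200]]
```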

One requirement I would like is to make it easy to backport a benchmark to old versions of mismo. So we write a new benchmark and add it to this repo (we could store benchmarks in another repo, but keeping them all together would be better in my mind), and we want a way of running that new benchmark against the mismo from 6 months ago.

NickCrews commented 9 months ago

I just sank 2 hours into this, but eventually gave up because I didn't find anything that I liked that much :(

Thoughts for the next attempt:

Someone else's thoughts on choosing between these tools.

- asv
- pytest-benchmark
- pytest-memray
- pytest-monitor

NickCrews commented 2 months ago

I have implemented several tests with pytest-benchmark; grep the codebase for "benchmark" and you'll find them. It seems to be working OK for us so far. I'm going to close this as completed; we can open a new issue to iterate on this process if we need to.
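
For anyone landing here later, a pytest-benchmark test looks roughly like the sketch below (illustrative only, not one of the actual tests in the codebase); the `benchmark` fixture calls the function repeatedly and records timing statistics:

```python
import ibis


def test_filter_count_speed(benchmark):
    # Tiny in-memory table; Ibis executes it on DuckDB, its default backend.
    t = ibis.memtable({"key": list(range(10_000))})

    def run():
        return t.filter(t.key % 2 == 0).count().execute()

    # ``benchmark`` times repeated calls to ``run`` and returns its result.
    result = benchmark(run)
    assert result == 5_000
```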