Closed — anp closed this 5 years ago
I was thinking of experimenting with this after watching your talk. Without #4, I will have to brain-dump without connection to the actual data. I was thinking:
This sounds great! Here's the data I've been working with so far: https://github.com/anp/lolbench-analysis
There's also a jupyter notebook (python 3) that I've been using to inspect the results.
Thanks, I will look into it! Just from a quick glance: may I strongly recommend pandas over dictionaries for organizing the data.
Yes, you can definitely recommend pandas, but I have been avoiding learning it for longer than I've known Rust :P. At least they're dictionaries of numpy arrays!
EDIT: In all seriousness, yeah I probably should have been using pandas :P.
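For what it's worth, going from a dict of numpy arrays to pandas is a one-liner. A minimal sketch — the column names and values below are made up, since the real layout of `allresults-Mean.json` isn't shown here:

```python
import numpy as np
import pandas as pd

# Hypothetical dict-of-arrays layout; the actual keys in
# allresults-Mean.json may differ.
results = {
    "nanoseconds": np.array([105.0, 103.2, 104.8, 141.9]),
    "ref-cpu-cycles": np.array([412.0, 409.5, 410.7, 560.3]),
}

# A dict of equal-length arrays becomes a DataFrame directly,
# after which filtering and rolling windows are one-liners too.
df = pd.DataFrame(results)
print(df["nanoseconds"].rolling(window=2).mean())
```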
I could not get your notebook to work locally, so I am working from `allresults-Mean.json`. I started by experimenting with a KDE approach. The work so far is in this gist. Most of the information is in the graphs at the bottom. The `prob` row is a measure of how likely each datapoint is given the previous 20 points; lower numbers are more anomalous. The graphs are sorted from the benchmark with the most anomalous single point to the least.
So far it seems to find anomalies really well. Too well — most of the time they are blips. It also adjusts really fast to new realities: by the second day we see a new behavior, it is no longer remarkable.
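The idea above — score each point by its likelihood under a KDE fit to a trailing window — could be sketched roughly like this (my reconstruction, not the gist's exact code; the window size and synthetic data are placeholders):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_probs(values, window=20):
    """For each point, estimate how likely it is under a KDE fit to
    the previous `window` points. Lower prob => more anomalous."""
    probs = np.full(len(values), np.nan)  # no score until a full window exists
    for i in range(window, len(values)):
        kde = gaussian_kde(values[i - window:i])
        probs[i] = kde.evaluate([values[i]])[0]
    return probs

rng = np.random.default_rng(0)
series = rng.normal(100.0, 1.0, 60)
series[45] = 130.0  # inject a one-day blip
probs = kde_probs(series)
print(np.nanargmin(probs))  # index of the most anomalous point
```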
Exciting, those are looking super cool! If the blips it's finding are small enough, do you think that sorting by probability would effectively hide the red herrings?
New version, gist. The thought is that if we are in a new behavior then skipping a day won't matter, whereas if we are in a blip it will. So `prob` is now the more likely of this or the next datapoint given the previous 10 points.
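The skip-a-day variant only changes the scoring step: evaluate both today's and tomorrow's point against the trailing window and keep the larger likelihood. Again a sketch under my own assumptions, not the gist verbatim:

```python
import numpy as np
from scipy.stats import gaussian_kde

def prob_with_skip(values, window=10):
    """`prob` for point i is the MORE likely of point i or point i+1
    under a KDE fit to the previous `window` points. A genuine shift
    stays unlikely the next day; a one-day blip gets rescued by the
    normal value that follows it."""
    probs = np.full(len(values), np.nan)
    for i in range(window, len(values) - 1):
        kde = gaussian_kde(values[i - window:i])
        p_today, p_next = kde.evaluate([values[i], values[i + 1]])
        probs[i] = max(p_today, p_next)
    return probs

rng = np.random.default_rng(1)
series = rng.normal(100.0, 1.0, 40)
series[20] = 130.0  # one-day blip: the next day is back to normal
probs = prob_with_skip(series)
# probs[20] stays reasonably high because day 21 looks ordinary.
```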
To work through the examples from the graphs:
- `nanoseconds` is completely out of the range seen before, and it is not just one day. `ref-cpu-cycles`.
- Day 31 `nanoseconds` seems to have changed.
- `nanoseconds` and `ref-cpu-cycles` are odd for those 3 days.
- `nanoseconds` for the last two days.
- `ref-cpu-cycles` and high `nanoseconds` is new.
- `nanoseconds` because it uses 10 days as training data.

I've implemented something based on the gist you shared above, and it should be live at https://blog.anp.lol/lolbench-data soon!
There's definitely more work to do, but for a first pass it's producing some pretty good results!
The problem: we have a lot of benchmark functions, with more to come. Until we've been running for a while, we don't want to wantonly disable benchmarks. So we need to find a way to surface individual benchmarks, in human-readable numbers, when we detect performance regressions.
My current thinking is that we shouldn't think of the regression detection as a time series analysis (which is what I'd tried in 2016 and didn't get good SNR). For one thing I don't think we should expect seasonality or other kinds of cyclic signal, despite the fact that this kind of data is most obviously graphed as a time series.
So for right now, I'm thinking that I'm going to try doing some clustering analysis as a slightly braindead anomaly detection. We don't really care about time here, mostly care about surfacing nightlies when they deviate from the previously-observed-best-possible-performance.
Questions:

- k-means requires choosing a fixed `k`, while we might in practice find different numbers of clusters for different benchmarks. Ideally we'd have an unsupervised clustering mechanism.

If we can reliably cluster the data points, then I think we can define:
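A first pass at clustering-as-anomaly-detection could look like the sketch below: cluster the per-benchmark timings with no notion of time at all, then flag nightlies that don't fall in the fastest cluster. The data is synthetic and `k=2` is a stand-in for whatever an unsupervised method would pick, which is exactly the open question:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy data: a "best-observed performance" cluster plus a few
# regressed nightlies. Real data would come from lolbench results.
rng = np.random.default_rng(2)
timings = np.concatenate([
    rng.normal(100.0, 1.0, 30),   # previously-observed-best performance
    rng.normal(140.0, 1.0, 5),    # regressed nightlies
])

# kmeans2 with k-means++ init; labels assign each point to a cluster.
centroids, labels = kmeans2(timings.reshape(-1, 1), 2, minit="++", seed=2)
best = int(np.argmin(centroids))          # cluster with the lowest mean timing
regressed = np.flatnonzero(labels != best)  # nightlies outside the best cluster
print(regressed)
```

The appeal is that a point's "anomalousness" is its distance from the best cluster rather than its deviation from yesterday, which matches caring about regressions more than about trends.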
Some links for consideration:
https://en.wikipedia.org/wiki/K-means_clustering
https://en.wikipedia.org/wiki/Cluster_analysis
https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization
https://en.wikipedia.org/wiki/Kernel_density_estimation