Closed — anp closed this 5 years ago
I was thinking of experimenting with this after watching your talk. Without #4, I will have to brain-dump without connection to the actual data. I was thinking:
This sounds great! Here's the data I've been working with so far: https://github.com/anp/lolbench-analysis
There's also a jupyter notebook (python 3) that I've been using to inspect the results.
Thanks, I will look into it! Just from a quick glance: may I strongly recommend pandas over dictionaries for organizing the data.
Yes, you can definitely recommend pandas, but I have been avoiding learning it for longer than I've known Rust :P. At least they're dictionaries of numpy arrays!
EDIT: In all seriousness, yeah I probably should have been using pandas :P.
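For what it's worth, going from a dict of numpy arrays to pandas is a one-liner. A minimal sketch — the column names and values below are made up, since the real layout of `allresults-Mean.json` isn't shown here:

```python
import numpy as np
import pandas as pd

# Hypothetical dict-of-arrays layout; the actual keys in
# allresults-Mean.json may differ.
results = {
    "nanoseconds": np.array([105.0, 103.2, 104.8, 141.9]),
    "ref-cpu-cycles": np.array([412.0, 409.5, 410.7, 560.3]),
}

# A dict of equal-length arrays becomes a DataFrame directly,
# after which filtering and rolling windows are one-liners too.
df = pd.DataFrame(results)
print(df["nanoseconds"].rolling(window=2).mean())
```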
I could not get your notebook to work locally, so I am working from `allresults-Mean.json`. I started by experimenting with a KDE approach. The work so far is in this gist. Most of the information is in the graphs at the bottom. The `prob` row is a measure of how likely each datapoint is given the previous 20 points; lower numbers are more anomalous. The graphs are sorted from the benchmark with the most anomalous single point to the least.
So far it seems to find anomalies really well. Too well — most of the time they are blips. It also adjusts really fast to new realities: by the second day we see a new behavior, it is no longer remarkable.
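The idea above — score each point by its likelihood under a KDE fit to a trailing window — could be sketched roughly like this (my reconstruction, not the gist's exact code; the window size and synthetic data are placeholders):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_probs(values, window=20):
    """For each point, estimate how likely it is under a KDE fit to
    the previous `window` points. Lower prob => more anomalous."""
    probs = np.full(len(values), np.nan)  # no score until a full window exists
    for i in range(window, len(values)):
        kde = gaussian_kde(values[i - window:i])
        probs[i] = kde.evaluate([values[i]])[0]
    return probs

rng = np.random.default_rng(0)
series = rng.normal(100.0, 1.0, 60)
series[45] = 130.0  # inject a one-day blip
probs = kde_probs(series)
print(np.nanargmin(probs))  # index of the most anomalous point
```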
Exciting, those are looking super cool! If the blips it's finding are small enough, do you think that sorting by probability would effectively hide the red herrings?
New version, gist. The thought is that if we are in a new behavior then skipping a day won't matter, whereas if we are in a blip it will. So `prob` is now the more likely of this or the next datapoint given the previous 10 points.
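The skip-a-day variant only changes the scoring step: evaluate both today's and tomorrow's point against the trailing window and keep the larger likelihood. Again a sketch under my own assumptions, not the gist verbatim:

```python
import numpy as np
from scipy.stats import gaussian_kde

def prob_with_skip(values, window=10):
    """`prob` for point i is the MORE likely of point i or point i+1
    under a KDE fit to the previous `window` points. A genuine shift
    stays unlikely the next day; a one-day blip gets rescued by the
    normal value that follows it."""
    probs = np.full(len(values), np.nan)
    for i in range(window, len(values) - 1):
        kde = gaussian_kde(values[i - window:i])
        p_today, p_next = kde.evaluate([values[i], values[i + 1]])
        probs[i] = max(p_today, p_next)
    return probs

rng = np.random.default_rng(1)
series = rng.normal(100.0, 1.0, 40)
series[20] = 130.0  # one-day blip: the next day is back to normal
probs = prob_with_skip(series)
# probs[20] stays reasonably high because day 21 looks ordinary.
```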
To work through the examples from the graphs:
- `nanoseconds` is completely out of the range seen before, and it is not just one day. `ref-cpu-cycles`.
- Day 31 `nanoseconds` seems to have changed.
- `nanoseconds` and `ref-cpu-cycles` are odd for those 3 days.
- `nanoseconds` for the last two days.
- `ref-cpu-cycles` and high `nanoseconds` is new.
- `nanoseconds` because it uses 10 days as training data.

I've implemented something based on the gist you shared above, and it should be live at https://blog.anp.lol/lolbench-data soon!
There's definitely more work to do, but for a first pass it's producing some pretty good results!
The problem: we have a lot of benchmark functions, with more to come. Until we've been running for a while, we don't want to wantonly disable benchmarks. So we need to find a way to surface individual benchmarks, in human-readable numbers, when we detect performance regressions.
My current thinking is that we shouldn't think of the regression detection as a time series analysis (which is what I'd tried in 2016 and didn't get good SNR). For one thing I don't think we should expect seasonality or other kinds of cyclic signal, despite the fact that this kind of data is most obviously graphed as a time series.
So for right now, I'm thinking that I'm going to try doing some clustering analysis as a slightly braindead anomaly detection. We don't really care about time here, mostly care about surfacing nightlies when they deviate from the previously-observed-best-possible-performance.
Questions:

- k-means requires choosing a fixed `k`, while we might in practice find different numbers of clusters for different benchmarks. Ideally we'd have an unsupervised clustering mechanism.

If we can reliably cluster the data points, then I think we can define:
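A first pass at clustering-as-anomaly-detection could look like the sketch below: cluster the per-benchmark timings with no notion of time at all, then flag nightlies that don't fall in the fastest cluster. The data is synthetic and `k=2` is a stand-in for whatever an unsupervised method would pick, which is exactly the open question:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy data: a "best-observed performance" cluster plus a few
# regressed nightlies. Real data would come from lolbench results.
rng = np.random.default_rng(2)
timings = np.concatenate([
    rng.normal(100.0, 1.0, 30),   # previously-observed-best performance
    rng.normal(140.0, 1.0, 5),    # regressed nightlies
])

# kmeans2 with k-means++ init; labels assign each point to a cluster.
centroids, labels = kmeans2(timings.reshape(-1, 1), 2, minit="++", seed=2)
best = int(np.argmin(centroids))          # cluster with the lowest mean timing
regressed = np.flatnonzero(labels != best)  # nightlies outside the best cluster
print(regressed)
```

The appeal is that a point's "anomalousness" is its distance from the best cluster rather than its deviation from yesterday, which matches caring about regressions more than about trends.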
Some links for consideration:
https://en.wikipedia.org/wiki/K-means_clustering
https://en.wikipedia.org/wiki/Cluster_analysis
https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization
https://en.wikipedia.org/wiki/Kernel_density_estimation