davydovanton / sidekiq-statistic

See statistic about your workers
MIT License

Outlier detection? #49

Open hexgnu opened 9 years ago

hexgnu commented 9 years ago

I like this project; it seems like a cool idea!

Just wondering, would y'all be interested in an outlier detection pull request? Not sure whether that's outside the scope of this project.

My thinking would be to add the median and the mean absolute deviation (MAD) to the time statistics hash, then use the industry-standard threshold of 3 for a simple and fairly robust outlier detection model. Graphically, I would just annotate the outliers. It would be cool to somehow tie this to alerting, but I feel that might be way out of scope.
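A minimal sketch of what this proposal could look like in Ruby. It assumes MAD is taken as the median absolute deviation, the usual robust pairing with the median; the paragraph above says mean absolute deviation, which would replace the inner median with an average. The method names and the sample runtimes are hypothetical and are not part of sidekiq-statistic.

```ruby
# Hypothetical helpers; none of these names come from sidekiq-statistic.

# Median of a non-empty array of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Median absolute deviation: the median distance of each value from the median.
def median_absolute_deviation(values)
  center = median(values)
  median(values.map { |v| (v - center).abs })
end

# Returns the runtimes that lie more than `threshold` MADs from the median.
def outliers(runtimes, threshold: 3.0)
  return [] if runtimes.empty?

  center = median(runtimes)
  mad = median_absolute_deviation(runtimes)
  # At least half of the values equal the median, so there is no spread to scale by.
  return [] if mad.zero?

  runtimes.select { |t| (t - center).abs / mad > threshold }
end

runtimes = [0.21, 0.19, 0.22, 0.20, 0.23, 0.18, 4.75] # seconds, made-up sample
p outliers(runtimes) # prints [4.75]
```

The zero-MAD guard is a design choice for this sketch: when most jobs take exactly the same time, any deviation would otherwise be flagged, so it returns nothing instead.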

Thoughts / concerns?

Cheers :sparkles:

davydovanton commented 9 years ago

Hello @hexgnu! I think it's an interesting idea, but I'm not sure that this information will be useful to anyone. What do you think, @mperham?

hexgnu commented 9 years ago

I know of a few people wanting anomaly detection, but it can be a slippery slope too. :wink:

Also, one more idea would be to add stdev to the statistics and then graph what are called Bollinger Bands (https://en.wikipedia.org/wiki/Bollinger_Bands). Basically, that lets someone see the upper and lower bounds of what is 'normal'.

Of course, you could always use the stdev as a measure of volatility too. Max / min usually works quite well, but it can also skew results and overemphasize volatility.

Again, these are just ideas; either way, I'm not married to them :smile:
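A rough sketch of Bollinger-style bands over job runtimes, assuming a rolling window of recent runtimes and bands placed `k` population standard deviations from the rolling mean. The method name, the default window of 20, and the default `k` of 2 are illustrative assumptions, not anything defined by sidekiq-statistic.

```ruby
# Hypothetical helper; not part of sidekiq-statistic.
# For each full window of `window` consecutive runtimes, returns the rolling
# mean plus the lower and upper bands at `k` population standard deviations.
# Returns an empty array when there are fewer runtimes than `window`.
def bollinger_bands(runtimes, window: 20, k: 2.0)
  runtimes.each_cons(window).map do |slice|
    mean = slice.sum / slice.size.to_f
    variance = slice.sum { |t| (t - mean)**2 } / slice.size
    stdev = Math.sqrt(variance)
    { mean: mean, lower: mean - k * stdev, upper: mean + k * stdev }
  end
end

# Made-up runtimes in seconds; a small window keeps the example readable.
bands = bollinger_bands([1.0, 2.0, 3.0, 4.0, 5.0], window: 3)
bands.each { |band| puts band }
```

A runtime above `upper` or below `lower` for its window would be the candidate to annotate on the graph.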

mperham commented 9 years ago

This is something I've always wanted, which is why I asked @davydovanton to save each job runtime in a Redis list, but there are a few gotchas:

  1. This feature isn't for finding performance problems. We don't want to build a bad Skylight or NewRelic, but it can be used as a starting point for comparing performance over a week or a month to find possible regressions.
  2. The number is very coarse, so I'm not clear how useful it will be in many cases. Imagine a job that conditionally calls a third-party service. The job could take either 5 ms or 5 sec. The resulting band won't be very useful.

So... all that said, I'd love to see it implemented and see how useful it proves to be in the wild.

hexgnu commented 9 years ago

Yeah, my main fear is that I don't want to try to recreate Skylight or NewRelic either. They do what they do well.

As for the second concern: assume a worker really is that volatile and randomly takes either 5 ms or 5000 ms. Over time that leads to a stdev of 2497.5 ms, meaning the bands would be 2502.5 (mean) +/- 3 * 2497.5. After it has happened a few times it'd self-heal, so to speak, because both runtimes fall inside the bands. 3 is a pretty standard threshold.
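The arithmetic above can be checked with a few lines of Ruby. This is only a sketch that assumes an even 50/50 split between the two runtimes and uses the population standard deviation; the method name is hypothetical.

```ruby
# Hypothetical helper; computes the mean, the population stdev, and the
# mean +/- 3 * stdev band for a list of runtimes in milliseconds.
def three_sigma_band(runtimes)
  mean = runtimes.sum / runtimes.size.to_f
  variance = runtimes.sum { |t| (t - mean)**2 } / runtimes.size
  stdev = Math.sqrt(variance)
  { mean: mean, stdev: stdev, lower: mean - 3 * stdev, upper: mean + 3 * stdev }
end

# Assumed workload: half the jobs take 5 ms, half take 5000 ms.
band = three_sigma_band([5.0, 5000.0] * 50)
puts band
# mean 2502.5, stdev 2497.5, lower -4990.0, upper 9995.0
```

Both 5 ms and 5000 ms sit inside the resulting band of -4990 ms to 9995 ms, so neither runtime would be flagged.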

I like what you say about detecting regressions after a deploy. That's probably what it would pick up mostly.

I'll work on a PR.

davydovanton commented 9 years ago

Wow, @hexgnu, that's very interesting to me. I'll wait for the PR :smiley: