Marcnuth / AnomalyDetection

Twitter's Anomaly Detection in Pure Python
Apache License 2.0

median absolute deviation or mean absolute deviation #8

Open ColaBH opened 5 years ago

ColaBH commented 5 years ago

The original algorithm uses the median absolute deviation in place of the standard deviation, which makes it more robust to anomalous points.

This code, however, uses pandas' mad() method (called as data.mad()), which computes the mean absolute deviation, not the median absolute deviation. Both can work, but in my opinion the median absolute deviation is the better choice.
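To make the distinction concrete, here is a minimal sketch of the two statistics using NumPy only. The helper names `mean_abs_dev` and `median_abs_dev` are hypothetical and not part of this repository. It avoids calling `Series.mad()` directly because that method was deprecated in pandas 1.5 and removed in 2.0.

```python
import numpy as np

def mean_abs_dev(x):
    """Mean absolute deviation around the mean (what pandas' old mad() computed)."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]                      # skip NaNs, as pandas does by default
    return np.mean(np.abs(x - np.mean(x)))

def median_abs_dev(x):
    """Median absolute deviation around the median (unscaled)."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    return np.median(np.abs(x - np.median(x)))

sample = [1.0, 2.0, 3.0, 4.0, 100.0]         # one extreme value
print(mean_abs_dev(sample))                  # 31.2
print(median_abs_dev(sample))                # 1.0
```

With a single extreme value in five points, the mean-based figure is dominated by that point, while the median-based figure is not.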

hokiegeek2 commented 5 years ago

@ColaBH Interesting, have you tested both versions to see which is better? @Marcnuth depending upon how testing goes, should this be configurable (median/mean absolute deviation)?

kstseng commented 5 years ago

I think which one is better depends on what your data look like. In my data there is no big difference, because my time series has no extremely large or small values, so the median absolute deviation and the mean absolute deviation come out close to each other. But if your data may contain extremely large or small values, I think the median absolute deviation is more robust.

If you want to try the median absolute deviation, you can make the following modification. The original version (https://github.com/Marcnuth/AnomalyDetection/blob/master/anomaly_detection/anomaly_detect_ts.py#L560):
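A rough way to check this on your own data is to compare how much each statistic moves when a few extreme values are present. The sketch below uses a synthetic sine series and five injected spikes. Both the series and the helper names are made up for illustration and do not come from the repository.

```python
import numpy as np

def mean_abs_dev(x):
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x - np.mean(x)))

def median_abs_dev(x):
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

# Synthetic series: two periods of a sine wave, no extreme values.
calm = np.sin(np.linspace(0.0, 4.0 * np.pi, 200))

# Same series with five hypothetical spikes injected.
spiky = calm.copy()
spiky[[20, 60, 100, 140, 180]] = 50.0

# Ratio of "with spikes" to "without spikes" for each statistic.
mean_ratio = mean_abs_dev(spiky) / mean_abs_dev(calm)
median_ratio = median_abs_dev(spiky) / median_abs_dev(calm)

print(f"mean absolute deviation grows by a factor of {mean_ratio:.2f}")
print(f"median absolute deviation grows by a factor of {median_ratio:.2f}")
```

On this toy series the mean-based ratio should come out well above 1 while the median-based ratio stays close to 1, which matches the intuition that the choice matters mostly when extreme values are present.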

ares = ares / data.mad()

I changed it to use the median absolute deviation:

from statsmodels import robust
ares = ares / robust.mad(data.dropna())
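One detail to keep in mind with this replacement: by default, statsmodels' `robust.mad` divides the raw median absolute deviation by a constant c ≈ 0.6745 (the 0.75 quantile of the standard normal), so that the result is comparable to a standard deviation for normally distributed data. If you would rather not add the statsmodels dependency, the following NumPy-only sketch should compute the same quantity for one-dimensional data. The name `scaled_mad` is hypothetical, and `data` and `ares` in the usage comment refer to the variables in the snippet above.

```python
import numpy as np

# Default normalization constant: the 0.75 quantile of the standard normal.
NORMAL_C = 0.6744897501960817

def scaled_mad(values, c=NORMAL_C):
    """Median absolute deviation around the median, divided by c.

    NaNs are dropped first, mirroring the data.dropna() call above.
    Pass c=1.0 to get the raw, unscaled median absolute deviation.
    """
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]
    return np.median(np.abs(arr - np.median(arr))) / c

# Usage sketch, assuming data is a one-dimensional pandas Series:
#     ares = ares / scaled_mad(data)
example = [1.0, 2.0, 3.0, 4.0, 100.0, float("nan")]
print(scaled_mad(example))          # raw MAD of 1.0 divided by c
print(scaled_mad(example, c=1.0))   # raw MAD
```

Because of the scaling constant, the normalized residuals are smaller by a factor of about 1.4826 than with the raw median absolute deviation, which may matter if you compare them against a fixed threshold.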

Hope it helps.