business-science / anomalize

Tidy anomaly detection
https://business-science.github.io/anomalize/

Anomalies detected within bounds #31

Open jh9690 opened 5 years ago

I'm running anomalize on large datasets and occasionally see cases where anomalize() flags a point as an anomaly even though its remainder lies within the remainder_l1 and remainder_l2 bounds. In theory this should not be possible, so unless I'm misreading the output, I can't explain the result. In the code below, the gesd method flags rows 12 and 16 as anomalies even though their remainders are greater than the lower bound.

    library(tibbletime)
    library(anomalize)

    # Create data frame
    df <- data.frame(
      date = c("2003-01-01", "2004-01-01", "2005-01-01", "2006-01-01",
               "2007-01-01", "2008-01-01", "2009-01-01", "2010-01-01",
               "2011-01-01", "2012-01-01", "2013-01-01", "2014-01-01",
               "2015-01-01", "2016-01-01", "2017-01-01", "2018-01-01"),
      val = c(13.54941, 13.57737, 13.61070, 13.62143,
              13.64319, 13.64563, 13.66624, 13.68140,
              13.69086, 13.70454, 13.70949, 13.73307,
              13.77554, 13.81119, 13.83046, 13.83948)
    )

    df$date <- as.Date(df$date)

    # Convert to tibbletime object
    df_tbl <- as_tbl_time(df, index = date)

    # Run anomalize
    results <- df_tbl %>%
      time_decompose(val, frequency = "auto", trend = "auto", method = "stl") %>%
      anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
      time_recompose()
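
To make the check concrete, here is a minimal sketch of how one might isolate rows that are flagged as anomalies while their remainder sits inside the bounds. It assumes the output carries the remainder, remainder_l1, remainder_l2 and anomaly columns, with anomaly coded as "Yes"/"No". The data frame `toy` is a made-up stand-in for the real `results`, so its numbers are illustrative only and are not the values from this issue.

```r
# Hypothetical stand-in for the anomalize output; all values are made up.
toy <- data.frame(
  remainder    = c(-0.002,  0.010, -0.015,  0.001),
  remainder_l1 = c(-0.020, -0.020, -0.020, -0.020),
  remainder_l2 = c( 0.020,  0.020,  0.020,  0.020),
  anomaly      = c("No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# TRUE where the remainder lies inside [remainder_l1, remainder_l2]
within_bounds <- toy$remainder >= toy$remainder_l1 &
  toy$remainder <= toy$remainder_l2

# Rows flagged as anomalies even though the remainder is within bounds
inconsistent <- toy[toy$anomaly == "Yes" & within_bounds, ]
print(inconsistent)
```

Swapping `toy` for `results` would apply the same filter to the real output. If the behaviour described above holds, the flagged rows should show up in `inconsistent`.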