expigo / ts_anomaly_detection

Choosing the best value for threshold in the simplest method of detecting anomalies #3

Open expigo opened 2 years ago

expigo commented 2 years ago

In the simplest scenario, an RNN is used to reconstruct the training dataset, and the absolute difference between the ground truth and the prediction is calculated for every timestamp. At the moment, the largest of these differences is used as the threshold for flagging anomalies.
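For reference, a minimal sketch of that max-difference threshold, assuming the ground truth and the RNN reconstruction are already available as 1-D arrays. The function names are made up for illustration and are not taken from the repo.

```python
import numpy as np


def max_error_threshold(y_true, y_pred):
    """Largest absolute reconstruction error on the training series."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(errors.max())


def flag_anomalies(y_true, y_pred, threshold):
    """Indices whose absolute reconstruction error is strictly above the threshold."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.flatnonzero(errors > threshold)
```

The threshold would be computed once on the training reconstruction and then passed to `flag_anomalies` for the series being checked. Since it is a maximum, a single badly reconstructed training point sets the bar for everything else.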

expigo commented 2 years ago

The other approach, which I'm gonna finish soon, is based on estimating the kernel density of the distribution of the differences and then taking its nth quantile as the threshold. I will write a quick summary of the performance of both.
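A rough sketch of what the KDE quantile threshold could look like, assuming SciPy's `gaussian_kde` with its default bandwidth (Scott's rule) and a numerically integrated CDF on a grid. The function name, the grid size and the padding are illustrative choices, not the actual implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_quantile_threshold(errors, q=0.99, grid_size=2048):
    """Approximate the q-th quantile of a Gaussian KDE fitted to the errors."""
    errors = np.asarray(errors, dtype=float)
    kde = gaussian_kde(errors)  # default bandwidth: Scott's rule
    # Pad the grid by three kernel standard deviations on each side
    pad = 3.0 * np.sqrt(kde.covariance[0, 0])
    grid = np.linspace(errors.min() - pad, errors.max() + pad, grid_size)
    pdf = kde(grid)
    # Numerical CDF on the grid, normalised so that it ends at 1
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    # Invert the CDF by linear interpolation
    return float(np.interp(q, cdf, grid))
```

Compared with the maximum, the quantile `q` becomes the knob: a value close to 1 behaves much like the max-difference rule, while a lower one flags more points.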

expigo commented 2 years ago

After the first tests I realised that using KDE to obtain the threshold for outlier detection can work even without the RNN time series reconstruction. I'm gonna try this approach for reference purposes and will put together a comparison summary soon.
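One possible reading of "KDE without the RNN", sketched under my own assumptions: fit the density on the raw values and flag the points that fall in the lowest-density tail. The fixed bandwidth, the quantile `q` and the function name are placeholders, and the actual experiment may look different.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def low_density_outliers(values, bandwidth=0.5, q=0.01):
    """Indices of points whose estimated log-density is below the q-th quantile."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(x)
    log_density = kde.score_samples(x)  # log-density at every sample
    cutoff = np.quantile(log_density, q)
    return np.flatnonzero(log_density < cutoff)
```

Note that this variant ignores the time ordering entirely, so it can only catch values that are rare overall, not values that are merely unusual for their position in the series.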

In the meantime I ran some tests on RNN+KDE (with the bandwidth optimized through grid search CV), and the results are somewhat promising (setting aside the absurd quantile used):

[image: rnn_kde]

It's the 11th dataset from hexagon with GT: [11800, 12100] (every index within that closed range +/- 100 is treated as a correct answer, although the best would be the middle, 11950; this comes from the competition rules). Obviously, a similar result can be obtained by just using the maximum absolute difference as the threshold:

[image: rnn_max_loss]
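The bandwidth grid search mentioned for the RNN+KDE run could be set up roughly like this, assuming scikit-learn's `KernelDensity` scored by its held-out log-likelihood inside `GridSearchCV`. The candidate bandwidths, the fold count and both function names are guesses for illustration, not the settings used for the plots above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity


def fit_kde_with_cv(errors, bandwidths=None, cv=5):
    """Pick the KDE bandwidth that maximises the cross-validated log-likelihood."""
    x = np.asarray(errors, dtype=float).reshape(-1, 1)
    if bandwidths is None:
        bandwidths = np.logspace(-2, 0, 20)  # assumed search range
    search = GridSearchCV(
        KernelDensity(kernel="gaussian"),
        {"bandwidth": list(bandwidths)},
        cv=cv,
    )
    search.fit(x)  # uses KernelDensity.score (total log-likelihood) by default
    return search.best_estimator_, search.best_params_["bandwidth"]


def sampled_quantile_threshold(kde, q=0.99, n_samples=10000, seed=0):
    """Estimate the q-th quantile of the fitted density by sampling from it."""
    samples = kde.sample(n_samples, random_state=seed)
    return float(np.quantile(samples, q))
```

Sampling is only one way to read a quantile off the fitted density. It keeps the sketch short at the cost of some Monte Carlo noise in the threshold.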

One more thing popped into my head: I'm gonna try using the whole dataset (with this small anomaly range included) for training, and then settle on the threshold. I wonder if that makes any sense and what the result will be.

Dataset plot: hexagon_11
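For checking predictions against the hexagon_11 ground truth, the competition rule as described above (closed GT range, +/- 100 tolerance, middle of the range as the ideal answer) can be written down as a small helper. The function names are illustrative, and the defaults are just the values quoted for this dataset.

```python
def is_correct(pred_index, gt_start=11800, gt_end=12100, tolerance=100):
    """True if the predicted index lies within the GT range widened by the tolerance."""
    return gt_start - tolerance <= pred_index <= gt_end + tolerance


def distance_to_center(pred_index, gt_start=11800, gt_end=12100):
    """Absolute distance from the predicted index to the middle of the GT range."""
    return abs(pred_index - (gt_start + gt_end) / 2)
```

With the defaults, any index from 11700 to 12200 counts as correct, and `distance_to_center` gives a way to rank two correct answers against each other.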

Model parameters:

[image: model_used]