Open expigo opened 2 years ago
The other approach I'm gonna finish soon is based on estimating the kernel density of the distribution of reconstruction differences and then taking its nth quantile as the threshold. I will put together a quick summary of the performance of both.
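A minimal sketch of that idea, assuming scikit-learn's `KernelDensity` and a synthetic error vector in place of the real RNN reconstruction differences (the bandwidth and quantile values here are placeholders, not the ones from my experiments):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Stand-in for the per-timestamp reconstruction errors |truth - prediction|.
rng = np.random.default_rng(0)
errors = np.abs(rng.normal(0.0, 1.0, size=1000))

# Fit a 1-D Gaussian KDE on the error distribution (bandwidth is an assumption).
kde = KernelDensity(kernel="gaussian", bandwidth=0.1).fit(errors.reshape(-1, 1))

# Draw samples from the fitted density and take a high quantile as the threshold.
samples = kde.sample(100_000, random_state=0).ravel()
threshold = np.quantile(samples, 0.999)

# Timestamps whose error exceeds the threshold are flagged as anomalies.
anomalies = np.flatnonzero(errors > threshold)
```

Sampling from the fitted KDE is just one way to get its quantile; one could also integrate the density numerically, but sampling keeps the sketch short.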
After the first tests I realised that this idea of using KDE to obtain the threshold for outlier detection can be used even without the RNN time-series reconstruction. I'm gonna try this approach for reference purposes and will make some kind of comparison summary soon.
In the meantime I ran some tests on RNN+KDE (with the bandwidth optimized through grid-search CV), and the results are somewhat promising (neglecting the absurd quantile used):
It's the 11th dataset from hexagon with GT: [11800, 12100] (where every index from that closed range +/- 100 is treated as a correct answer, although the best would be the middle, 11950; this comes from the competition rules). Obviously, a similar result can be obtained by just using the maximum absolute difference as the threshold:
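For the record, the bandwidth optimization I mentioned can be sketched with scikit-learn's `GridSearchCV`, which scores candidate bandwidths by cross-validated log-likelihood (the grid range and fold count below are assumptions, not my exact settings):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Stand-in reconstruction errors; in practice these come from the RNN.
rng = np.random.default_rng(42)
errors = np.abs(rng.normal(0.0, 1.0, size=500)).reshape(-1, 1)

# Cross-validated grid search over the KDE bandwidth.
grid = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.logspace(-2, 0, 20)},
    cv=5,
)
grid.fit(errors)

best_kde = grid.best_estimator_  # KDE refit on all data with the best bandwidth
best_bandwidth = grid.best_params_["bandwidth"]
```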
One more thing popped into my head: I'm gonna try using the whole dataset (with the small anomaly range included) for training, and then settle on the threshold. I wonder whether that makes any sense and what the result will be.
Dataset plot:
Model parameters:
In the simplest scenario, the RNN is used to reconstruct the training dataset, and then the absolute difference between the ground truth and the prediction is calculated for every timestamp. At the moment, the biggest of these differences is treated as the threshold value for marking out the anomalies.
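That baseline is simple enough to sketch in a few lines (NumPy only; the function names here are mine, purely illustrative):

```python
import numpy as np

def max_error_threshold(y_true, y_pred):
    """Per-timestamp absolute errors and the max-error threshold from training data."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return errors, errors.max()

def flag_anomalies(y_true, y_pred, threshold):
    """Indices whose reconstruction error exceeds the training-time threshold."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.flatnonzero(errors > threshold)
```

With this scheme no training timestamp can ever be flagged (its error is at most the maximum by construction), which is exactly why moving to a KDE quantile threshold is appealing.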