aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

Clarification regarding Shingle size, number of samples per tree and threshold #365

Open prathamk opened 1 year ago

prathamk commented 1 year ago

Hello,

We are trying to understand the implementation of RRCF and have a few questions about tuning some of its hyper-parameters. Kindly help us better understand the effects of the following hyper-parameters.

Shingle Size: With the shingle size we generally try to capture the shape of the curve that is expected to repeat periodically. For example, if we have data every 15 min. with the pattern repeating every 24 hrs., then the shingle size should be (60/15) × 24 = 96. And if we have data every 5 min. with the pattern repeating every 24 hrs., then the shingle size should be (60/5) × 24 = 288.
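The arithmetic above can be sketched as a small helper (this is my own illustration, not part of the library; the method name is mine):

```java
// Sketch: choosing a shingle size that spans one full period of the
// repeating pattern, i.e. points-per-hour times hours-per-period.
public class ShingleSizeSketch {
    static int shingleSize(int sampleIntervalMinutes, int periodHours) {
        return (60 / sampleIntervalMinutes) * periodHours;
    }

    public static void main(String[] args) {
        System.out.println(shingleSize(15, 24)); // 96
        System.out.println(shingleSize(5, 24));  // 288
    }
}
```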

Is this the right way to decide on the shingle size? Increasing the shingle size introduces delayed detection, and some short-duration peaks and dips are not detected with the default settings for threshold and anomaly rate. It seems we also need to decrease the lower threshold as we increase the shingle size. Is this assumption correct?

In the snapshot below (Plot 2), data is collected every 5 min. with shingle size = 288, numberOfTrees = 100, sampleSize = 1024. With a shingle size of 288 (capturing the shape of the curve), the forest seems to miss a lot of anomalies: the score increases but doesn't cross the lower threshold.

In the snapshot below (Plot 3), data is collected every 5 min. with shingle size = 16, numberOfTrees = 100, sampleSize = 1024. A shingle size of 16, chosen at random (it doesn't capture the shape of the entire curve), seems to capture the anomalies better, and the delay in identifying anomalies is smaller.

Also, keeping a high value for the shingle size increases the dimensionality of each point. Is there any limitation on the size of the shingle?
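To make the dimensionality point concrete, here is a minimal sliding shingle buffer for a one-dimensional stream (a sketch of my own, not the library's shingling code; the class and method names are mine). With shingle size 288, each point handed to the forest is 288-dimensional:

```java
import java.util.ArrayDeque;

// Sketch: a rolling window that turns a scalar stream into shingled
// points of dimension = shingleSize.
public class Shingler {
    private final int shingleSize;
    private final ArrayDeque<Double> window = new ArrayDeque<>();

    Shingler(int shingleSize) {
        this.shingleSize = shingleSize;
    }

    // Returns the current shingle once the window is full, else null.
    double[] accept(double value) {
        window.addLast(value);
        if (window.size() > shingleSize) {
            window.removeFirst();
        }
        if (window.size() < shingleSize) {
            return null;
        }
        return window.stream().mapToDouble(Double::doubleValue).toArray();
    }
}
```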

Number of samples: Is there any recommendation on how many samples one should keep in each tree of the forest? Currently we are choosing the number of samples per tree at random.

As I understand it, increasing the sample size gives each tree a larger view of the data, but it also increases the running time and the size of the forest's state.

Lower Threshold: The lower threshold is initialized to 1.1. How was this value chosen, and is it recommended to modify it? What factors does one need to consider when modifying the lower threshold?

sudiptoguha commented 1 year ago

Thanks for the questions. The best way of thinking about shingle size is to ask "how many data values of the immediate past (the trajectory) would you need to determine the behavior of the current moment" -- think of the order of a Markov chain required to describe your data. Note that, information-theoretically, for a large enough order, small variations (say 10%) should matter less. The shingle size can be approximate, but it needs to be in the correct ballpark.

What your pictures show, as you have found out, is that 16 (randomly chosen) works. If you were to try values in the range 12-20, the results would likely be similar. This is a robustness feature of RCF.

The number of samples is actually interesting -- with 5-minute intervals and 1024 samples, you are storing roughly the last 4 days of data (there is recency-biased sampling involved, but that is the rough ballpark). If you were to increase that to something like 2000, then you should be able to see weekly effects (and lower weekend values may be recognized as non-anomalous). SampleSize does increase model size (and also increases running time, but not by much).
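The back-of-the-envelope window arithmetic above can be sketched as follows (my own illustration, ignoring the recency-biased sampling mentioned; the names are mine):

```java
// Sketch: approximate time window (in days) covered by one tree's
// sampler, assuming uniform retention of sampleSize points arriving
// every sampleIntervalMinutes.
public class SamplerWindow {
    static double windowDays(int sampleSize, int sampleIntervalMinutes) {
        return sampleSize * sampleIntervalMinutes / (60.0 * 24.0);
    }

    public static void main(String[] args) {
        System.out.println(windowDays(1024, 5)); // roughly 3.6 days
        System.out.println(windowDays(2000, 5)); // roughly 6.9 days, enough for weekly effects
    }
}
```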

If you do increase shingle size then the scores will tend to drop -- and the calibration of numbers, such as 1.1, should be lowered as you found out. It is recommended that you change it to reflect your data.

prathamk commented 1 year ago

Thanks @sudiptoguha for your reply.

prathamk commented 1 year ago

Hi @sudiptoguha,

Continuing on previous query regarding shingle size.

In the snapshot below, data is collected every 15 min. with shingle size = 16, numberOfTrees = 100, sampleSize = 1500 (just over two weeks of data). The data is scaled raw data, with a trend repeating every 24 hours. I have scaled the data in the plots below by a further factor of 100 for better visualization.

There are a lot of peaks in the RCF score plot, although it fails to peak for an anomaly between Feb 10th and 11th (Plot 5).

Continuing with the same data and different shingle sizes, the rest of the settings unchanged:

Shingle Size: 48. Here it captures the anomaly between Feb 10th and 11th, but with a high relative index, and there are still a lot of peaks in the RCF score plot (Plot 6).

Shingle Size: 96. Here the peaks have been minimized, but the relative index is very high (Plot 7).

Is there a reason for so many peaks in the RCF score plot? Besides shingle size, what factors can we consider to smooth the RCF score plot, i.e., to reduce its sensitivity?

sudiptoguha commented 1 year ago

Let's keep the question detached from specifics of some dataset.

The main question that crystallizes is: "for some settings, an anomaly is detected but the relative index (delay in detection) is high; for other settings the anomaly is not detected at all". That is expected behavior of these hyper-parameters. There may simply not be enough information to detect an anomaly very quickly (low relative index) because the data is ambiguous. If the information is reduced (low shingle size), then the anomalous segment may appear non-anomalous (because similar segments were seen before). If the information is increased, then the ambiguity resolves (but there are delays). It is worth noting that the picture being shown, from which we are judging whether something is or is not an anomaly, corresponds to a very large delay.

Re "so many peaks in the RCF Score plot" --> for smaller shingle sizes, the periodic dips in the picture have a lower relative frequency of occurrence, so their scores are higher; likewise, the scores corresponding to flat shingles drop significantly (because their relative frequency is higher). It is worth noting that comparing the scores of non-anomalies may not be useful -- the algorithm is geared towards amplifying the scores of anomalous points.

You can consider using ThresholdedRandomCutForest directly, which attempts to designate anomalies (and you may already be doing so, given the reference to the relative index field, which is only available there).

prathamk commented 1 year ago

Thanks @sudiptoguha for your reply.

sudiptoguha commented 1 year ago

ThresholdedRandomCutForest (PR 385) would reduce thresholding woes. Closing the issue -- reopen if there are new developments.

sudiptoguha commented 1 year ago

Please check if PR 395 is useful for this discussion. The goal of ThresholdedRandomCutForest was to provide actionable thresholding -- the newer option .setAdjust(true) may alleviate some of the issues here. In a nutshell, it uses the built-in near-neighbor computation to compute a conditional field/forecast.