aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

releasing RCF 3.0 #304

Closed sudiptoguha closed 1 year ago

sudiptoguha commented 2 years ago

We are planning to release RCF-3.0-rc1 shortly. Almost all of the improvements described below are already available in main. The new version improves over RCF 2.0.1 in the following dimensions:

A key issue in using RCFs is the shingling parameter. For example, with a shingle size of 8, a single aberrant value appears in 8 consecutive shingled points, so multiple points are likely to be flagged as anomalies. Previously, library users had to rely on downstream post-processing to pinpoint the precise moment of the aberration. However, the continuous updates of a stream make this and similar corrections onerous and repetitive across different use cases. ThresholdedRandomCutForest (TRCF), in the randomcutforest-parkservice module, provides a single function call that evaluates the score, performs thresholding, and, for an anomalous point, outputs a potential expected value using the imputation function. The built-in thresholding now uses a streaming Kalman-filter-type operation: after an anomaly has been detected, it predicts the score based on the remainder of the shingle, and can thereby often correct for the repeated anomaly.
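To make the shingling effect concrete, here is a minimal, library-independent sketch in Python. It does not use the RCF or TRCF API. The `shingle` helper, the toy stream, and the constant names are illustrative assumptions. It only shows that one aberrant value lands in 8 consecutive shingles when the shingle size is 8.

```python
# Illustration only: plain sliding-window shingling, not the library's implementation.

def shingle(values, shingle_size):
    """Return every contiguous window (shingle) of length shingle_size."""
    return [
        tuple(values[i:i + shingle_size])
        for i in range(len(values) - shingle_size + 1)
    ]

SHINGLE_SIZE = 8        # shingle size used in the example above
ABERRANT_INDEX = 15     # hypothetical position of the single aberrant value
ABERRANT_VALUE = 100.0  # hypothetical aberrant reading

# A flat toy stream with exactly one aberrant value.
stream = [1.0] * 30
stream[ABERRANT_INDEX] = ABERRANT_VALUE

shingles = shingle(stream, SHINGLE_SIZE)

# Indices of the shingles that contain the aberrant value.
affected = [i for i, s in enumerate(shingles) if ABERRANT_VALUE in s]

print(len(affected), affected)
```

Every shingle whose window covers the aberrant index contains the bad value, which is why a forest scoring shingled points tends to flag a run of points rather than a single moment.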

We have kicked off a new standalone Rust implementation based on the same algorithm and data structures. The Rust version is not yet complete but may provide some additional performance. Going forward, we expect the independent implementations in these two languages to remain in sync with the underlying algorithm and data structures.