An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
We are planning to release RCF-3.0-rc1 shortly. Almost all of the improvements discussed in the sequel are available in main currently. The new version improves over RCF 2.0.1 in the following dimensions:
We have updated the internal representation of Compact Random Cut Trees to reduce the memory footprint.
We have optimized the imputation algorithm to improve throughput by copying optimizations from the anomaly scoring algorithm. In particular, the imputation visitor now detects when further node evaluations would not update the in-progress result and short-circuits evaluation.
We have removed the legacy Random Cut Tree implementation from the 1.0 release, simplifying the code base and improving performance.
As a result of the above optimizations, when the bounded box is set to a small value or 0, parallel forest traversals are significantly faster than single-threaded traversals. Users are encouraged to re-evaluate benchmarks based on their parameters.
State classes are used to serialize and deserialize Random Cut Forests within the library. Due to changes in the internal representation of Compact Random Cut Trees, the updated state classes in RCF 3.0 are not backward compatible with with RCF 2.0. We have added a utility class to the library that converts a 2.0 model to a 3.0 model.
A key issue in the consumption of RCFs is the shingling parameter. As an example, for a shingle size 8, a single aberrant value would be in 8 shingled points. As a consequence it is likely that multiple points would get triggered as anomalies. Previously, library users need to rely on downstream post processing to highlight the precise moment of the aberration. However the continuous updates of a stream make this and similar corrections onerous and repetitive across different use cases. ThresholdedRandomCutForest, TRCF, in randomcutforest-parkservice module, provides a single function call which evaluates score, performs thresholding and for an anomalous outputs a potential expected value using the imputation function. The built in thresholding now uses a streaming Kalman-Filter type operation, after an anomaly has been detected, of predicting the score based on the remainder of the shingle and thereby can often correct for the repeated anomaly.
Park Services also introduced
several different common transformations that can be performed in a streaming manner, to transform data on the fly. For example, one can perform Normalized_Difference of the values and then shingle the values for anomaly detection.
an ability to adjoin the time stamp data and use the timestamp value in the anomaly detection.
an ability to impute the missing values on the fly under the assumption that there are not too many missing values — in which case the time series behavior is significantly different.
randomcutforest-examples is augmented to provide a few starter examples of the above thresholding and transformation options.
We have kicked off a new standalone Rust implementation based on the same algorithm and data structures. The Rust version is not yet complete but may provide some additional performance. Going forward, we expect the independent implementations in these two languages to remain in sync with the underlying algorithm and data structures.
We are planning to release RCF-3.0-rc1 shortly. Almost all of the improvements discussed in the sequel are available in main currently. The new version improves over RCF 2.0.1 in the following dimensions:
A key issue in the consumption of RCFs is the shingling parameter. As an example, for a shingle size 8, a single aberrant value would be in 8 shingled points. As a consequence it is likely that multiple points would get triggered as anomalies. Previously, library users need to rely on downstream post processing to highlight the precise moment of the aberration. However the continuous updates of a stream make this and similar corrections onerous and repetitive across different use cases. ThresholdedRandomCutForest, TRCF, in randomcutforest-parkservice module, provides a single function call which evaluates score, performs thresholding and for an anomalous outputs a potential expected value using the imputation function. The built in thresholding now uses a streaming Kalman-Filter type operation, after an anomaly has been detected, of predicting the score based on the remainder of the shingle and thereby can often correct for the repeated anomaly.
We have kicked off a new standalone Rust implementation based on the same algorithm and data structures. The Rust version is not yet complete but may provide some additional performance. Going forward, we expect the independent implementations in these two languages to remain in sync with the underlying algorithm and data structures.