aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

RCF 4.0 in Rust #397

Closed: sudiptoguha closed this 9 months ago

sudiptoguha commented 10 months ago

Issue #, if available: #352

Description of changes: This PR upgrades the Rust RCF to be at least on par with the Java version on many of its features (built-in thresholding via a predictor-corrector/Kalman-style filter, streaming transformations, calibrated conformal forecasts, verbose/non-verbose setups, and a multitude of bug fixes), and takes two additional major steps forward.
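To make the predictor-corrector idea concrete, here is a minimal, self-contained sketch of a 1-D Kalman-style filter used as a score thresholder. This is an illustration only: the `KalmanThresholder` name, its fields, and its methods are hypothetical, not the crate's actual API, which couples such a filter with RCF scores and streaming transformations.

```rust
// Minimal 1-D Kalman-style predictor-corrector used as an anomaly
// thresholder. Illustrative sketch only; NOT the crate's actual API.
struct KalmanThresholder {
    estimate: f64,          // current smoothed score estimate
    variance: f64,          // uncertainty of the estimate
    process_noise: f64,     // how fast the underlying level may drift
    measurement_noise: f64, // how noisy individual scores are
}

impl KalmanThresholder {
    fn new() -> Self {
        Self { estimate: 0.0, variance: 1.0, process_noise: 1e-3, measurement_noise: 1e-1 }
    }

    /// Predict, then correct with the observed score; return the deviation
    /// of the observation from the prediction in units of the predicted
    /// standard deviation (a crude anomaly "grade").
    fn update(&mut self, observed_score: f64) -> f64 {
        // Predict step: the state is assumed to drift slowly.
        let predicted = self.estimate;
        let predicted_var = self.variance + self.process_noise;

        // Deviation of the new observation from the prediction.
        let innovation = observed_score - predicted;
        let innovation_std = (predicted_var + self.measurement_noise).sqrt();

        // Correct step: blend prediction and observation.
        let gain = predicted_var / (predicted_var + self.measurement_noise);
        self.estimate = predicted + gain * innovation;
        self.variance = (1.0 - gain) * predicted_var;

        innovation / innovation_std
    }
}

fn main() {
    let mut filter = KalmanThresholder::new();
    // Scores hover near 1.0, then jump; the grade spikes on the jump.
    for (i, s) in [1.0, 1.1, 0.9, 1.0, 3.5, 1.0].iter().enumerate() {
        let grade = filter.update(*s);
        println!("t={} score={:.2} grade={:+.2}", i, s, grade);
    }
}
```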

First, as mentioned earlier here (https://opensearch.org/blog/random-cut-forests/), the tantalizing appeal of RCFs is that every random forest computation can be switched to a streaming variant (with changes to avoid daisy-chaining behavior and to use randomization more). Towards this end, the PR adds the capability of storing (many types of) arbitrary Labels and defining attributes that can be propagated, as is typical in semi-supervision. The constructors are upgraded with builders to provide an easier foundation; a sketch of the builder pattern follows.
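For readers unfamiliar with the pattern, here is a hedged sketch of what such a builder could look like. The `ForestBuilder` and `ForestConfig` names, fields, and defaults below are hypothetical illustrations, not the PR's actual types.

```rust
// Illustrative builder sketch for configuring a forest; all names and
// defaults here are hypothetical, not the crate's actual API.
#[derive(Debug)]
struct ForestConfig {
    dimensions: usize,
    shingle_size: usize,
    number_of_trees: usize,
    store_labels: bool, // stand-in for the PR's label/attribute support
}

#[derive(Default)]
struct ForestBuilder {
    dimensions: Option<usize>,
    shingle_size: usize,
    number_of_trees: usize,
    store_labels: bool,
}

impl ForestBuilder {
    fn new() -> Self {
        // Sensible illustrative defaults; required fields stay unset.
        Self { shingle_size: 1, number_of_trees: 30, ..Default::default() }
    }
    fn dimensions(mut self, d: usize) -> Self { self.dimensions = Some(d); self }
    fn shingle_size(mut self, s: usize) -> Self { self.shingle_size = s; self }
    fn number_of_trees(mut self, n: usize) -> Self { self.number_of_trees = n; self }
    fn store_labels(mut self, b: bool) -> Self { self.store_labels = b; self }

    /// Validate required fields at the end, so optional settings can be
    /// supplied in any order; this is the usual payoff of the pattern.
    fn build(self) -> Result<ForestConfig, String> {
        let dimensions = self.dimensions.ok_or("dimensions must be set")?;
        Ok(ForestConfig {
            dimensions,
            shingle_size: self.shingle_size,
            number_of_trees: self.number_of_trees,
            store_labels: self.store_labels,
        })
    }
}

fn main() {
    let config = ForestBuilder::new()
        .dimensions(1)
        .shingle_size(10)
        .store_labels(true)
        .build()
        .expect("valid configuration");
    println!("{:?}", config);
}
```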

Second, over RCF 2.0 and RCF 3.0 the model sizes have shrunk by about two orders of magnitude, and it is possible to run many different models on a single node; see the "hello world" example here (https://opensearch.org/blog/introducing-extensions-for-opensearch/). But the question remains: do we need a thousand models? Clearly a joint model offers the possibility of compression, but carries a risk in terms of fidelity. This PR introduces a MultiTRCF, where multiple time series (each with its own independent streaming transformations) can share the same (few) RCFs. Of course, this introduces problems of state partitioning and the reinforcement mechanisms typical of multi-armed bandits (https://en.wikipedia.org/wiki/Multi-armed_bandit), which connect back to the older threads of clustering (RCF itself has a fairly good implicit clustering based on its multi-level construction); a bandit sketch follows the output below.

Surprisingly, on a commodity laptop supporting 4 threads, the parallel version can process 1000 updates on each of 1000 time series in about 10 seconds with 100 MB of RAM (an order of magnitude or more of compression). Check out Rust/tests/multitrcftest.rs; a typical output is shown below for synthetic single-dimensional waves with periods chosen uniformly in [30, 90) and a shingle size of 10. As expected, the spot precision and recall show a loss due to the compression of models; however, when late detection is accounted for, the precision and recall are healthy. This underscores the fact that RCFs can capture approximate periodicity given some nontrivial shingle size. Note that anomaly detection is fundamentally transductive, so late detection is plausible, setting it apart from a forecasting/inductive process, where a mixture of periodicities is unhelpful. Notice also the coalescence of the models, where Model 0 seems to be in the process of being abandoned.

---- multi_trcf_multi_threaded stdout ----
number of time series: 1000, size 1000 each, 3 arms, parallel enabled: true
shingle size: 10, normalization: NORMALIZE, anomalies in total 3022
spot precision 0.5479082321187584 recall 0.40304434149569823
with late detection, precision 0.9437696806117859 recall 0.6942422236929185
nontrivial bandit switches 1385, affirmations 776
model updates across different arms: (0, 2302) (1, 3837) (2, 4003)
Model 0, chosen by 84, average period 67.28571428571429, deviation 10.406695908303034
Model 1, chosen by 360, average period 46.75555555555555, deviation 11.880713700139882
Model 2, chosen by 554, average period 66.72563176895306, deviation 15.989182716794764
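The reinforcement mechanism mentioned above can be illustrated with a small, self-contained epsilon-greedy sketch, in which each time series (a "player") repeatedly picks one of a few shared models ("arms") based on a running average reward. Everything here, including `ArmChooser` and the xorshift generator, is a hypothetical illustration under the assumption that higher reward means a better model fit; it is not the MultiTRCF implementation (see Rust/tests/multitrcftest.rs for the real test).

```rust
use std::collections::HashMap;

// Epsilon-greedy arm selection, sketching how a time series could pick
// one of a few shared models. Hypothetical illustration only.
struct ArmChooser {
    reward: Vec<f64>, // running average reward per arm
    count: Vec<u64>,  // number of pulls per arm
    epsilon: f64,     // exploration probability
    rng_state: u64,   // tiny xorshift state, to stay dependency-free
}

impl ArmChooser {
    fn new(arms: usize, epsilon: f64, seed: u64) -> Self {
        Self { reward: vec![0.0; arms], count: vec![0; arms], epsilon, rng_state: seed.max(1) }
    }

    fn next_f64(&mut self) -> f64 {
        // xorshift64: crude but deterministic pseudo-randomness in [0, 1]
        let mut x = self.rng_state;
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        self.rng_state = x;
        (x as f64) / (u64::MAX as f64)
    }

    fn choose(&mut self) -> usize {
        if self.next_f64() < self.epsilon {
            // explore: pick a random arm
            (self.next_f64() * self.reward.len() as f64) as usize % self.reward.len()
        } else {
            // exploit: pick the arm with the best average reward so far
            self.reward.iter().enumerate()
                .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
                .map(|(i, _)| i).unwrap()
        }
    }

    fn record(&mut self, arm: usize, reward: f64) {
        self.count[arm] += 1;
        let n = self.count[arm] as f64;
        self.reward[arm] += (reward - self.reward[arm]) / n; // running mean
    }
}

fn main() {
    let mut chooser = ArmChooser::new(3, 0.1, 42);
    let mut updates: HashMap<usize, u64> = HashMap::new();
    for _ in 0..1000 {
        let arm = chooser.choose();
        // Pretend arm 1 models this series best (highest reward);
        // the player converges on it while occasionally exploring.
        let reward = if arm == 1 { 1.0 } else { 0.2 };
        chooser.record(arm, reward);
        *updates.entry(arm).or_insert(0) += 1;
    }
    println!("updates per arm: {:?}", updates);
}
```

A natural reward in this setting would be something like a negative anomaly score or forecast error for the shared model on the player's recent points, though the PR's exact reinforcement mechanism may differ.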

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.