aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0
210 stars 33 forks source link

initiating RCF3.0 #283

Closed sudiptoguha closed 2 years ago

sudiptoguha commented 2 years ago

Description of changes: This PR initiates RCF 3.0.

RCF 2.0 introduced multiple primitives, such as bounding box caching, tree representations, columnar compression to provide a space-time tradeoff. However those primitives were not optimized in any manner. RCF 3.0 reorganizes those primitives to provide a more optimal (and hopefully cleaner) tradeoff.

None of the visitor classes were changed and no inference is expected to vary in this PR. There is unlikely to be change in performance of RCF2.0 and RCF3.0 for large model sizes (bounding box cache 1). The RCF 2.0 files were not changed (except minor bug fixes and plumbing to allow direct comparison).

RCF3.0 will focus exclusively on single precision. Double precision calculations over randomized data structures is unlikely to be helpful in the context of RCF. Moreover the newer logic in RCF 3.0 literally "thinks outside the bounding box". While the bounding box description is easy to follow, relying on such descriptions inside an algorithm may be inimical to algorithmic efficiency. We note that efficiency is a key distinguishing feature of RCFs vis-a-vis random forests.

This PR also overhauls the Rust version to the same exact algorithm as the Java version. RCF 2.0 was designed with Rust and memory safely in mind. The parallel implementation underscores that point. We see the Rust version as a mechanism of additional verification of the algorithmic ideas and expect the different language implementations to remain in sync. None of these versions are superior to the other in performance, and we expect to continue to do everything to keep the Java version as performant as possible.