aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0
206 stars 33 forks

parallel execution test #327

Closed sudiptoguha closed 2 years ago

sudiptoguha commented 2 years ago

Description of changes: RCF has parallelism enabled via a specific thread pool implementation. There have been questions about when to use it (for example, the parameter ranges where parallelism helps). Over the long series of changes from V1.0 to now, it seems that parallelEnabled almost always helps a single model, across a large range of parameters. However, when there are a large number of models (as in high-cardinality anomaly detection), it appears to be measurably better, by some percentage, to turn off parallelism within each model and instead use multiple threads to process different models. These conclusions can be tested for different settings of the boundingboxcache parameter.
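To make the many-models case concrete, here is a minimal sketch of parallelism across models rather than within one. It is an illustration, not the test added in this PR: `ToyModel` is a hypothetical stand-in (a running-mean deviation scorer) and is not the RCF API, and the class and method names are assumptions. Each model stays single-threaded, and a shared `ExecutorService` spreads the models over worker threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ModelLevelParallelism {

    /** Hypothetical per-key model; a stand-in for illustration, NOT the RCF API. */
    static final class ToyModel {
        private double sum = 0.0;
        private long count = 0;

        // Single-threaded step: score the point against the running mean, then absorb it.
        double scoreAndUpdate(double value) {
            double mean = (count == 0) ? value : sum / count;
            double score = Math.abs(value - mean);
            sum += value;
            count++;
            return score;
        }
    }

    /**
     * Feeds points[i] to models.get(i), parallelizing ACROSS models.
     * Each model is touched by exactly one task per call, so the model itself
     * needs no internal locking or internal thread pool.
     */
    static double[] scoreAll(List<ToyModel> models, double[] points, ExecutorService pool)
            throws InterruptedException, ExecutionException {
        List<Future<Double>> futures = new ArrayList<>(models.size());
        for (int i = 0; i < models.size(); i++) {
            final ToyModel model = models.get(i);
            final double point = points[i];
            Callable<Double> task = () -> model.scoreAndUpdate(point);
            futures.add(pool.submit(task));
        }
        // Waiting on every future also makes each model's updates visible
        // to whichever worker thread handles that model on the next call.
        double[] scores = new double[models.size()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = futures.get(i).get();
        }
        return scores;
    }

    public static void main(String[] args) throws Exception {
        int numModels = 1000; // stands in for a high-cardinality workload
        List<ToyModel> models = new ArrayList<>();
        double[] points = new double[numModels];
        for (int i = 0; i < numModels; i++) {
            models.add(new ToyModel());
            points[i] = i % 7;
        }
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            double[] scores = scoreAll(models, points, pool);
            System.out.println("scored " + scores.length + " models");
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this layout beats per-model parallelism depends on the number of models, the forest parameters, and the hardware, so it is best treated as a starting point for measurement.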