aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

Does the RandomCutForest class calculate CollusiveDisplacement? #55

Closed by gorold 4 years ago

gorold commented 4 years ago

Hi all,

May I check whether the RandomCutForest class returns the CollusiveDisplacement score by default?

I have looked through the code and documentation and could not figure out exactly what get_anomaly_score returns. If it returns Displacement, could you share a code snippet showing how to configure it to return CollusiveDisplacement instead?

Thank you!

sudiptoguha commented 4 years ago

Anomaly_score returns the (normalized) expected inverse height. A score below 1 is unlikely to indicate an anomaly, and a score well above 1 likely indicates an anomaly.
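To make the idea of an inverse-height score concrete, here is a minimal toy sketch in Python. It does not use the library and does not reproduce its normalization. It only assumes distinct 1-D points, uniform random cuts over the current range, and a hypothetical unnormalized score equal to the average of 1 / (depth + 1) over many random trees. The names `leaf_depths` and `inverse_depth_scores` are made up for this illustration.

```python
import random

def leaf_depths(points, rng, depth=0, out=None):
    """Recursively split distinct 1-D points with random cuts.

    Returns {point: depth at which the point ends up alone in a leaf}.
    """
    if out is None:
        out = {}
    if len(points) == 1:
        out[points[0]] = depth
        return out
    lo, hi = min(points), max(points)
    while True:
        cut = rng.uniform(lo, hi)  # cut position, uniform over the current range
        left = [p for p in points if p <= cut]
        right = [p for p in points if p > cut]
        if right:  # retry in the rare case where the cut lands on hi
            break
    leaf_depths(left, rng, depth + 1, out)
    leaf_depths(right, rng, depth + 1, out)
    return out

def inverse_depth_scores(points, num_trees=100, seed=0):
    """Average 1 / (depth + 1) over many random trees (toy, unnormalized)."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in points}
    for _ in range(num_trees):
        for p, d in leaf_depths(points, rng).items():
            totals[p] += 1.0 / (d + 1)
    return {p: t / num_trees for p, t in totals.items()}

inliers = [i / 19 for i in range(20)]  # 20 evenly spaced points in [0, 1]
outlier = 100.0                        # one point far from the cluster
scores = inverse_depth_scores(inliers + [outlier])

print("outlier score:", round(scores[outlier], 3))
print("largest inlier score:", round(max(scores[p] for p in inliers), 3))
```

The far-away point is almost always isolated by the first cut, so it sits near the root and its average inverse depth is the largest in the set. The library's score follows the same intuition but is normalized so that typical points score around 1.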

You can get Displacement using DynamicScoringRandomCutForest; see getDisplacementScore in RandomCutForestFunctionalTest. Collusive displacement is not available in the library, but it can be built using the Visitor classes.
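For readers unfamiliar with the two quantities, the following toy Python sketch shows how displacement and collusive displacement differ. It is not the library's Visitor API and not the authors' code. It assumes the tree-based reading of the original paper's definitions: displacement of a point is the size of its leaf's sibling subtree, and collusive displacement is the maximum, over the subtrees containing the point, of sibling size divided by subtree size. All names (`Node`, `build`, `disp_and_codisp`, `average_scores`) are hypothetical, and the data is 1-D for brevity.

```python
import random

class Node:
    def __init__(self, points, parent=None):
        self.points = points
        self.n = len(points)   # number of points in this subtree
        self.parent = parent
        self.children = []

def build(points, rng, parent=None):
    """Build a random cut tree over distinct 1-D points."""
    node = Node(points, parent)
    lo, hi = min(points), max(points)
    if lo < hi:
        while True:
            cut = rng.uniform(lo, hi)
            left = [p for p in points if p <= cut]
            right = [p for p in points if p > cut]
            if right:  # retry in the rare case where the cut lands on hi
                break
        node.children = [build(left, rng, node), build(right, rng, node)]
    return node

def leaves(node):
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)

def sibling(node):
    a, b = node.parent.children
    return b if a is node else a

def disp_and_codisp(leaf):
    disp = sibling(leaf).n  # points displaced if this leaf were removed
    codisp = 0.0
    node = leaf
    while node.parent is not None:  # walk up through every enclosing subtree
        codisp = max(codisp, sibling(node).n / node.n)
        node = node.parent
    return disp, codisp

def average_scores(points, num_trees=200, seed=0):
    rng = random.Random(seed)
    disp = {p: 0.0 for p in points}
    codisp = {p: 0.0 for p in points}
    for _ in range(num_trees):
        for leaf in leaves(build(points, rng)):
            d, c = disp_and_codisp(leaf)
            disp[leaf.points[0]] += d / num_trees
            codisp[leaf.points[0]] += c / num_trees
    return disp, codisp

inliers = [i / 19 for i in range(20)]  # cluster in [0, 1]
pair = [100.0, 100.5]                  # two outliers that sit close together
avg_disp, avg_codisp = average_scores(inliers + pair)

for o in pair:
    print(o, "disp:", round(avg_disp[o], 2), "codisp:", round(avg_codisp[o], 2))
```

In this setup the two far-away points usually end up as each other's sibling, so their displacement stays close to 1 even though they are clearly unusual. Collusive displacement also looks at the pair as a group, whose sibling is the whole cluster, so it stays high. This masking by near-duplicates is the case collusive displacement is meant to handle.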

gorold commented 4 years ago

Thank you @sudiptoguha for the explanation!

Based on some experiments that I have run, it seems that expected inverse height gives similar performance to displacement. Do you mind commenting on how expected inverse height fares against displacement and co-displacement?

sudiptoguha commented 4 years ago

Sorry for not getting back to this earlier. The main goal of the RCF library is to provide an environment where all these different functions can be evaluated in a streaming setting, going beyond anomaly detection. See https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2019/11/random-cut-forests/

We recommend that users experiment with different scoring functions! Different scoring functions correspond to different conceptualizations of what an anomaly is. For example, displacement provides one hypothesis, as described in the original paper. The domain will affect which scoring function is preferable, much as a particular embedding can be more relevant for a specific dataset in a specific use case.

When performance is similar, other aspects such as simplicity of implementation or ease of reasoning can serve as tiebreakers.