aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0
206 stars 33 forks source link

A multi-centroid clustering based Global-Local-Anomaly-Detector and ForestMode.DISTANCE #364

Closed sudiptoguha closed 1 year ago

sudiptoguha commented 1 year ago

Description of changes: This PR (i) extends the RCFMultiSummarize to operate on a time decay reservoir sample with periodic (self determined) recompilation of the clustering and provides an anomaly detector for an arbitrary type provided a distance function can be defined (and input). That input distance function is the global distance function. At the same time each point can be scored based on a reordering of the cluster representatives based on a local distance from the point being scored -- such a local distance can be based on the point being scored and can be useful in semi-supervision or similar strategies. This reuses components of TRCF and provides a single process() function that simultaneously returns a score/anomaly descriptor and updates internal state similar to TRCF. (ii) It raises a natural question if this new generic algorithm should be applied to numeric vectors -- towards that end the PR also introduced a forest mode DISTANCE for TRCF where the distance function computed in the existing density estimation is exposed as an alternative score (and attribution, whose components sum to the score). It is worth noting that RCF corresponds to a recursive partitioning based strategy which is not unlike partition based clustering. Examples are provided that compare both of these in a dynamic setting. Overall the PR enables the use of the existing pieces of the RCF machinery to provide alternatives and comparisons.

kaituo commented 1 year ago

Tested using AD plugin and tests pass.