aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0
211 stars 34 forks source link

Revise interface for dynamic-score and dynamic-simulated-score #290

Open jotok opened 2 years ago

jotok commented 2 years ago

In RCF3, we plan to drop the DynamicScoringRandomCutForest class. This class added API methods to the base RandomCutForest class that allowed users to customize the anomaly scoring algorithms by directly passing lambdas to the method. The intention of these methods was to encourage experimentation with the anomaly score algorithm, but I believe these methods are too specialized and not used frequently enough to appear in the API of RandomCutForest. Users have always had the capability to create specialized scoring routines through the traverseForest method, but the signature on this method may be a little daunting to new users. I propose to split the difference and create a new API method that takes a VisitorFactory<Double> or Supplier<Visitor<Double>> and averages the results of visitors across trees.

Current method in DynamicScoringRandomCutForest and example usage:

public double getDynamicScore(double[] point, int ignoreLeafMassThreshold, BiFunction<Double, Double, Double> seen,
            BiFunction<Double, Double, Double> unseen, BiFunction<Double, Double, Double> damp);

public double getDynamicSimulatedScore(double[] point, BiFunction<Double, Double, Double> seen,
            BiFunction<Double, Double, Double> unseen, BiFunction<Double, Double, Double> damp,
            Function<IBoundingBoxView, double[]> vecSep);

double result = forest.getDynamicScore(point, 0, (x, y) -> 1.0 * (x + Math.log(y)), (x, y) -> 1.0 * x, (x, y) -> 1.0);

Proposed methods and example usage:

public double getAverageScalarResult(double[] point, VisitorFactory<Double> visitorFactory);

public double getAverageScalarResult(double[] point, Supplier<Visitor<Double>> supplier);

double result = forest.getAverageScalarResult(point, (tree, y) -> new DynamicScoreVisitor(
                tree.projectToTree(y), tree.getMass(), forest.getIgnoreLeafMassThreshold(), (x, y) -> 1.0 * (x + Math.log(y)), (x, y) -> 1.0 * x, (x, y) -> 1.0);

One issue, seen in the example, is that with the current implementation we have access to forest configuration fields like ignoreLeafMassThreshold. In the proposed implementation, this has to be passed into the closure, which fields a little awkward. We may also want to rethink VisitorFactory interface.