aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

RCF4.0 and PredictiveRCF #401

Closed sudiptoguha closed 9 months ago

sudiptoguha commented 11 months ago

Description of changes: This PR initiates RCF 4.0. The primary motivation is the realization that, although RCF has been built in layers over time, some of the streaming normalization is standard yet extremely useful. This preprocessing functionality existed in ParkServices, and it is increasingly clear that using RCFs without these normalizations is not really helpful. The entire preprocessing is therefore moved to core, rationalizing the configs with the code. As an example benefit, we introduce a PredictiveRCF that updates on vectors over attribute dimensions A and B and, given values in dimensions A, provides a clustering over candidate values in dimensions B. This capability already existed in imputeMissingValues(), but adding the preprocessing (and the inverse map) and exposing the clustering would likely be useful. As a consequence, we can use this predictor to estimate forecasting errors in RCFCast, reducing the amount of state required to calibrate the output.
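
To make the "preprocessing plus inverse map" idea concrete, here is a minimal sketch of a streaming normalizer. It is not the preprocessor in this repository: the class name StreamingNormalizer and its methods are hypothetical, and it uses plain Welford running statistics (no time decay), which may differ from what the RCF core actually does.

```java
// Hypothetical sketch, NOT the RCF preprocessor API.
// Maintains a running mean and variance with Welford's method and
// exposes both the forward map (normalize) and the inverse map (invert).
final class StreamingNormalizer {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    /** Folds one raw value into the running statistics. */
    void update(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
    }

    /** Population standard deviation of the values seen so far. */
    double stdDev() {
        return count > 1 ? Math.sqrt(m2 / count) : 0.0;
    }

    /** Forward map: raw value to normalized space. */
    double normalize(double value) {
        double sd = stdDev();
        return sd > 0 ? (value - mean) / sd : 0.0;
    }

    /** Inverse map: normalized value back to the original scale. */
    double invert(double normalized) {
        return normalized * stdDev() + mean;
    }
}
```

A predictor working in normalized space would call normalize() on inputs before updating the forest and invert() on any candidate values it returns, so callers only ever see values on the original scale.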

In addition, new tests have been added and the test coverage of ParkServices is significantly higher.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

kaituo commented 10 months ago

For TimedRangeVector, package name 'com.amazon.randomcutforest.parkservices.returntypes' does not correspond to the file path 'com.amazon.randomcutforest.returntypes'.

kaituo commented 10 months ago

Should we change the following code

public GenericAnomalyDescriptor(List<Weighted<P>> representative, double score, double threshold,
            double anomalyGrade) {
        this.representativeList = representativeList;

to

public GenericAnomalyDescriptor(List<Weighted<P>> representative, double score, double threshold,
            double anomalyGrade) {
        this.representativeList = representative;

?
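
For context on why this matters: because the parameter is named representative, the name representativeList on the right-hand side resolves to the field itself, so the assignment is a no-op and the field keeps its default value of null. A minimal sketch with hypothetical stand-in classes (BuggyDescriptor and FixedDescriptor are illustrative only, not classes in this repository):

```java
import java.util.List;

// Hypothetical stand-ins for GenericAnomalyDescriptor; only the assignment matters.
final class BuggyDescriptor<P> {
    List<P> representativeList;

    BuggyDescriptor(List<P> representative) {
        // Self-assignment: the right-hand side is the field, not the parameter,
        // so the field is left at its default value (null).
        this.representativeList = representativeList;
    }
}

final class FixedDescriptor<P> {
    List<P> representativeList;

    FixedDescriptor(List<P> representative) {
        // The parameter is stored as intended.
        this.representativeList = representative;
    }
}
```

With the buggy form, any later read of representativeList sees null regardless of what the caller passed in.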

kaituo commented 10 months ago

In the constructor of ImputeVisitor, are we missing four fields including box, converged, pointIndex, and randomRank?

ImputeVisitor(ImputeVisitor original) {
        int length = original.queryPoint.length;
        this.queryPoint = Arrays.copyOf(original.queryPoint, length);
        this.missing = Arrays.copyOf(original.missing, length);
        this.dimensionsUsed = new int[original.dimensionsUsed.length];
        this.randomSeed = new Random(original.randomSeed).nextLong();
        this.centrality = original.centrality;
        anomalyRank = DEFAULT_INIT_VALUE;
        distance = DEFAULT_INIT_VALUE;
    }

sudiptoguha commented 10 months ago

For TimedRangeVector, package name 'com.amazon.randomcutforest.parkservices.returntypes' does not correspond to the file path 'com.amazon.randomcutforest.returntypes'.

fixed. Thx.

sudiptoguha commented 10 months ago

Should we change the following code

public GenericAnomalyDescriptor(List<Weighted<P>> representative, double score, double threshold,
            double anomalyGrade) {
        this.representativeList = representativeList;

to

public GenericAnomalyDescriptor(List<Weighted<P>> representative, double score, double threshold,
            double anomalyGrade) {
        this.representativeList = representative;

?

Fixed. Thx.

sudiptoguha commented 10 months ago

In the constructor of ImputeVisitor, are we missing four fields including box, converged, pointIndex, and randomRank?

ImputeVisitor(ImputeVisitor original) {
        int length = original.queryPoint.length;
        this.queryPoint = Arrays.copyOf(original.queryPoint, length);
        this.missing = Arrays.copyOf(original.missing, length);
        this.dimensionsUsed = new int[original.dimensionsUsed.length];
        this.randomSeed = new Random(original.randomSeed).nextLong();
        this.centrality = original.centrality;
        anomalyRank = DEFAULT_INIT_VALUE;
        distance = DEFAULT_INIT_VALUE;
    }

We are missing them -- and that is intentional :) But perhaps the naming is inappropriate: this is a private constructor invoked by copy(), and that name is the mistake. Renamed copy() to partialCopy(), which reflects the intention. The values that are copied are fixed for the query; the other values are provided by the leaves in different branches. The partial copy is triggered when the partitioning coordinate is a missing value.
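
The partial-copy pattern described above can be sketched roughly as follows. This is a simplified, hypothetical class (ImputeVisitorSketch), not the real ImputeVisitor: the field types, the sentinel value of DEFAULT_INIT_VALUE, the acceptLeaf() method, and the L1 distance are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical, simplified visitor; NOT the real ImputeVisitor.
final class ImputeVisitorSketch {
    static final double DEFAULT_INIT_VALUE = Double.MAX_VALUE; // assumed sentinel

    // Fixed for the whole query: these are what partialCopy() carries over.
    final float[] queryPoint;
    final boolean[] missing;
    final double centrality;

    // Supplied by whichever leaf a branch reaches: reset in every copy.
    double distance = DEFAULT_INIT_VALUE;
    int pointIndex = -1;
    boolean converged = false;

    ImputeVisitorSketch(float[] queryPoint, boolean[] missing, double centrality) {
        // Defensive copies so that sibling branches never share arrays.
        this.queryPoint = Arrays.copyOf(queryPoint, queryPoint.length);
        this.missing = Arrays.copyOf(missing, missing.length);
        this.centrality = centrality;
    }

    /** Copies only the query-level state; leaf-level state starts fresh. */
    ImputeVisitorSketch partialCopy() {
        return new ImputeVisitorSketch(queryPoint, missing, centrality);
    }

    /** Fills the missing coordinates from a leaf and records leaf-level state. */
    void acceptLeaf(int leafIndex, float[] leafPoint) {
        double sum = 0.0;
        for (int i = 0; i < queryPoint.length; i++) {
            if (missing[i]) {
                queryPoint[i] = leafPoint[i]; // impute from the leaf
            } else {
                sum += Math.abs(queryPoint[i] - leafPoint[i]); // L1 distance on known coordinates
            }
        }
        distance = sum;
        pointIndex = leafIndex;
        converged = true;
    }
}
```

When the cut dimension is one of the missing coordinates, a traversal would call partialCopy() once per branch, so each branch reaches its own leaf without overwriting what the other branch found.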

kaituo commented 10 months ago

Read up to Java/core/src/main/java/com/amazon/randomcutforest/preprocessor/ImputePreprocessor.java in commit https://github.com/aws/random-cut-forest-by-aws/pull/401/commits/c8721d4a2b5f7c513153fe1d9279ac68ff546c9d