aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

Thresholded Random Cut Forest not detecting some anomalies with small gap #359

Closed (kkondaka closed this issue 1 year ago)

kkondaka commented 1 year ago

Example sequence of events:

  1. 1000 samples of normal data
  2. 1 anomalous sample (detected by the RCF algorithm)
  3. 1 sample of normal data
  4. 1 anomalous sample (NOT detected by the RCF algorithm)

Example code:

/*
 * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License").
 * You may not use this file except in compliance with the License.
 * A copy of the License is located at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * or in the "license" file accompanying this file. This file is distributed
 * on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
 * express or implied. See the License for the specific language governing
 * permissions and limitations under the License.
 */

package com.amazon.randomcutforest.examples.parkservices;

import com.amazon.randomcutforest.config.ForestMode;
import com.amazon.randomcutforest.config.Precision;
import com.amazon.randomcutforest.config.TransformMethod;
import com.amazon.randomcutforest.examples.Example;
import com.amazon.randomcutforest.parkservices.AnomalyDescriptor;
import com.amazon.randomcutforest.parkservices.ThresholdedRandomCutForest;

import java.util.Random;

public class T implements Example {

    public static void main(String[] args) throws Exception {
        new T().run();
    }

    @Override
    public String command() {
        return "Thresholded_example";
    }

    @Override
    public String description() {
        return "Thresholded Example";
    }

    @Override
    public void run() throws Exception {
        // Create and populate a random cut forest

        int shingleSize = 4;
        int numberOfTrees = 50;
        int sampleSize = 256;
        Precision precision = Precision.FLOAT_32;
        int dataSize = 4 * sampleSize;

        // change this to try different number of attributes,
        // this parameter is not expected to be larger than 5 for this example
        int baseDimensions = 1;

        int dimensions = baseDimensions * shingleSize;
        TransformMethod transformMethod = TransformMethod.NORMALIZE;
        ThresholdedRandomCutForest forest = ThresholdedRandomCutForest.builder().compact(true).dimensions(dimensions)
                .randomSeed(0).numberOfTrees(numberOfTrees).shingleSize(shingleSize).sampleSize(sampleSize)
                .precision(precision).anomalyRate(0.01).forestMode(ForestMode.STANDARD).build();

        long seed = new Random().nextLong();
        System.out.println("seed = " + seed);
        Random rng = new Random(seed);
        for (int i = 0; i < dataSize; i++) {
            double[] point = new double[] { 0.6 + 0.2 * (2 * rng.nextDouble() - 1) };
            AnomalyDescriptor result = forest.process(point, 0L);
        }
        AnomalyDescriptor result = forest.process(new double[] { 11.2 }, 0L);
        System.out.println("At anomaly, grade: " + result.getAnomalyGrade() + ", score: " + result.getRCFScore());
        result = forest.process(new double[] { 0.2 }, 0L);
        System.out.println("Just after anomaly, grade: " + result.getAnomalyGrade() + ", score: " + result.getRCFScore());
        result = forest.process(new double[] { 0.6 }, 0L);
        System.out.println("Next after anomaly, grade: " + result.getAnomalyGrade() + ", score: " + result.getRCFScore());
        result = forest.process(new double[] { 10.0 }, 0L);
        System.out.println("Finally, grade: " + result.getAnomalyGrade() + ", score: " + result.getRCFScore());
    }
}

sudiptoguha commented 1 year ago

Indeed, the PredictorCorrector was set too aggressively. Changing

return thresholder.getAnomalyGrade(remainder * dimensions / difference, previousIsPotentialAnomaly, triggerFactor) > 0;

to

return thresholder.getAnomalyGrade(remainder * dimensions / difference, previousIsPotentialAnomaly) > 0;

solves the issue (pushed to PR 354). Parametrized tests were added to ThresholdedRandomCutForestTest.java.

The expression (remainder * dimensions / difference) corresponds to an estimated contribution of the new input, including and after the gap. In the first expression, that contribution was evaluated against a more aggressive threshold determined by triggerFactor.
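
To make the effect of triggerFactor concrete, here is a minimal, self-contained sketch. It is not the library's PredictorCorrector or thresholder: the class GapCheckSketch, the fixed THRESHOLD, the toy grading rule, and the sample values are all assumptions made for illustration, and the previousIsPotentialAnomaly argument is omitted for brevity. It only assumes that the factor scales the threshold the contribution is compared against.

```java
// Hypothetical sketch; NOT the library's actual thresholder logic.
final class GapCheckSketch {

    // Assumed fixed threshold; the real thresholder adapts to the score stream.
    static final double THRESHOLD = 1.0;

    // Toy grade: zero at or below the scaled threshold, rising toward 1 above it.
    static double anomalyGrade(double score, double factor) {
        double scaled = THRESHOLD * factor;
        if (score <= scaled) {
            return 0.0;
        }
        return Math.min(1.0, (score - scaled) / scaled);
    }

    // Estimated contribution of the new input, using the expression from the thread.
    static double contribution(double remainder, int dimensions, double difference) {
        return remainder * dimensions / difference;
    }

    // Before the fix: the contribution must clear the threshold scaled by triggerFactor.
    static boolean flaggedBeforeFix(double remainder, int dimensions, double difference,
            double triggerFactor) {
        return anomalyGrade(contribution(remainder, dimensions, difference), triggerFactor) > 0;
    }

    // After the fix: the contribution is compared against the unscaled threshold.
    static boolean flaggedAfterFix(double remainder, int dimensions, double difference) {
        return anomalyGrade(contribution(remainder, dimensions, difference), 1.0) > 0;
    }

    public static void main(String[] args) {
        // Illustrative values only: contribution = 1.5 * 4 / 4.0 = 1.5
        double remainder = 1.5;
        int dimensions = 4;
        double difference = 4.0;
        double triggerFactor = 3.0; // hypothetical value

        System.out.println("before fix flagged: "
                + flaggedBeforeFix(remainder, dimensions, difference, triggerFactor)); // false
        System.out.println("after fix flagged: "
                + flaggedAfterFix(remainder, dimensions, difference)); // true
    }
}
```

Under these assumptions, a contribution of 1.5 falls below the scaled threshold of 3.0 and is missed, but clears the unscaled threshold of 1.0 and is flagged, which mirrors the second anomaly going undetected in the report.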

sudiptoguha commented 1 year ago

Resolved via PR 354.