aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

RCFCast #354

Closed sudiptoguha closed 1 year ago

sudiptoguha commented 1 year ago

Issue #, if available: 352

Description of changes: This PR introduces a calibrated TRCF that can be used for forecasting (and that self-calibrates). The basic operation remains that of a TRCF, but sequential analysis is necessary for calibration. To this end the PR also provides SequentialAnalysis, which takes the guesswork and inadvertent interpretation out of sequential analysis.

We note that calibration is a vast field; the strong emphasis on sequential analysis can be seen in A. P. Dawid, "Present Position and Potential Developments: Some Personal Views: Statistical Theory: The Prequential Approach", Journal of the Royal Statistical Society, Series A (General), Vol. 147, No. 2, The 150th Anniversary of the Royal Statistical Society (1984), pp. 278-292, https://www.jstor.org/stable/2981683

We recommend the article and note that in its conclusion Dawid emphasizes Markov models, cluster analysis and kernel width, Kalman-Bucy filters, and many other topics in general stochastic processes. Serendipitously, RCF has many of these capabilities, and in fact, to "reverse the usual direction of historical view" (a reference to the same article), RCF is a natural foil for calibration.

At the same time, we differ on two positions. Both arise from assertions, via Jeffreys's law, that for prequential analysis, "in just those cases where we cannot choose empirically between several forecasting systems, it turns out that we have no need to do so". From an algorithmic standpoint it is obvious that (i) different algorithms have different resource consumption and state requirements, and (ii) we may simply not have enough data for a full-blown probabilistic interpretation. Also, (almost) the entire development of streaming algorithms is post-1984, and many of the resource-consumption issues are intimately tied to the formal notion of a streaming algorithm.

And finally, if forecasts are to be consumed by alerts, it is likely that an alert is most indicative of a breakdown of stationarity, and calibration is least useful at precisely those moments. Thus, even if we can calibrate, we recommend that users of calibration actively choose to do so and decide whether it is useful.

To try it out, check out examples/src/main/java/com/amazon/randomcutforest/examples/parkservices/RCFCasterExample.java
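
For orientation, a minimal driving loop might look like the sketch below. The example file above is the authoritative usage; the parameter choices here (shingle size, horizon, sample data) are illustrative assumptions, not recommendations, and additional builder configuration may be needed in practice.

    import com.amazon.randomcutforest.parkservices.ForecastDescriptor;
    import com.amazon.randomcutforest.parkservices.RCFCaster;

    public class RCFCasterSketch {
        public static void main(String[] args) {
            int shingleSize = 8;   // assumption for illustration
            int horizon = 4;       // forecast this many steps ahead
            RCFCaster caster = RCFCaster.builder().dimensions(shingleSize)
                    .shingleSize(shingleSize).internalShinglingEnabled(true)
                    .forecastHorizon(horizon).build();
            long timestamp = 0;
            for (double value : new double[] { 1.0, 1.1, 0.9, 1.2, 1.0, 1.3 }) {
                // each call updates the (calibrated) TRCF and returns a forecast
                ForecastDescriptor descriptor = caster.process(new double[] { value }, timestamp++);
                // descriptor.getTimedForecast() carries the ranged (upper/mid/lower)
                // values over the horizon
            }
        }
    }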

kaituo commented 1 year ago

Comment on ConditionalSampleSummarizer.summarize:

Why do we add the largest distance * (1 - centrality) to the threshold?

        // note that the threshold is currently centrality * (some distance in the list)
        // thus the sequel uses a convex combination; and setting centrality = 0 removes
        // the entire filtering based on distances
        threshold += (1 - centrality) * newList.get(newList.size() - 1).distance;
        int num = 0;
        while (num < newList.size() && newList.get(num).distance <= threshold) {
            ++num;
        }

        ArrayList<Weighted<float[]>> typicalPoints = new ArrayList<>();

Is the computation of values necessary when j >= num? If not, can we change the loop condition from j < newList.size() to j < num?

        for (int j = 0; j < newList.size(); j++) {
            ConditionalTreeSample e = newList.get(j);

            float[] values;
            if (project) {
                values = new float[missingDimensions.length];
                for (int i = 0; i < missingDimensions.length; i++) {
                    values[i] = e.leafPoint[missingDimensions[i]];
                }
            } else {
                values = Arrays.copyOf(e.leafPoint, dimensions);
            }
            if (j < num) { // weight is changed for clustering,
                // based on the distance of the sample from the query point
                double weight = (e.distance <= threshold) ? e.weight : e.weight * threshold / e.distance;
                typicalPoints.add(new Weighted<>(values, (float) weight));
            }
        }
kaituo commented 1 year ago

Comment on NormalizedTransformer.invertForecastRange:

Why does NormalizedTransformer.invertForecastRange need to add i * deviations[j + inputLength].getMean()? deviations[j + inputLength] corresponds to the discounted average of the single-step differences of the same variable j. I don't see the same amount being subtracted during normalization.

    @Override
    public void invertForecastRange(RangeVector ranges, int baseDimension, double[] previousInput) {
        int inputLength = weights.length;
        int horizon = ranges.values.length / baseDimension;
        for (int i = 0; i < horizon; i++) {
            for (int j = 0; j < inputLength; j++) {
                double factor = 2 * (deviations[j].getDeviation() + DEFAULT_NORMALIZATION_PRECISION);
                ranges.scale(i * baseDimension + j, (float) factor);
                double shift = deviations[j].getMean() + i * deviations[j + inputLength].getMean();
                ranges.shift(i * baseDimension + j, (float) shift);
                float weight = (weights[j] == 0) ? 0 : 1.0f / (float) weights[j];
                ranges.scale(i * baseDimension + j, weight);
            }
        }
    }
kaituo commented 1 year ago

Haven't finished a full pass of the PR yet. Will continue.

sudiptoguha commented 1 year ago

Re the comment on NormalizedTransformer.invertForecastRange:

Take a look at the superclass WeightedTransformer and its updateDeviation:

    public void updateDeviation(double[] inputPoint, double[] previousInput) {
        checkArgument(inputPoint.length * 2 == deviations.length, "incorrect lengths");
        for (int i = 0; i < inputPoint.length; i++) {
            // the first half of deviations tracks the values themselves
            deviations[i].update(inputPoint[i]);
            // the second half tracks the single-step differences
            if (deviations[i + inputPoint.length].getCount() == 0) {
                deviations[i + inputPoint.length].update(0);
            } else {
                deviations[i + inputPoint.length].update(inputPoint[i] - previousInput[i]);
            }
        }
    }
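
Our reading of the above, stated explicitly as an interpretation: deviations[j] tracks the discounted mean of variable j itself, while deviations[j + inputLength] tracks the discounted mean of its single-step differences, i.e. a drift estimate. Inverting a forecast i steps ahead therefore shifts by the mean plus i steps of drift; a standalone illustration (hypothetical helper, not library code):

    // expected level of variable j at horizon step i: the discounted mean of
    // the variable plus i steps of the estimated per-step drift
    static double forecastShift(double mean, double meanStepDifference, int i) {
        return mean + i * meanStepDifference;
    }

This matches the shift computed in invertForecastRange above.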

sudiptoguha commented 1 year ago

Re the comment on ConditionalSampleSummarizer.summarize:

We add (1 - centrality) * largestDistance because the threshold (after the addition) becomes the convex combination

    centrality * distance[j] + (1 - centrality) * largestDistance

So setting centrality = 1 picks a lower threshold and performs filtering, whereas setting centrality = 0 basically chooses every point returned by all the trees. The difference between num and newList.size() matters for the clustering method (used for the most likely value): for clustering, the filtering is a useful denoising step. But otherwise we provide the p50 value of every value returned from the trees in the forecast (this is the j >= num part).
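
As a standalone sketch of that logic (method and variable names are ours, not the library's): points within the blended threshold keep their weight, and farther points are attenuated by threshold / distance.

    // convex combination of a per-point distance and the largest distance
    static double blendedThreshold(double centrality, double distanceJ, double largestDistance) {
        return centrality * distanceJ + (1 - centrality) * largestDistance;
    }

    // weight used for clustering: unchanged inside the threshold, scaled
    // down by threshold / distance outside it
    static double clusterWeight(double weight, double distance, double threshold) {
        return (distance <= threshold) ? weight : weight * threshold / distance;
    }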

kaituo commented 1 year ago

Finished one pass. Will discuss the follow-up questions offline.

sudiptoguha commented 1 year ago

Re "Is the computation of values necessary when j >= num? If no, can we change the loop condition from j < newList.size() to j < num?" --> no, the code can be simplified as suggested. Changing.