aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0
213 stars 34 forks source link

Fix Confidence Adjustment for Larger Shingle Sizes #407

Closed kaituo closed 3 months ago

kaituo commented 3 months ago

Description of changes:

This PR addresses further adjustments to the confidence calculation issue discussed in PR 405. While PR 405 successfully resolved the issue for a shingle size of 4, it did not achieve the same results for larger shingle sizes like 8.

Key Changes

  1. Refinement of seenValues Calculation:
    • Previously, the formula increased confidence even as numImputed (number of imputations seen) increased because seenValues (all values seen) also increased.
    • This PR fixes the issue by counting only non-imputed values as seenValues.
  2. Upper Bound for numImputed:
    • The numImputed is now upper bounded to the shingle size.
    • The impute fraction calculation, which uses numberOfImputed * 1.0 / shingleSize, now ensures the fraction does not exceed 1.
  3. Decrementing numberOfImputed:
    • The numberOfImputed is decremented when there is no imputation.
    • Previously, numberOfImputed remained unchanged when there is an imputation as there was both an increment and a decrement, keeping the imputation fraction constant. This PR ensures the imputation fraction accurately reflects the current state. This adjustment ensures that the forest update decision, which relies on the imputation fraction, functions correctly. The forest is updated only when the imputation fraction is below the threshold of 0.5.

Testing

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

amitgalitz commented 3 months ago

The numImputed is now upper bounded to the shingle size.

Do we want to do this because cause we only care for the fraction in relation to the current shingle?

kaituo commented 3 months ago

The numImputed is now upper bounded to the shingle size.

Do we want to do this because cause we only care for the fraction in relation to the current shingle?

yes.