aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0
213 stars, 34 forks

Fix calculation of gap thresholds #408

Closed kaituo closed 1 month ago

kaituo commented 1 month ago

Issue #, if available:

Description of changes: In the calculation of gapLow[y] and gapHigh[y], the ratio-based thresholds incorrectly used Math.abs(a), where a = scale[y] * point[startPosition + y]. Since point[startPosition + y] is the normalized value (x - mean) / std, multiplying it by scale[y] (which is std) gives (x - mean), not x.

However, to accurately compute the thresholds based on the actual value x, we need to add back the mean (shiftBase). Therefore, (a + shiftBase) equals (x - mean) + mean = x.

The corrected code now uses Math.abs(a + shiftBase). See the changes in PredictorCorrector for details.
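A minimal sketch of the arithmetic described above, to show why the missing shift matters. This is not the PredictorCorrector code: the class name, the method names, and the `ratio` parameter are hypothetical, and the sketch assumes the ratio-based threshold is simply the ratio multiplied by the absolute value in question.

```java
// Illustrative sketch only; names and the threshold form (ratio * |value|) are assumptions.
final class GapThresholdSketch {

    // Normalization as described in the PR text: (x - mean) / std.
    static double normalize(double x, double mean, double std) {
        return (x - mean) / std;
    }

    // Before the fix: the ratio is applied to |a|, and a = scale * normalized = (x - mean).
    static double ratioThresholdBeforeFix(double normalized, double scale, double ratio) {
        double a = scale * normalized;
        return ratio * Math.abs(a);
    }

    // After the fix: the mean (shiftBase) is added back, so the ratio is applied to |x|.
    static double ratioThresholdAfterFix(double normalized, double scale,
                                         double shiftBase, double ratio) {
        double a = scale * normalized;
        return ratio * Math.abs(a + shiftBase);
    }

    public static void main(String[] args) {
        double x = 1000.0, mean = 990.0, std = 5.0, ratio = 0.25;
        double normalized = normalize(x, mean, std);
        // The pre-fix threshold scales with the deviation from the mean,
        // the post-fix threshold scales with the actual value.
        System.out.println(ratioThresholdBeforeFix(normalized, std, ratio));
        System.out.println(ratioThresholdAfterFix(normalized, std, mean, ratio));
    }
}
```

With these sample numbers the deviation from the mean is 10 while the actual value is 1000, so a threshold meant to be a fraction of x comes out two orders of magnitude too small when the shift is omitted.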

Testing done:

  1. Added an integration test (IT).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.