aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

introducing multi-mode(l) operation and initiating rcf 3.8 #389

Closed sudiptoguha closed 1 year ago

sudiptoguha commented 1 year ago

Issue #, if available: resolves #388 and #387

Description of changes: Ensembles of models have long been used to refine results. However, a significant drawback is that the space required to store the numerous models increases rapidly, and by definition only one or a few models are eventually used. One of the driving forces behind the creation of the RandomCutForest repository was to expose the RCF data structure (see https://opensearch.org/blog/random-cut-forests/); even though it was originally used for anomaly detection, it has also been used in forecasting and density estimation applications. In RCF 3.7 the forecasting capabilities were used in a predictor-corrector (auto-reinforcement) setup to reduce false positives, with only a minimal increase in model size.

This PR takes a step further and introduces multi-mode(l) operations, specifically using the DISTANCE computation from density estimation to augment (as well as provide an alternative to) the EXPECTED_INVERSE_DEPTH scoring. The latter is useful since it can provably provide conformal forecasts (RCFCast), yet the recursive distance estimation provides a somewhat orthogonal option. MULTI_MODE and MULTI_MODE_RECALL increase precision and recall, respectively, over the current defaults, again with a minimal increase in model size (although computation does increase).

We intend to add other scoring strategies and (auto-)optimize the defaults in subsequent PRs. Please open an issue if any specific scoring mode is desirable. We will be adding CO_DISPLACEMENT eventually.
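To make the precision/recall trade-off between the modes concrete, here is a minimal, self-contained sketch. It is not the library's code: the `Mode` enum, the `isAnomaly` helper, and the thresholds are hypothetical, and the rule that MULTI_MODE requires both scorers to agree while MULTI_MODE_RECALL accepts either one is an assumption made purely for illustration.

```java
public class MultiModeSketch {

    // Hypothetical enum; the names mirror the strategies mentioned in this PR,
    // but this is NOT the library's own type.
    enum Mode { EXPECTED_INVERSE_DEPTH, DISTANCE, MULTI_MODE, MULTI_MODE_RECALL }

    // Assumed combination rule (illustration only):
    //   MULTI_MODE        -> both scorers must exceed their thresholds (fewer false positives)
    //   MULTI_MODE_RECALL -> either scorer exceeding its threshold suffices (fewer misses)
    static boolean isAnomaly(Mode mode,
                             double depthScore, double depthThreshold,
                             double distanceScore, double distanceThreshold) {
        boolean byDepth = depthScore > depthThreshold;
        boolean byDistance = distanceScore > distanceThreshold;
        switch (mode) {
            case EXPECTED_INVERSE_DEPTH:
                return byDepth;
            case DISTANCE:
                return byDistance;
            case MULTI_MODE:
                return byDepth && byDistance;
            case MULTI_MODE_RECALL:
                return byDepth || byDistance;
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        // A point that only the depth-based scorer considers anomalous.
        double depthScore = 2.0, depthThreshold = 1.0;
        double distanceScore = 0.5, distanceThreshold = 1.0;
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> "
                    + isAnomaly(mode, depthScore, depthThreshold, distanceScore, distanceThreshold));
        }
    }
}
```

Under these assumptions, a point flagged by only one scorer is rejected in MULTI_MODE and accepted in MULTI_MODE_RECALL. The extra computation mentioned above corresponds to evaluating both scores for every point.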

The PR also resolves a few corner cases of parameter settings and sundry issues.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

kaituo commented 1 year ago

I have completed a partial code review and will return to review the remaining files (mapper/state and new commit) later.

kaituo commented 1 year ago

CI is failing:

Error: SummaryTest{BiFunction}[2] Time elapsed: 185.454 s <<< FAILURE! java.lang.AssertionError at com.amazon.randomcutforest.SampleSummaryTest.SummaryTest(SampleSummaryTest.java:73)

sudiptoguha commented 1 year ago

> CI is failing:
>
> Error: SummaryTest{BiFunction}[2] Time elapsed: 185.454 s <<< FAILURE! java.lang.AssertionError at com.amazon.randomcutforest.SampleSummaryTest.SummaryTest(SampleSummaryTest.java:73)

Yes, it's a randomized test, and the number of clusters can exceed the tested value with a probability of about 1 in 50.
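As a rough illustration of why such a test shows up in CI now and then, the sketch below computes the chance of seeing at least one failure over several independent runs. It assumes the 1 in 50 figure is the per-run failure probability and that runs are independent; the class and method names are hypothetical.

```java
public class FlakyTestOdds {

    // Probability that at least one of `runs` independent executions fails,
    // given a per-run failure probability: 1 - (1 - p)^runs.
    static double probabilityOfAtLeastOneFailure(double perRunFailure, int runs) {
        return 1.0 - Math.pow(1.0 - perRunFailure, runs);
    }

    public static void main(String[] args) {
        double perRunFailure = 1.0 / 50.0; // assumed from the "1 in 50" remark above
        for (int runs : new int[] {1, 10, 50}) {
            System.out.printf("runs=%d -> P(at least one failure)=%.3f%n",
                    runs, probabilityOfAtLeastOneFailure(perRunFailure, runs));
        }
    }
}
```

With these assumptions, a single run fails 2% of the time, while 50 runs see at least one failure roughly 64% of the time, so an occasional red CI build is expected rather than a sign of a regression.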