aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

handling models created via rc1 in rc2 #322

Closed sudiptoguha closed 2 years ago

sudiptoguha commented 2 years ago

Issue #, if available: 321

Description of changes: If a model is created using 3.0-rc1, the unused part of the pointstore array is initialized to 0, whereas in the newer encoding the unused part is INFEASIBLE_LOCN (-1). This causes models created using 3.0-rc1 to generate errors when used with 3.0-rc2. We recommend using the most recent models for efficiency of size and compute.
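To illustrate the mismatch, here is a minimal Java sketch, not the code in this PR. The class `SentinelMigrationSketch`, the helpers `normalizeUnusedTail` and `countFeasible`, and the `usedCount` parameter are hypothetical names. It also assumes that the used entries occupy a prefix of the array, which the description above does not state.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from the library.
class SentinelMigrationSketch {
    // Sentinel value named in the description above.
    static final int INFEASIBLE_LOCN = -1;

    // Returns a copy in which every slot at index >= usedCount holds the sentinel.
    // Assumption: slots [0, usedCount) are in use and the rest are unused.
    static int[] normalizeUnusedTail(int[] locations, int usedCount) {
        if (usedCount < 0 || usedCount > locations.length) {
            throw new IllegalArgumentException("usedCount out of range: " + usedCount);
        }
        int[] result = Arrays.copyOf(locations, locations.length);
        Arrays.fill(result, usedCount, result.length, INFEASIBLE_LOCN);
        return result;
    }

    // Counts slots that a reader expecting the -1 sentinel would treat as in use.
    static int countFeasible(int[] locations) {
        int count = 0;
        for (int location : locations) {
            if (location != INFEASIBLE_LOCN) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Old-style array: three used slots, unused tail left at 0.
        int[] oldStyle = {0, 3, 6, 0, 0};
        int[] newStyle = normalizeUnusedTail(oldStyle, 3);
        System.out.println("feasible slots before: " + countFeasible(oldStyle));
        System.out.println("feasible slots after:  " + countFeasible(newStyle));
        System.out.println(Arrays.toString(newStyle));
    }
}
```

The point of the sketch is that 0 is also a legitimate location, so a reader that expects -1 for unused slots cannot tell an old-style unused slot from a used one without extra information such as a count of used entries.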

ylwu-amzn commented 2 years ago

Thanks for fixing this quickly. ml-commons 1.3.1 uses PR #309 to build a local jar, but RCF RC2 is built with PR #316. I'm not sure which PR introduced this bug/breaking change. Will it impact the model in ml-commons 1.3.1? Should we publish a new version like rc3?

sudiptoguha commented 2 years ago

Well, a numbering like rc2.1 is probably more appropriate. This change should not affect 1.3.1, because AFAIK the models are not being persisted (is that correct?). This issue arises because the two versions of the model use different representations for the unused part of the array; the two versions should be internally consistent. But it would be good to switch to the most recent models as soon as possible, since it will get harder to fix bugs (should they appear, and there likely is one somewhere).

ylwu-amzn commented 2 years ago

ml-commons also stores the model, the same way AD does. But ml-commons 1.3.1 uses a newer version of RCF than AD 1.3.1. Good to know that the "two versions should be internally consistent". We will follow your suggestion and move to the latest version once it's available on Maven.