aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0
206 stars 33 forks source link

revert: implement Serializable for ThresholdedRandomCutForestState #306

Closed ylwu-amzn closed 2 years ago

ylwu-amzn commented 2 years ago

Signed-off-by: Yaliang Wu ylwu@amazon.com

AD is using protostuff to serialize/deserialize RCF model. With the change "implement Serializable for ThresholdedRandomCutForestState", AD can't deserialize old models. So we have to revert this change. At the same time, we will use the same way of AD to serialize/deserialize RCF in ml-commons.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

ylwu-amzn commented 2 years ago

Tested every commit after PR #282 (AD is using this commit in 1.3.0), after PR https://github.com/aws/random-cut-forest-by-aws/pull/284 , RCF code can't deserialize 1.3.0 AD RCF model any more. This is different with yesterday's test result that RCF commit https://github.com/aws/random-cut-forest-by-aws/commit/2fedfeac8c325b7490594462aa6979659817f9f4 can deserialize AD 1.3.0 model. Will confirm with @amitgalitz to learn why the test results are different. But we still prefer to revert the PR #282 to not add any overhead about AD BWC. And we plan to use the same way of AD to serialize/deserialize RCF model. That will make our future BWC support easier. But cons is we have 2 serialization ways in ml-commons. Check PR https://github.com/opensearch-project/ml-commons/pull/251

Check more details on this issue https://github.com/aws/random-cut-forest-by-aws/issues/305#issuecomment-1079360251

To test the impact of my PR https://github.com/aws/random-cut-forest-by-aws/pull/298 , I applied it to commit https://github.com/aws/random-cut-forest-by-aws/commit/3bb8fcee46507ba83e6ce10432df45254654c363 (PR https://github.com/aws/random-cut-forest-by-aws/pull/283) which used by AD 1.3.0, check my branch https://github.com/ylwu-amzn/random-cut-forest-by-aws/tree/test_ad_protostuff, commit https://github.com/ylwu-amzn/random-cut-forest-by-aws/commit/bc7944eca8d24f008352c57dbe199e47e52958cb. Test locally, it can parse AD 1.3.0 model successfully.

ylwu-amzn commented 2 years ago

As explained above, PR #298 doesn't impact the deserialization of old AD model. And Sudipto fixed the backward compatibility issue in PR #309. So seems no need to revert the PR #298. Will close this PR for now. If we find PR #298 impacts any part, will revert then.