aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

implement Serializable for ThresholdedRandomCutForestState #298

Closed ylwu-amzn closed 2 years ago

ylwu-amzn commented 2 years ago

Signed-off-by: Yaliang Wu <ylwu@amazon.com>

https://github.com/aws/random-cut-forest-by-aws/issues/297

Description of changes: implement Serializable for ThresholdedRandomCutForestState

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

jotok commented 2 years ago

Per the recommendation in the Javadocs for Serializable, we should explicitly define the serialVersionUID constant in each class that implements Serializable. https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/Serializable.html
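
For illustration, a minimal sketch of what that recommendation looks like in practice. `ExampleState` and its fields are hypothetical stand-ins and are not taken from ThresholdedRandomCutForestState; only the `serialVersionUID` declaration is the point here.

```java
import java.io.Serializable;

// Hypothetical stand-in for a state class; the field names are illustrative
// and do not mirror the actual ThresholdedRandomCutForestState.
public class ExampleState implements Serializable {
    // Explicit version identifier. Without it, the JVM derives a value from
    // the class structure, so an unrelated change to the class can make
    // previously serialized data fail to deserialize.
    private static final long serialVersionUID = 1L;

    private final int dimensions;
    private final double[] values;

    public ExampleState(int dimensions, double[] values) {
        this.dimensions = dimensions;
        // Defensive copy so later changes by the caller do not alter the state.
        this.values = values.clone();
    }

    public int getDimensions() {
        return dimensions;
    }

    public double[] getValues() {
        return values.clone();
    }
}
```

Declaring the constant as `private static final long` is the form the Serializable Javadoc describes; the value `1L` is an arbitrary choice for this sketch.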

jotok commented 2 years ago

Can we create an example based on the existing serialization examples that serializes and deserializes a forest and validates that the deserialized forest produces outputs that are consistent with the original forest? https://github.com/aws/random-cut-forest-by-aws/blob/main/Java/examples/src/main/java/com/amazon/randomcutforest/examples/serialization/ProtostuffExample.java
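
A rough sketch of the kind of round-trip check being asked for, using only `java.io` object streams. `RunningMeanModel` is a hypothetical stand-in for a forest so that the sketch stays self-contained; it is not the library's API, and the real example would build an actual forest instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Random;

public class RoundTripSketch {

    // Hypothetical stand-in for a stateful model (NOT a random cut forest).
    static class RunningMeanModel implements Serializable {
        private static final long serialVersionUID = 1L;
        private final double[] mean;
        private long count;

        RunningMeanModel(int dimensions) {
            this.mean = new double[dimensions];
        }

        // Scores a point by squared distance to the running mean, then updates the mean.
        double process(double[] point) {
            double score = 0.0;
            for (int i = 0; i < mean.length; i++) {
                double diff = point[i] - mean[i];
                score += diff * diff;
            }
            count++;
            for (int i = 0; i < mean.length; i++) {
                mean[i] += (point[i] - mean[i]) / count;
            }
            return score;
        }
    }

    static byte[] serialize(Serializable object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static <T> T deserialize(byte[] data, Class<T> type) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return type.cast(in.readObject());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    static double[] randomPoint(Random rng, int dimensions) {
        double[] point = new double[dimensions];
        for (int i = 0; i < dimensions; i++) {
            point[i] = rng.nextGaussian();
        }
        return point;
    }

    // Trains a model, round-trips it, then checks that the original and the
    // copy return identical scores on the same stream of new points.
    static boolean roundTripIsConsistent() {
        int dimensions = 10;
        Random rng = new Random(42L);
        RunningMeanModel original = new RunningMeanModel(dimensions);
        for (int i = 0; i < 500; i++) {
            original.process(randomPoint(rng, dimensions));
        }

        RunningMeanModel copy = deserialize(serialize(original), RunningMeanModel.class);

        for (int i = 0; i < 100; i++) {
            double[] point = randomPoint(rng, dimensions);
            if (original.process(point) != copy.process(point)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("round trip consistent: " + roundTripIsConsistent());
    }
}
```

Feeding both instances the same points after the round trip, and comparing their outputs at every step, checks that the restored state keeps evolving identically rather than only matching at the moment of deserialization.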

ylwu-amzn commented 2 years ago

Per the recommendation in the Javadocs for Serializable, we should explicitly define the serialVersionUID constant in each class that implements Serializable. https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/Serializable.html

Thanks, will add. BTW, we will evaluate other serialization libraries later and may change how serialization is done in MLCommons. For now, using ObjectOutputStream is mainly to stay consistent with the current code and to save effort so that we can unblock some work.

ylwu-amzn commented 2 years ago

existing serialization examples that serializes and deserializes a forest and validates that the deserialized forest produces outputs that are consistent wi

Sure, will add

ylwu-amzn commented 2 years ago

Could someone take a look and approve if there are no more comments? @jotok @sudiptoguha

ylwu-amzn commented 2 years ago

Ran ObjectStreamExample on Mac successfully.

dimensions = 10, numberOfTrees = 50, sampleSize = 256, precision = FLOAT_32
Object output stream size = 326809 bytes
Looks good!