aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

Rust serialization #380

Open acpeakhour opened 1 year ago

acpeakhour commented 1 year ago

Hi,

In a local branch I've added serialization using serde. One issue I came across is in nodestore.rs.

In the VectorNodeStore struct there is:

project_to_tree: fn(Vec<f32>) -> Vec<f32>,

which can't be serialized directly and I had to resort to skipping that field and using a default initialiser.
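A minimal sketch of the "skip the field, restore a default" approach described above. It uses only the standard library so that it runs without extra crates; with serde, the usual equivalent is to annotate the field with `#[serde(skip, default = "...")]` naming a function that returns the default. The struct below is a hypothetical stand-in, not the real `VectorNodeStore` from nodestore.rs: the `dimensions` and `cut_values` fields, the byte layout, and the names `identity_projection`, `to_bytes` and `from_bytes` are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for the node store; only `project_to_tree`
/// comes from the thread, the other fields are illustrative.
pub struct VectorNodeStore {
    pub dimensions: usize,
    pub cut_values: Vec<f32>,
    /// A function pointer holds no data worth persisting, so it is never written out.
    pub project_to_tree: fn(Vec<f32>) -> Vec<f32>,
}

/// Assumed default initialiser: the identity projection.
pub fn identity_projection(point: Vec<f32>) -> Vec<f32> {
    point
}

/// Reads a little-endian u64 at `offset`, or returns None if the input is too short.
fn read_u64(bytes: &[u8], offset: usize) -> Option<u64> {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes.get(offset..offset.checked_add(8)?)?);
    Some(u64::from_le_bytes(buf))
}

impl VectorNodeStore {
    /// Writes every field except `project_to_tree`.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(16 + 4 * self.cut_values.len());
        out.extend_from_slice(&(self.dimensions as u64).to_le_bytes());
        out.extend_from_slice(&(self.cut_values.len() as u64).to_le_bytes());
        for value in &self.cut_values {
            out.extend_from_slice(&value.to_le_bytes());
        }
        out
    }

    /// Rebuilds the store and re-installs the default projection for the skipped field.
    pub fn from_bytes(bytes: &[u8]) -> Option<Self> {
        let dimensions = read_u64(bytes, 0)? as usize;
        let len = read_u64(bytes, 8)? as usize;
        let body = bytes.get(16..)?;
        if body.len() != len.checked_mul(4)? {
            return None;
        }
        let cut_values = body
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Some(Self {
            dimensions,
            cut_values,
            project_to_tree: identity_projection,
        })
    }
}
```

The trade-off is that any custom projection is lost on a round trip: a store saved with a non-default `project_to_tree` comes back with the identity projection, which is only safe while the projection is trivial.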

It would be great to have the serde annotations directly in the lib, and I am wondering as to your thoughts on how to address this issue.

sudiptoguha commented 1 year ago

Re "It would be great to have the serde annotations directly in the lib, and I am wondering as to your thoughts on how to address this issue." -- it's easy :) This is an Apache 2.0 project and contributions are welcome. But if for some reason that is not desirable, then that is ok too - we do plan to get to it in some time.

One of the lessons of the RCF journey was that serialization (or how a model is consumed) impacts all notions of algorithmic complexity. If the model is deserialized and serialized on every input, then that defines a workload different from sporadic ser-de. Ser-de is necessary for consumption. Now there can be two aspects: (i) performance and (ii) interoperability. Interoperability can be language aware or language agnostic. I think protobuf is an example of the first (and my knowledge in this regard is limited) and text/JSON is language agnostic. Having a few representative serializations is sufficient; in the Java version we ended up just trying ProtoStuff and JSON/Jackson. The remainder of the effort can go to enabling features like project_to_tree (which is idempotent/trivial at the moment) :) One nice potential thing about protobuf is that we could have the same models being passed around between a Java and a Rust environment. I have myself used both of those environments simultaneously to debug.

Caveats: As newer usages happen, it is possible that the basic RCF needs an upgrade/re-orientation (for example, as soon as project_to_tree becomes a nontrivial projection). But if the ser-de object has a version string then all of these are solvable. Testing serialization has also been an unclear area.
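One way the version idea above could look in practice is sketched here: a small envelope that prefixes the serialized model with a format version and refuses payloads it does not recognise. Everything in it is an assumption for illustration, including the names `FORMAT_VERSION`, `wrap`, `unwrap_payload` and `DecodeError`, and the use of a 4-byte integer rather than a string; none of it is taken from the library.

```rust
/// Assumed current format version; bump it when the model layout changes.
pub const FORMAT_VERSION: u32 = 1;

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    /// Input is too short to contain the version header.
    Truncated,
    /// Header was read, but this build does not know that version.
    UnsupportedVersion(u32),
}

/// Prefixes an already-serialized model with the little-endian format version.
pub fn wrap(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + payload.len());
    out.extend_from_slice(&FORMAT_VERSION.to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Checks the version header and returns the payload that follows it.
pub fn unwrap_payload(bytes: &[u8]) -> Result<&[u8], DecodeError> {
    if bytes.len() < 4 {
        return Err(DecodeError::Truncated);
    }
    let (head, payload) = bytes.split_at(4);
    let version = u32::from_le_bytes([head[0], head[1], head[2], head[3]]);
    match version {
        FORMAT_VERSION => Ok(payload),
        other => Err(DecodeError::UnsupportedVersion(other)),
    }
}
```

A later release that changes the layout could add match arms for older versions and migrate them, instead of rejecting them outright.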

acpeakhour commented 10 months ago

Our use case is to persist the model between restarts of the application. serde did work for this case; however, as you noted, it isn't exactly fast.