aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

Rust serialization #380

Open acpeakhour opened 1 year ago

acpeakhour commented 1 year ago

Hi,

In a local branch I've added serialization using serde. One issue I came across is in nodestore.rs.

In the VectorNodeStore struct there is:

project_to_tree: fn(Vec<f32>) -> Vec<f32>,

which can't be serialized directly, so I had to skip that field and use a default initialiser.
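For illustration, here is a minimal, dependency-free sketch of the "skip the field and restore a default" pattern. With serde, the same idea is usually expressed by putting `#[serde(skip)]` together with `#[serde(default = "...")]` on the field. Only the `project_to_tree` field and its type come from nodestore.rs as quoted above. The `point_store` field, `identity_projection`, `to_bytes`, and `from_bytes` are hypothetical names, and the byte layout is an assumption rather than the library's format.

```rust
// Hypothetical, simplified stand-in for the real store. Only the
// `project_to_tree` field and its type are taken from the thread;
// `point_store` and all functions below are assumptions.
pub struct VectorNodeStore {
    pub point_store: Vec<f32>,
    pub project_to_tree: fn(Vec<f32>) -> Vec<f32>,
}

// Default projection: returns the point unchanged.
pub fn identity_projection(point: Vec<f32>) -> Vec<f32> {
    point
}

impl VectorNodeStore {
    // Serialize only the plain data. The function pointer is skipped
    // because it has no meaningful byte representation.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(8 + 4 * self.point_store.len());
        out.extend_from_slice(&(self.point_store.len() as u64).to_le_bytes());
        for value in &self.point_store {
            out.extend_from_slice(&value.to_le_bytes());
        }
        out
    }

    // Rebuild the store and re-attach the default projection.
    // Returns None if the input is truncated or has trailing bytes.
    pub fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < 8 {
            return None;
        }
        let mut len_buf = [0u8; 8];
        len_buf.copy_from_slice(&bytes[..8]);
        let len = u64::from_le_bytes(len_buf) as usize;
        let payload = &bytes[8..];
        if payload.len() != len.checked_mul(4)? {
            return None;
        }
        let point_store = payload
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Some(VectorNodeStore {
            point_store,
            project_to_tree: identity_projection,
        })
    }
}
```

The limitation is the one raised in this issue: a store that was built with a different projection silently gets the default one back after a round trip.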

It would be great to have the serde annotations directly in the lib, and I am wondering as to your thoughts on how to address this issue.

sudiptoguha commented 1 year ago

Re "It would be great to have the serde annotations directly in the lib, and I am wondering as to your thoughts on how to address this issue." -- it's easy :) This is an Apache 2.0 project and contributions are welcome. But if for some reason that is not desirable, then that is ok too; we do plan to get to it at some point.

One of the lessons of the RCF journey was that serialization (or, more generally, how a model is consumed) affects every notion of algorithmic complexity. If the model is deserialized and serialized on every input, that defines a different workload from sporadic ser-de. Ser-de is necessary for consumption, and it has two aspects: (i) performance and (ii) interoperability. Interoperability can be language aware or language agnostic. I think protobuf is an example of the first (my knowledge in this regard is limited), while text/JSON is language agnostic. Having a few representative serializations is sufficient; in the Java version we ended up trying just ProtoStuff and JSON/Jackson. The remainder of the effort can go to enabling features like project_to_tree (which is idempotent/trivial at the moment) :) One nice potential benefit of protobuf is that the same models could be passed between a Java and a Rust environment. I have used both of those environments simultaneously myself to debug.

Caveats: as newer usages appear, it is possible that the basic RCF will need an upgrade or re-orientation (for example, as soon as project_to_tree becomes a nontrivial projection). But if the ser-de object has a version string, then all of these are solvable. Testing serialization has also been an unclear area.
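As a rough sketch of the version-string idea, the serialized model can be wrapped in a small header that the reader checks before decoding anything else. Everything here is hypothetical: the `RCFM` tag, the version numbers, and the function names are illustrative assumptions, not part of the library or of any existing RCF format.

```rust
// Hypothetical format tag and version number for illustration only.
const MAGIC: &[u8; 4] = b"RCFM";
pub const CURRENT_VERSION: u16 = 1;

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    TooShort,
    BadMagic,
    UnsupportedVersion(u16),
}

// Prefix an already-serialized model with the tag and the current version.
pub fn wrap_with_version(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(6 + payload.len());
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&CURRENT_VERSION.to_le_bytes());
    out.extend_from_slice(payload);
    out
}

// Validate the header and return the version together with the payload,
// so the caller can pick the decoder that matches that version.
pub fn unwrap_versioned(bytes: &[u8]) -> Result<(u16, &[u8]), DecodeError> {
    if bytes.len() < 6 {
        return Err(DecodeError::TooShort);
    }
    if bytes[..4] != MAGIC[..] {
        return Err(DecodeError::BadMagic);
    }
    let version = u16::from_le_bytes([bytes[4], bytes[5]]);
    if version == 0 || version > CURRENT_VERSION {
        return Err(DecodeError::UnsupportedVersion(version));
    }
    Ok((version, &bytes[6..]))
}
```

With a header like this, a later format that stores a nontrivial projection could bump the version, and older readers would reject it explicitly instead of misreading the payload.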

acpeakhour commented 1 year ago

Our use case is to persist the model between restarts of the application, and serde did work for this case. However, as you noted, it isn't exactly fast.