Standard serialization/deserialization API across all data-structures

timoxley commented 7 years ago

Currently one can serialize mnemonist data-structures with .toJSON but there does not appear to be a standard way to deserialize. I'd like to be able to cache mnemonist structures in the browser or to send pre-processed structures to the client over a network.

To achieve serialization/deserialization at the moment, one has to write custom functions which often need to re-iterate over the entire data set. Re-iterating may be prohibitively expensive for large structures i.e. this sucks most for the exact use-cases where mnemonist would be most useful.

e.g. there should be a way to do something like:

dest.fromJSON(src.toJSON())

and ideally the deserialization process would need to do minimal reprocessing, it would basically just dump the data into place, something like:

dest.root = src.toJSON()

For example, I'd hoped this exact thing would work for Trie, except that toJSON loses the size information, and if you added .size to the structure produced by toJSON, you'd potentially break any 3rd party code consuming the current toJSON format.
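To make the size problem concrete, here is a minimal sketch using a hypothetical `ToyTrie` (not mnemonist's actual Trie; the `END` sentinel and all method bodies are assumptions made for illustration). Its `toJSON` returns only the root, so dumping the root into a fresh instance restores lookups but leaves the count behind:

```javascript
// Hypothetical toy trie for illustration only; not mnemonist's implementation.
const END = '\0'; // assumed sentinel key marking the end of a stored word

class ToyTrie {
  constructor() {
    this.root = {};
    this.size = 0;
  }

  add(word) {
    let node = this.root;
    for (const ch of word) {
      node = node[ch] || (node[ch] = {});
    }
    if (!node[END]) {
      node[END] = true;
      this.size++;
    }
    return this;
  }

  has(word) {
    let node = this.root;
    for (const ch of word) {
      node = node[ch];
      if (!node) return false;
    }
    return node[END] === true;
  }

  // Only the root is exposed, so the size is not part of the output.
  toJSON() {
    return this.root;
  }
}

const srcTrie = new ToyTrie().add('roman').add('romanesque');

// "Dump the data into place": no re-insertion of the words.
const destTrie = new ToyTrie();
destTrie.root = JSON.parse(JSON.stringify(srcTrie));

console.log(destTrie.has('roman')); // true: membership survives
console.log(destTrie.size);         // 0: the count was never serialized
```

Restoring the count here would mean either walking the whole structure again or carrying the size alongside the root.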

Therefore, you should probably use something other than toJSON and instead create a new API pair, e.g. serialize/deserialize, which produces/consumes a representation whose structure users would consider opaque, because:

the best encoding for fast serialization/deserialization may not be JSON (e.g. bloom filters) and
you want to be able to have the freedom to change the serialization format so you're not stuck working with some inefficient representation simply because you want to avoid breaking the public toJSON API.

Perhaps serialize would just generate JSON or a JS object for now, but you don't want to be locked into that, nor into the structure it produces.
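One way to picture an opaque format is a small versioned envelope. This is only a sketch: the `format`, `version` and `payload` field names, and the free functions `serialize`/`deserialize`, are assumptions for illustration and not an existing mnemonist API. The point is that callers only ever pass the blob around, so the payload layout can change behind a version bump:

```javascript
// Hypothetical envelope; field names are illustrative assumptions.
const FORMAT = 'toy-trie';
const VERSION = 1;

function serialize(structure) {
  // JSON for now, but callers should treat the result as an opaque blob.
  return JSON.stringify({
    format: FORMAT,
    version: VERSION,
    payload: {root: structure.root, size: structure.size}
  });
}

function deserialize(blob) {
  const envelope = JSON.parse(blob);
  if (envelope.format !== FORMAT || envelope.version !== VERSION) {
    throw new Error('Unsupported serialization format');
  }
  // No re-insertion: the stored state is put back as-is.
  return {root: envelope.payload.root, size: envelope.payload.size};
}

const original = {root: {a: {b: {}}}, size: 1};
const restored = deserialize(serialize(original));
console.log(restored.size); // 1
```

Because the version is checked on the way in, a later release could switch the payload to something more compact without touching the public toJSON output.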

Related to #28

timoxley commented 7 years ago

in #28 @Yomguithereal asks:

The question I am pondering before implementing this is whether this should be an instance or a static method & if this is an instance method, what should it do if the structure has already been fed some data? We just add serialized data? We clear then add serialized data?

My suggestion is to make it static and only for creating new instances, like .from. Figuring out how to diff/union with existing data would be fantastically useful, but is perhaps a separate issue. The problem at hand is that there's currently a high cost + custom code required to use mnemonist structures outside of the current process, e.g. restoring from disk/db or sending over the network.
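A rough sketch of what "static, new instances only" could look like, using a hypothetical `ToyMultiSet` (the class and its fields are made up for illustration). Since the static method always builds a fresh instance, the "add or clear first?" question never comes up:

```javascript
// Hypothetical toy structure for illustration only.
class ToyMultiSet {
  constructor() {
    this.counts = new Map();
    this.size = 0;
  }

  add(item) {
    this.counts.set(item, (this.counts.get(item) || 0) + 1);
    this.size++;
    return this;
  }

  count(item) {
    return this.counts.get(item) || 0;
  }

  toJSON() {
    return {size: this.size, counts: Array.from(this.counts)};
  }

  // Static factory: there is no pre-existing state to merge with or clear.
  static fromJSON(json) {
    const set = new ToyMultiSet();
    set.counts = new Map(json.counts);
    set.size = json.size;
    return set;
  }
}

const bag = new ToyMultiSet().add('a').add('a').add('b');
const bagCopy = ToyMultiSet.fromJSON(JSON.parse(JSON.stringify(bag)));

console.log(bagCopy.count('a')); // 2
console.log(bagCopy.size);       // 3
```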

Yomguithereal commented 7 years ago

As a first step, I think I will go with a symmetric static .fromJSON method, which should address most cases (Bloom filters may very well be serialized to JSON as an array, even if this is more costly than the byte array representation). I will leave serialize etc. open because, as you said, it leaves room for more complex and efficient serialization strategies.

What we can do here, as a starter, is make a list of the different structures and see 1) whether a .fromJSON can work & 2) whether we can imagine better serialization schemes.

Yomguithereal commented 7 years ago

BKTree & VPTree: the distance function cannot be serialized, so it is left to the user to supply the exact same distance, or things will fail. Otherwise we just need to serialize the root. Serializing the tree as a flat instance is possible.
Trie: we need to track size + there are many efficient ways to serialize tries (DAG etc.). But I would probably need to code the RadixTree before this.
BloomFilter: we need to keep the number of hash functions to be used.
Heaps & LinkedList: I am not sure I see the point of serializing heaps. Representing them as arrays might be easier.
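For the BloomFilter point, a small sketch of why the number of hash functions has to travel with the bits. `ToyBloomFilter`, its one-byte-per-bit storage and its seeded FNV-1a-style hash are all assumptions picked for brevity, not mnemonist's implementation:

```javascript
// Hypothetical toy Bloom filter for illustration only.
class ToyBloomFilter {
  constructor(bitCount, hashCount) {
    this.bits = new Uint8Array(bitCount); // one byte per bit, for simplicity
    this.hashCount = hashCount;
  }

  // Seeded FNV-1a-style string hash (an assumption chosen for brevity).
  indexFor(item, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < item.length; i++) {
      h ^= item.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h % this.bits.length;
  }

  add(item) {
    for (let s = 0; s < this.hashCount; s++) {
      this.bits[this.indexFor(item, s)] = 1;
    }
    return this;
  }

  test(item) {
    for (let s = 0; s < this.hashCount; s++) {
      if (!this.bits[this.indexFor(item, s)]) return false;
    }
    return true;
  }

  // The bits alone are not enough: lookups need the same hashCount.
  toJSON() {
    return {hashCount: this.hashCount, bits: Array.from(this.bits)};
  }

  static fromJSON(json) {
    const filter = new ToyBloomFilter(json.bits.length, json.hashCount);
    filter.bits.set(json.bits); // copy the bits back without re-adding items
    return filter;
  }
}

const seen = new ToyBloomFilter(64, 3).add('hello').add('world');
const seenCopy = ToyBloomFilter.fromJSON(JSON.parse(JSON.stringify(seen)));

console.log(seenCopy.test('hello')); // true
```

The plain array is the costly JSON form mentioned earlier; a byte-oriented encoding would be smaller but is outside what JSON can carry directly.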

Yomguithereal commented 7 years ago

@timoxley @GeoffreyPlitt what's your opinion on this?

timoxley commented 7 years ago

Sounds like a plan 🍾

Yomguithereal commented 7 years ago

I will therefore modify some of the existing toJSON methods to take this into account.

Concerning the BKTree etc. I guess the signature will be the following:

BKTree.fromJSON(distance, json);
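A sketch of how that signature might behave, using a hypothetical `ToyBKTree` (the node layout, the `size` field in the JSON and the `search` method are assumptions for illustration, not the library's actual code). The distance function is passed back in by the caller because it cannot be part of the JSON:

```javascript
// Hypothetical toy BK-tree for illustration only.
class ToyBKTree {
  constructor(distance) {
    this.distance = distance; // must return non-negative integers
    this.root = null;
    this.size = 0;
  }

  add(item) {
    if (!this.root) {
      this.root = {item, children: {}};
      this.size++;
      return this;
    }
    let node = this.root;
    for (;;) {
      const d = this.distance(item, node.item);
      if (d === 0) return this; // already present
      if (!node.children[d]) {
        node.children[d] = {item, children: {}};
        this.size++;
        return this;
      }
      node = node.children[d];
    }
  }

  // Returns every stored item within `tolerance` of `query`.
  search(query, tolerance) {
    const found = [];
    const stack = this.root ? [this.root] : [];
    while (stack.length) {
      const node = stack.pop();
      const d = this.distance(query, node.item);
      if (d <= tolerance) found.push(node.item);
      for (const key of Object.keys(node.children)) {
        const k = Number(key);
        if (k >= d - tolerance && k <= d + tolerance) {
          stack.push(node.children[key]);
        }
      }
    }
    return found;
  }

  toJSON() {
    return {size: this.size, root: this.root};
  }

  // The caller must supply the same distance used to build the tree.
  static fromJSON(distance, json) {
    const tree = new ToyBKTree(distance);
    tree.root = json.root;
    tree.size = json.size;
    return tree;
  }
}

const absDiff = (a, b) => Math.abs(a - b);
const tree = new ToyBKTree(absDiff);
[10, 12, 15, 20, 21].forEach(n => tree.add(n));

const treeCopy = ToyBKTree.fromJSON(absDiff, JSON.parse(JSON.stringify(tree)));
console.log(treeCopy.search(11, 1).sort((a, b) => a - b)); // [10, 12]
```

Passing a different distance to `fromJSON` would not throw in this sketch; it would just make searches prune the wrong branches, which is why the burden falls on the user.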

Yomguithereal / mnemonist

Standard serialization/deserialization API across all data-structures #37