Description

kiara needs to be able to serialize every data set (independent of how complex it is: a single int, an Arrow table, a PDF file, ...) into a (preferably) binary array that can be saved to (and subsequently read from) disk, as well as transferred/streamed to a frontend or compute node.

In the first iteration of Lumy, JSON was used for this purpose, which in my view is sub-optimal. For one, it wastes memory, bandwidth and CPU resources. More importantly, it makes it very hard to support more dataset types in the future, most of which will be binary in some way. That would leave baseXX-encoding those datasets as our only option, which is a bad idea for several reasons.

Once we settle on a serialization format, we need to implement one kiara module per supported data type that takes an item of that type as input and produces a binary array as output (maybe along with schema information about the serialized item). We also need one module per data type that takes the serialized data as input and re-assembles it into a Python object.
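To make the idea of a serialize/deserialize pair per data type more concrete, here is a minimal sketch in plain Python. It is not kiara's module API: `Serialized`, `CODECS`, `serialize` and `deserialize` are hypothetical names, and `struct`/UTF-8 only stand in for whichever serialization format we end up choosing.

```python
import struct
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Serialized:
    """Hypothetical container: raw bytes plus minimal schema information."""

    type_name: str
    payload: bytes


# Hypothetical registry: one (serialize, deserialize) pair per data type.
# struct and UTF-8 are placeholders for the real serialization format.
CODECS: Dict[str, Tuple[Callable[[Any], bytes], Callable[[bytes], Any]]] = {
    "integer": (
        lambda value: struct.pack("<q", value),  # 8-byte little-endian signed int
        lambda data: struct.unpack("<q", data)[0],
    ),
    "string": (
        lambda value: value.encode("utf-8"),
        lambda data: data.decode("utf-8"),
    ),
}


def serialize(type_name: str, value: Any) -> Serialized:
    """Turn a Python object of a registered type into a binary array."""
    to_bytes, _ = CODECS[type_name]
    return Serialized(type_name=type_name, payload=to_bytes(value))


def deserialize(item: Serialized) -> Any:
    """Re-assemble the Python object from its serialized form."""
    _, from_bytes = CODECS[item.type_name]
    return from_bytes(item.payload)
```

Supporting a new data type would then only mean registering one more pair; the code that stores or transfers the resulting bytes would not need to change.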

In addition, we need to create a 'parent-serialization' kiara module that can save any of those resulting binary arrays to disk (which should be trivial). This will be used internally by the kiara datastore as well as the module result cache to persist any of the datasets involved.
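A rough sketch of what persisting such a binary array could look like, again in plain Python rather than as a kiara module. The file layout (a small header carrying the type name, followed by the payload) and the names `write_serialized` and `read_serialized` are assumptions made for illustration only.

```python
import struct
from pathlib import Path
from typing import Tuple


def write_serialized(path: Path, type_name: str, payload: bytes) -> None:
    """Write a serialized dataset to disk (hypothetical layout)."""
    name = type_name.encode("utf-8")
    # Header: 2-byte little-endian length of the type name, then the name itself.
    path.write_bytes(struct.pack("<H", len(name)) + name + payload)


def read_serialized(path: Path) -> Tuple[str, bytes]:
    """Read back the type name and the raw payload written by write_serialized."""
    raw = path.read_bytes()
    (name_len,) = struct.unpack_from("<H", raw, 0)
    type_name = raw[2 : 2 + name_len].decode("utf-8")
    return type_name, raw[2 + name_len :]
```

Because the function only sees bytes, the same code path could serve both the datastore and the module result cache, whatever the underlying data type is.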

Task list:

[x] investigate serialization formats and check their fitness for our purpose (Avro, Protobuf, Cap'n Proto, FlatBuffers, ...)
[x] implement generic serialization modules for each of the core data types
[x] create a module to write serialized datasets to disk
[x] create a module to read serialized datasets from disk

DHARPA-Project / kiara

Data serialization #16
