evolvable, portable, efficient simulation data files

As discussed in PR #282, the raw data, sim data, and validation data files are currently pickled internal data structures. This approach is easy but tends to create files that carry unneeded data and are brittle across code changes. Also, the files can only be read by Python, and there's a security risk (small in our uses) because loading a pickle file can execute arbitrary code such as malicious shell commands. Smaller files would matter more for performance when sending them to many cells in a distributed, multi-cell model.

We ought to:

1. Design what data elements to include, what to exclude, and how to express them for longevity and portability. Reduce assumptions in the file format. Until there are specific interchange needs, a first cut should suffice.
2. Implement that using a portable, compact, efficient encoding format such as MessagePack or CBOR. That is, pick a better encoding while implementing item (1). Or start with (1) by implementing the pickle protocol one class at a time. Pickle enables but doesn't require longevity and evolution.
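To make "implementing the pickle protocol one class at a time" concrete, here is a minimal sketch of one class that controls its own pickled state. The `Metabolite` class, its fields, and the version numbers are hypothetical, made up for illustration; the real classes would differ.

```python
import pickle


class Metabolite:
    """Hypothetical class; the name and fields are illustrative only."""

    STATE_VERSION = 2

    def __init__(self, name, count, compartment):
        self.name = name
        self.count = count
        self.compartment = compartment
        self._cache = {}  # derived data that should not be persisted

    def __getstate__(self):
        # Persist only the essential fields, plus a version tag.
        return {
            "version": self.STATE_VERSION,
            "name": self.name,
            "count": self.count,
            "compartment": self.compartment,
        }

    def __setstate__(self, state):
        # Assumption for this sketch: version 1 files lacked "compartment".
        version = state.get("version", 1)
        self.name = state["name"]
        self.count = state["count"]
        self.compartment = state["compartment"] if version >= 2 else "unknown"
        self._cache = {}  # rebuilt on demand, never read from the file


original = Metabolite("ATP", 5000, "cytosol")
original._cache["expensive"] = list(range(1000))
restored = pickle.loads(pickle.dumps(original))
```

The pickled file no longer carries `_cache`, and `__setstate__` is the one place that has to know about older layouts. The same state dictionaries could later be written with a different encoding.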

What's the priority of this issue?

Encoding Format

Reading the specs for various encoding formats:

MessagePack is a good choice. It's simple, compact, and widely supported. The original spec didn't distinguish byte arrays from Unicode strings, and that's a problem for Python, but a revision fixed that, so set the optional encoder/decoder arguments to adopt the revision (in the Python msgpack package, use_bin_type=True when packing and raw=False when unpacking).
CBOR is also a good choice. It's inspired by MessagePack, fixes the string problem, and has a better extension mechanism. The spec is much longer because it answers many questions like what MIME type to use, how to prefix a magic number, recommendations for converting to/from JSON, registering new extension ID codes with IANA, a diagnostic printout format, and so on. What might matter to us are the extension types for big integers and shared values. Shared values would avoid multiple copies of each parameter name string in a file.
JSON is very portable but slower and less compact since it's a text format. The JSON spec doesn't support Infinity and NaN values but the Python json package does unless asked to be standards-compliant. JSON has a single number type, so it doesn't distinguish floats that happen to have integer values from integers. Dictionary keys must be strings.
BSON and UBJSON are interesting binary formats but apparently they're not so compact, fast, or simple because they're designed for storage and in-place overwrites. MongoDB uses BSON so the pymongo pip already in our pyenv implements it.
We probably want to stick with a schema-less format so I didn't look into Avro (FYI, Kafka uses it), Protocol Buffers, Thrift, and others.
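The JSON caveats above are easy to check with the standard library. This is only a small demonstration of the Python json module's behavior, not a proposal to use JSON; the dictionary contents are made up.

```python
import json

# By default Python's json module writes NaN/Infinity, which strict JSON forbids.
lenient = json.dumps({"rate": float("nan"), "limit": float("inf")})

# allow_nan=False enforces the standard and raises ValueError instead.
try:
    json.dumps({"rate": float("nan")}, allow_nan=False)
    strict_error = None
except ValueError as e:
    strict_error = str(e)

# Non-string dictionary keys are silently converted to strings.
round_tripped = json.loads(json.dumps({1: "one", 2.5: "x"}))

# Python's encoder writes 1.0 as "1.0" so Python can round-trip the type,
# but JSON itself has one number type and other readers may not keep it.
int_vs_float = json.dumps([1, 1.0])
```

Here `lenient` contains the bare tokens `NaN` and `Infinity`, and the keys of `round_tripped` come back as the strings `"1"` and `"2.5"`, so a round trip through JSON does not preserve the original key types.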

Either MessagePack or CBOR should be fine. The quality of their encoder/decoder implementations may be the determining factor.

CBOR has 4 different pips to choose from, which is not a plus. The cbor2 pip's documentation says the cbor pip is faster due to its C extensions, but that cbor lacks documentation and a comprehensive test suite, does not support most standard extension tags, and will segfault if passed a cyclic structure. The cbor2 pip optionally allows cyclic and shared values (off by default) via an extension type.

Parameters, Units, and Dimensions

This is the start of the contents design discussion -- item (1) above.

PR #282 proposed dictionaries (in code, but easily put in a data file or a specification doc if needed) that map names to persistent, immutable singletons of these types:

Dimensions: has a name such as "length".
Units: has a name, a Dimensions, and a conversion coefficient. The conversion coefficient must be consistent with other Units for the same Dimensions and should be relative to something like standard SI units.
Parameter: has a name, a type, and a Dimensions. The type could identify a NumPy dtype (although that restricts the data to fixed size integers and floats) and any array dimensions.

(Call these types "Dimension" and "Unit" for singular/plural clarity? E.g. a Unit has a Dimension and its conversion coefficient must be consistent with other Units with the same Dimension.)
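As a rough sketch of those three types and the name-to-singleton dictionaries, using the singular names floated above: the field names (`to_si`, `dtype`, `shape`) and all the entries are assumptions for illustration, not what PR #282 actually defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str


@dataclass(frozen=True)
class Unit:
    name: str
    dimension: Dimension
    to_si: float  # multiply a quantity in this unit by to_si to get SI


@dataclass(frozen=True)
class Parameter:
    name: str
    dtype: str    # e.g. a NumPy dtype name such as "float64"
    shape: tuple  # array dimensions; () for a scalar
    dimension: Dimension


# Hypothetical dictionary contents, keyed by name.
DIMENSIONS = {d.name: d for d in (Dimension("length"), Dimension("time"))}
UNITS = {u.name: u for u in (
    Unit("m", DIMENSIONS["length"], 1.0),
    Unit("um", DIMENSIONS["length"], 1e-6),
    Unit("s", DIMENSIONS["time"], 1.0),
    Unit("min", DIMENSIONS["time"], 60.0),
)}
PARAMETERS = {p.name: p for p in (
    Parameter("cell_length", "float64", (), DIMENSIONS["length"]),
    Parameter("doubling_time", "float64", (), DIMENSIONS["time"]),
)}
```

Frozen dataclasses give the immutability for free, and looking entries up through the dictionaries keeps each one a singleton, e.g. `UNITS["um"].dimension is DIMENSIONS["length"]`.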

Each data file would reference this collection of dictionaries by name. Changing a dictionary's contents calls for changing its name (versioning). Adding dictionary entries wouldn't require changing its name if we don't need backward compatibility of newer data files. The dictionary entries could have docstrings which wouldn't have to be persistent, i.e. they could be improved over time.
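A minimal sketch of a data file naming its dictionary collection, so a reader can fail clearly on a version it doesn't know. The names `params-v1` and `params-v2`, the file layout, and both helper functions are hypothetical, and JSON stands in for whichever encoding gets picked.

```python
import json

# Hypothetical registry of known dictionary collections, keyed by versioned name.
KNOWN_DICTIONARIES = {
    "params-v1": {"parameters": ["cell_length"]},
    "params-v2": {"parameters": ["cell_length", "doubling_time"]},
}


def dump_data(dictionary_name, values):
    """Serialize values along with the name of the dictionaries they refer to."""
    return json.dumps({"dictionary": dictionary_name, "values": values})


def load_data(text):
    """Return (dictionaries, values), failing clearly on an unknown name."""
    doc = json.loads(text)
    name = doc["dictionary"]
    if name not in KNOWN_DICTIONARIES:
        raise LookupError(f"unknown dictionary collection: {name!r}")
    return KNOWN_DICTIONARIES[name], doc["values"]


text = dump_data("params-v1", {"cell_length": [2.0, "um"]})
dictionaries, values = load_data(text)
```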

Each Value in a data file has a quantity and names a Parameter and a Units. The quantity must match the Parameter's type and the Units must match the Parameter's Dimensions.

(A more compact representation would use a persistent index number for each Dimensions, Units, and Parameter.)

In working memory, the code can request a Value converted to a desired target Units, label a Value's Units for human readers, and handle math operations correctly by converting between mixed Units and rejecting incompatible Dimensions.
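The in-memory behavior could look roughly like this. It's a sketch under simplifying assumptions: the Dimension is a plain string, only scalar quantities and addition are handled, and the Parameter type check is left out. The method name `to` and the `to_si` field are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    dimension: str  # dimension name, kept as a plain string for brevity
    to_si: float    # multiply a quantity in this unit by to_si to get SI


@dataclass(frozen=True)
class Value:
    quantity: float
    unit: Unit

    def to(self, target):
        """Return this Value converted to the target Unit."""
        if target.dimension != self.unit.dimension:
            raise ValueError(
                f"cannot convert {self.unit.dimension} to {target.dimension}")
        return Value(self.quantity * self.unit.to_si / target.to_si, target)

    def __add__(self, other):
        if not isinstance(other, Value):
            return NotImplemented
        # Convert the right operand into the left operand's Unit first;
        # to() rejects incompatible Dimensions.
        return Value(self.quantity + other.to(self.unit).quantity, self.unit)

    def __str__(self):
        # Label the quantity with its Unit for human readers.
        return f"{self.quantity:g} {self.unit.name}"


MINUTE = Unit("min", "time", 60.0)
SECOND = Unit("s", "time", 1.0)
METER = Unit("m", "length", 1.0)

total = Value(2.0, MINUTE) + Value(30.0, SECOND)
total_in_seconds = total.to(SECOND)
```

Adding `Value(1.0, METER)` to `total` raises `ValueError` because the Dimensions differ. An existing units library might cover all of this already; the sketch only shows the behavior being asked for.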

CovertLab / wcEcoli

evolvable, portable, efficient simulation data files #301