CovertLab / wcEcoli

Whole Cell Model of E. coli

evolvable, portable, efficient simulation data files #301

Open 1fish2 opened 6 years ago

1fish2 commented 6 years ago

As discussed in PR #282, the raw data, sim data, and validation data files are currently pickled internal data structures. This approach is easy but tends to create files that carry unneeded data and are brittle across code changes. Also, only Python can read the files, and there's a security risk (small in our uses) because loading a malicious pickle file can execute arbitrary code, including shell commands. Making the files smaller could have a bigger performance impact when sending them to many cells in a distributed, multi-cell model.
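
To illustrate the security point, here is a minimal sketch of how a pickle file can make the reader call a function chosen by whoever wrote the file. The `Payload` class and `load_untrusted` helper are hypothetical, and the call here is the harmless `os.getcwd`; a hostile file could name a shell-executing function instead.

```python
import os
import pickle


class Payload:
    """Hypothetical object whose pickle names a callable to run at load time."""

    def __reduce__(self):
        # pickle records "call os.getcwd()" instead of this object's fields.
        return (os.getcwd, ())


def load_untrusted(data):
    # pickle.loads() invokes whatever callable the byte stream names.
    return pickle.loads(data)


blob = pickle.dumps(Payload())
result = load_untrusted(blob)  # the working directory string, not a Payload
```

A data-only encoding avoids this class of problem because decoding never calls code named by the file.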

We ought to:

  1. Design what data elements to include, what to exclude, and how to express them for longevity and portability. Reduce assumptions in the file format. Until there are specific interchange needs, a first cut should suffice.
  2. Implement that using a portable, compact, efficient encoding format such as MessagePack or CBOR. That is, pick a better encoding while implementing item (1). Or start with (1) by implementing the pickle protocol one class at a time. Pickle enables but doesn't require longevity and evolution.
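
As a rough sketch of the "pickle protocol one class at a time" option, a class can choose what it persists via `__getstate__` and accept older layouts in `__setstate__`. Everything named here (`SimDataSketch`, the `"version"` and `"mass_table"` keys) is hypothetical and not taken from the wcEcoli code.

```python
import pickle


class SimDataSketch:
    """Hypothetical stand-in for one persisted class."""

    STATE_VERSION = 2

    def __init__(self, masses):
        self.masses = dict(masses)
        self._cache = {}  # derived data that can be rebuilt, so not persisted

    def total_mass(self):
        if "total" not in self._cache:
            self._cache["total"] = sum(self.masses.values())
        return self._cache["total"]

    def __getstate__(self):
        # Persist only the designed fields plus a version tag.
        return {"version": self.STATE_VERSION, "masses": self.masses}

    def __setstate__(self, state):
        # Assumed history: version 1 files kept the data under "mass_table".
        if state.get("version", 1) == 1:
            masses = state["mass_table"]
        else:
            masses = state["masses"]
        self.masses = dict(masses)
        self._cache = {}


def round_trip(obj):
    """Serialize and reload an object through pickle."""
    return pickle.loads(pickle.dumps(obj))
```

The same two methods define the data design for item (1), so switching the byte encoding later would mostly mean serializing the `__getstate__` dictionary with a different library.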

What's the priority of this issue?

Encoding Format

Reading the specs for various encoding formats:

Either MessagePack or CBOR should be fine. The quality of their encoder/decoder implementations may be the determining factor.

CBOR has 4 different pips to choose from, which is not a plus. The cbor2 pip says the cbor pip is faster due to its C extensions but it lacks documentation and a comprehensive test suite, does not support most standard extension tags, and will segfault if passed a cyclic structure. The cbor2 pip optionally allows cyclic and shared values (off by default) via an extension type.

Parameters, Units, and Dimensions

This is the start of the contents design discussion -- item (1) above.

PR #282 proposed dictionaries (in code, but easily put in a data file or a specification doc if needed) that map names to persistent, immutable singletons of these types:

(Call these types "Dimension" and "Unit" for singular/plural clarity? E.g. a Unit has a Dimension and its conversion coefficient must be consistent with other Units with the same Dimension.)

Each data file would reference this collection of dictionaries by name. Changing dictionary contents calls for changing its name (versioning). Adding dictionary entries wouldn't require changing its name if we don't need backward compatibility of newer data files. The dictionary entries could have docstrings which wouldn't have to be persistent, i.e. they could be improved over time.
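
A minimal sketch of referencing a dictionary collection by name, assuming a file header with a `"dictionary"` field. The names `units-v1` and `units-v2`, the header layout, and `resolve_dictionary` are all made up for illustration.

```python
from types import MappingProxyType

# Hypothetical registry: each published name maps to a read-only dictionary.
# Changing existing entries means publishing a new name, not editing in place.
DICTIONARIES = {
    "units-v1": MappingProxyType({"g": "mass", "s": "time"}),
    "units-v2": MappingProxyType({"g": "mass", "s": "time", "mol": "amount"}),
}


def resolve_dictionary(file_header):
    """Return the dictionary a data file names, or fail loudly if unknown."""
    name = file_header["dictionary"]
    try:
        return DICTIONARIES[name]
    except KeyError:
        raise LookupError(f"data file needs unknown dictionary {name!r}") from None


units = resolve_dictionary({"dictionary": "units-v1"})
```

Older files keep resolving against `units-v1` even after `units-v2` exists, which is the point of versioning by name.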

Each Value in a data file has a quantity and names a Parameter and a Units. The quantity must match the Parameter's type and the Units must match the Parameter's Dimensions.
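
One way to sketch those two checks, assuming simple frozen dataclasses. The field names, the `cell_dry_mass` Parameter, and the sample quantity are illustrative assumptions, not the types from PR #282.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimensions:
    name: str


@dataclass(frozen=True)
class Units:
    name: str
    dimensions: Dimensions


@dataclass(frozen=True)
class Parameter:
    name: str
    value_type: type
    dimensions: Dimensions


@dataclass(frozen=True)
class Value:
    quantity: object
    parameter: Parameter
    units: Units

    def __post_init__(self):
        # The quantity must match the Parameter's type.
        if not isinstance(self.quantity, self.parameter.value_type):
            raise TypeError(f"{self.parameter.name} expects "
                            f"{self.parameter.value_type.__name__}")
        # The Units must match the Parameter's Dimensions.
        if self.units.dimensions != self.parameter.dimensions:
            raise ValueError(f"{self.units.name} does not measure "
                             f"{self.parameter.dimensions.name}")


MASS = Dimensions("mass")
TIME = Dimensions("time")
GRAM = Units("g", MASS)
SECOND = Units("s", TIME)
CELL_DRY_MASS = Parameter("cell_dry_mass", float, MASS)  # hypothetical Parameter

value = Value(1.5, CELL_DRY_MASS, GRAM)  # arbitrary quantity; passes both checks
```

Constructing `Value(1.5, CELL_DRY_MASS, SECOND)` raises `ValueError`, and a string quantity raises `TypeError`, so a bad file fails at load time rather than later in the simulation.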

(A more compact representation would use a persistent index number for each Dimensions, Units, and Parameter.)

In working memory, the code can request a Value converted to a desired target Units, can label a Value's Units for human readers, and handles correctly mixing Units and rejecting incompatible Dimensions when performing math operations.
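
A small sketch of that in-memory behavior, limited to conversion, labeling, and addition. The `Quantity` class and its `factor` convention are assumptions for illustration; multiplication and division, which would produce compound Dimensions, are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Units:
    name: str
    dimensions: str
    factor: float  # size of one of these Units in the base Units of its Dimensions


@dataclass(frozen=True)
class Quantity:
    magnitude: float
    units: Units

    def to(self, target):
        """Convert to target Units, rejecting incompatible Dimensions."""
        if target.dimensions != self.units.dimensions:
            raise ValueError(f"cannot convert {self.units.name} "
                             f"to {target.name}")
        scaled = self.magnitude * self.units.factor / target.factor
        return Quantity(scaled, target)

    def __add__(self, other):
        if not isinstance(other, Quantity):
            return NotImplemented
        # Mixed Units are converted to the left operand's Units first.
        return Quantity(self.magnitude + other.to(self.units).magnitude,
                        self.units)

    def __str__(self):
        # Label the Units for human readers.
        return f"{self.magnitude:g} {self.units.name}"


GRAM = Units("g", "mass", 1.0)
MILLIGRAM = Units("mg", "mass", 1e-3)
SECOND = Units("s", "time", 1.0)

total = Quantity(1.0, GRAM) + Quantity(500.0, MILLIGRAM)
```

Here `total` is labeled in grams, while adding a `SECOND` quantity to a `GRAM` quantity raises `ValueError`.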

jmason42 commented 6 years ago

:+1: Thanks for compiling and expanding all this discussion, I'll probably read through this a few times.

> (Call these types "Dimension" and "Unit" for singular/plural clarity? E.g. a Unit has a Dimension and its conversion coefficient must be consistent with other Units with the same Dimension.)

I was thinking about this too, and I'm inclined to agree. We say "units" because "meters per second" isn't just one unit, it's two, and the pluralization is more general. There is a broader conversation here to have about what we want to do with the support for unit'd quantities in the model and associated code.

By the way, scipy.constants does a lot of the work for us.