UDST / urbansim_templates

Building blocks for simulation models
https://udst.github.io/urbansim_templates
BSD 3-Clause "New" or "Revised" License

Formats for persistent storage of configured model step instances #37

Open smmaurer opened 6 years ago

smmaurer commented 6 years ago

How should we balance dict-based and binary representations of configured model step instances?

This topic was raised by @gitiauxx. Writeup based on discussion with @janowicz and @sablanchard.

Pros and cons

Currently, template objects save and recreate their configurations using dictionaries, which ModelManager stores on disk as YAML files.
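To make the dict round trip concrete, here is a minimal sketch. `ExampleStep` and its `to_dict()` / `from_dict()` methods are hypothetical names chosen for illustration, not the actual template API, and `json` from the standard library stands in for YAML so the sketch has no dependencies.

```python
import json


class ExampleStep:
    """Hypothetical template class, for illustration only."""

    def __init__(self, name, model_expression, coefficients=None):
        self.name = name
        self.model_expression = model_expression
        self.coefficients = coefficients or {}

    def to_dict(self):
        # Only plain Python types, so the result can be written out as text.
        return {
            "name": self.name,
            "model_expression": self.model_expression,
            "coefficients": dict(self.coefficients),
        }

    @classmethod
    def from_dict(cls, d):
        # Rebuild an equivalent object from the saved dictionary.
        return cls(d["name"], d["model_expression"], d["coefficients"])


step = ExampleStep("price-model", "price ~ sqft + age", {"sqft": 0.8, "age": -0.1})
text = json.dumps(step.to_dict(), indent=2)  # stand-in for the YAML file on disk
restored = ExampleStep.from_dict(json.loads(text))
```

This works as long as everything needed to rebuild the step fits into plain types, which is the limitation discussed next.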

Advantages of dict/yaml representations:

But some configured model steps are hard to represent as dictionaries. A mild case is the SmallMultinomialLogit class, where we use PyLogit for estimation. We can extract the fitted coefficients, which is enough to "run" the model step, but if a user wants to go back and inspect the fitted PyLogit model object further, the only way to recreate it is to save it to a binary pickle file.

A harder case is the RandomForestRegressionStep, which can involve thousands of sub-models. Storing those as a dictionary would not be feasible even if we wanted to, so we need to save a binary representation of the fitted model even to "run" the model step.

Advantages of pickle representation:

Strategy

It seems like the best strategy for now is to follow a hybrid approach.

Templates can be required to produce a dictionary representation that includes, at a minimum, (a) all of the user-specified parameters and (b) some kind of description of the fitted model, like a summary table. Optionally, the dictionary can also include file names for binary payloads.
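One way the hybrid layout could look, as a rough sketch rather than a proposed standard: the dictionary carries the parameters and a summary, plus the file name of an optional pickle. The function names, the `payloads` key, and the `-fitted.pkl` naming convention are all assumptions made up for this example.

```python
import os
import pickle
import tempfile


def save_step(config_dir, name, params, summary, fitted_model=None):
    """Return a dict record; write the fitted model to a pickle if one is given."""
    record = {"name": name, "params": params, "summary": summary, "payloads": {}}
    if fitted_model is not None:
        filename = name + "-fitted.pkl"  # hypothetical naming convention
        with open(os.path.join(config_dir, filename), "wb") as f:
            pickle.dump(fitted_model, f)
        # The dict stores only the file name, not the binary content.
        record["payloads"]["fitted_model"] = filename
    return record


def load_payload(config_dir, record, key):
    """Load a binary payload named in the record, or return None if absent."""
    filename = record["payloads"].get(key)
    if filename is None:
        return None
    with open(os.path.join(config_dir, filename), "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as config_dir:
    fitted = {"trees": [[0.1, 0.2], [0.3, 0.4]]}  # stand-in for a fitted model object
    record = save_step(config_dir, "rf-step", {"n_estimators": 2}, "2 trees fitted", fitted)
    loaded = load_payload(config_dir, record, "fitted_model")
```

The record stays human-readable and can be written to yaml as before, while anything that cannot be expressed as a dictionary lives in the side file.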

A future UI could directly read some basic information about the model step, and would go through Python backend services for building or running models, or getting indicators.

Eddie reports that the OPUS version of UrbanSim did something similar, storing high-level settings in XML files and data arrays in NumPy binaries.

Tasks

  1. To what extent do pickle files depend on the Python version or the version of the library defining the pickled object? Figure out how to deal with this reliably in the templates.

  2. Develop some standards about how to specify and store binary payloads, probably including an indication of whether the payload is required or optional. Update ModelManager and the existing templates accordingly. Update the design patterns in the main README.

  3. In the templates, we should switch to using the Python standard syntax for expressing dict representations of objects, __dict__ (which I hadn't known about). This will have the side benefit of making the objects automatically pickleable. Should we also be using __getstate__() and __setstate__()?
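As a starting point for tasks 1 and 3, the sketch below shows how `__dict__`, `__getstate__()` and `__setstate__()` could fit together. `ExampleStep`, the `_cache` attribute, and the `python_version` key are hypothetical, and recording the Python version in the pickled state is just one possible way to detect a mismatch at load time, not a settled design.

```python
import pickle
import sys
import warnings


class ExampleStep:
    """Hypothetical template class, for illustration only."""

    def __init__(self, name, params):
        self.name = name
        self.params = params
        self._cache = None  # transient data we do not want on disk

    def to_dict(self):
        # __dict__ holds the instance attributes; drop the private ones.
        return {k: v for k, v in self.__dict__.items() if not k.startswith("_")}

    def __getstate__(self):
        # Called by pickle when saving: store the public attributes plus
        # the Python version the object was pickled under.
        state = self.to_dict()
        state["python_version"] = list(sys.version_info[:2])
        return state

    def __setstate__(self, state):
        # Called by pickle when loading: warn on a version mismatch,
        # then restore the attributes and reset the transient cache.
        state = dict(state)
        saved = tuple(state.pop("python_version"))
        if saved != sys.version_info[:2]:
            warnings.warn("pickled under Python %d.%d" % saved)
        self.__dict__.update(state)
        self._cache = None


step = ExampleStep("price-model", {"alpha": 0.5})
step._cache = [1, 2, 3]
restored = pickle.loads(pickle.dumps(step))
```

When a class defines neither method, pickle saves and restores the instance's `__dict__` directly, so the two hooks are only needed when some attributes should be excluded or extra metadata added, as here.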

smmaurer commented 6 years ago

PR #41 implements task 2 from above. Still need to take a look at 1 and 3.