How should we balance dict-based and binary representations of configured model step instances?
This topic was raised by @gitiauxx. Writeup based on discussion with @janowicz and @sablanchard.
Pros and cons
Currently, template objects save and recreate their configurations using dictionaries, which ModelManager stores on disk as YAML files.
Advantages of dict/YAML representations:
easily human readable
easy to track and diff using Git
easy to read in non-Python codebases
easy to determine the version of the file format, which makes exchanging files more reliable
But certain configured model steps are hard to represent as dictionaries. The milder version of this is something like the SmallMultinomialLogit class, where we use PyLogit for estimation. We're able to extract the fitted coefficients, which is enough to "run" the model step, but if a user wants to go back and do further inspection of the fitted PyLogit model object, the only way to recreate it is by saving it to a binary pickle file.
In a more challenging version of this, the RandomForestRegressionStep involves potentially thousands of sub-models, which would not be feasible to store as a dictionary even if we wanted to. So we'll need to save a binary representation of the fitted model even to "run" the model step.
Advantages of pickle representation:
easy storage of large objects, and objects from libraries we don't control
Python manages the file format (plus reading and writing), so there's less to implement and maintain on our end
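To illustrate the second point, here is a minimal sketch of a pickle round trip. The standard library Fraction class stands in for a fitted object from a library we don't control; the variable names are made up for the example.

```python
import pickle
from fractions import Fraction

# Stand-in for a fitted model object defined in someone else's library
fitted = {'coefficients': [Fraction(1, 3), Fraction(2, 7)]}

# No custom serialization code: pickle handles both writing and reading
payload = pickle.dumps(fitted)
restored = pickle.loads(payload)
```

The convenience comes with a condition: a pickle stores a reference to the class by module and name, so the defining module (here, fractions) has to be importable wherever the payload is loaded.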
Strategy
The best strategy for now seems to be a hybrid approach: a dictionary for every configured step, plus binary payloads where a dictionary is not enough.
Templates can be required to produce a dictionary representation that includes, at a minimum, (a) all of the user-specified parameters and (b) some kind of description of the fitted model, like a summary table. Optionally, the dictionary can also include file names for binary payloads.
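A rough sketch of what this could look like. Everything here is hypothetical: ExampleStep, the payloads key, and the required flag are invented for illustration and are not part of ModelManager or the existing templates. The dictionary carries the user parameters and a summary, and refers to the pickled fitted model by file name.

```python
import os
import pickle
import tempfile


class ExampleStep:
    """Hypothetical template, for illustration only."""

    def __init__(self, model_expression=None, name=None):
        self.model_expression = model_expression
        self.name = name
        self.summary_table = None
        self.fitted_model = None  # object that is impractical to store as a dict

    def to_dict(self, payload_dir):
        # (a) user-specified parameters and (b) a description of the fitted model
        d = {
            'template': self.__class__.__name__,
            'name': self.name,
            'model_expression': self.model_expression,
            'summary_table': self.summary_table,
        }
        # Optional binary payload, referenced by file name only
        if self.fitted_model is not None:
            filename = '{}.pkl'.format(self.name)
            with open(os.path.join(payload_dir, filename), 'wb') as f:
                pickle.dump(self.fitted_model, f)
            d['payloads'] = {'fitted_model': {'file': filename, 'required': True}}
        return d

    @classmethod
    def from_dict(cls, d, payload_dir):
        obj = cls(model_expression=d['model_expression'], name=d['name'])
        obj.summary_table = d.get('summary_table')
        entry = d.get('payloads', {}).get('fitted_model')
        if entry is not None:
            path = os.path.join(payload_dir, entry['file'])
            if os.path.exists(path):
                with open(path, 'rb') as f:
                    obj.fitted_model = pickle.load(f)
            elif entry['required']:
                raise FileNotFoundError('missing required payload: ' + path)
        return obj


payload_dir = tempfile.mkdtemp()
step = ExampleStep(model_expression='price ~ sqft', name='example-step')
step.summary_table = 'sqft  0.5'
step.fitted_model = {'coefficients': [0.5]}  # placeholder for a real fitted object

config = step.to_dict(payload_dir)  # this dict is what would be written to YAML
restored = ExampleStep.from_dict(config, payload_dir)
```

The dictionary stays small and diffable, and a reader that cannot open the pickle still sees the parameters, the summary, and which payload files belong to the step.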
A future UI could read basic information about the model step directly from the dictionary, and would go through Python backend services for building or running models, or getting indicators.
Eddie reports that the OPUS version of UrbanSim did something similar, storing high-level settings in XML files and data arrays in NumPy binaries.
Tasks
To what extent do pickle files depend on the Python version or the version of the library defining the pickled object? Figure out how to deal with this reliably in the templates.
Develop some standards about how to specify and store binary payloads, probably including an indication of whether the payload is required or optional. Update ModelManager and the existing templates accordingly. Update the design patterns in the main README.
In the templates, we should switch to Python's standard dict representation of an object, its __dict__ attribute (which I hadn't known about). This will have the side benefit of making the objects automatically picklable, as long as every attribute value is itself picklable. Should we also be using __getstate__() and __setstate__()?
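As a starting point for the first and last tasks, here is a small sketch of how __dict__, __getstate__() and __setstate__() fit together. ExampleStep, its _cache attribute, and the _python_version stamp are hypothetical, and the version check is only one possible way to flag a mismatch, not a settled design.

```python
import pickle
import platform
import warnings


class ExampleStep:
    """Hypothetical template, for illustration only."""

    def __init__(self, model_expression=None, name=None):
        self.model_expression = model_expression
        self.name = name
        self._cache = None  # transient data that should not be saved

    def to_dict(self):
        # vars(self) returns self.__dict__; private attributes are left out
        return {k: v for k, v in vars(self).items() if not k.startswith('_')}

    def __getstate__(self):
        # Called when pickling; copy so the live object is not modified
        state = self.__dict__.copy()
        state['_cache'] = None
        state['_python_version'] = platform.python_version()
        return state

    def __setstate__(self, state):
        # Called when unpickling, in place of the default __dict__ update
        state = dict(state)
        saved = state.pop('_python_version', None)
        if saved != platform.python_version():
            warnings.warn('payload was pickled under Python {}'.format(saved))
        self.__dict__.update(state)


step = ExampleStep(model_expression='price ~ sqft', name='example-step')
step._cache = [1, 2, 3]
clone = pickle.loads(pickle.dumps(step))
```

Without the two methods, pickle saves and restores __dict__ as-is. Defining them is mainly useful when some attributes should be dropped or when extra metadata, such as the interpreter version, should travel with the payload.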