Design decisions around low-level vs high-level objects, yaml vs. csv serialization, etc...

fscottfoti commented 9 years ago

The code in https://github.com/synthicity/activitysim/pull/4 already raises some interesting questions, the biggest of which is how much to use the dcm code that Matt is currently working on. As Matt and I discussed, dcm is primarily useful for serializing to YAML, for wrapping low-level models inside larger segmented models, and for a few other small things.

If we're not necessarily wed to YAML, if we're primarily interested in simulation as opposed to estimation (early feedback from Dave Ory says this might be the case), and if we want to stick with the CSV/XLSX format for storing coefficients (which is nice when doing alternative-specific coefficients because that's naturally 2D), then we're free to address this problem a little more directly using the other underlying utilities we have built.

https://github.com/synthicity/activitysim/blob/adding-defaults/example/models.py does exactly this. The directness and conciseness of this approach are compelling. We could potentially build up a new set of utilities with a slightly different set of design decisions if we want to. At this time, the dependencies used from UrbanSim are essentially 1) the simulation framework for variables, tables, etc., 2) low-level choice utilities from the urbanchoice directory, and 3) utils.py, for nothing really important. We've discussed putting these in a different repo before, so it's worth mentioning again in the context of this larger discussion.

DavidOry commented 9 years ago

Would it be possible to spec out the auto ownership model in YAML format to allow us to compare the YAML versus CSV approaches? I like the idea of seeing the utility equation(s), which the CSV approach makes difficult. But it may be that the YAML format gets really messy?

Also, are there other examples from the scientific computing realm that may help inform the decision? Any standards and/or common approaches?

jiffyclub commented 9 years ago

It would look similar to this example of household location choice from our UrbanSim demo: https://github.com/synthicity/sanfran_urbansim/blob/master/configs/hlcm.yaml

Nothing we've done so far for UrbanSim has required alternative-specific coefficients, and a 2D array of coefficients is a little better suited to CSV than to YAML. There are certainly ways to accommodate a table, though. For example, here's the output of converting a pandas DataFrame into YAML.

Python:

In [14]: df
Out[14]:
   a  b  c
x  0  3  6
y  1  4  7
z  2  5  8

YAML:

a:
    x: 0
    y: 1
    z: 2
b:
    x: 3
    y: 4
    z: 5
c:
    x: 6
    y: 7
    z: 8
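
For reference, a conversion like the one above can be sketched in a few lines. This is a minimal sketch that assumes PyYAML is installed; the variable names are made up for illustration, and the explicit `int()` conversion is there only so PyYAML sees plain Python scalars rather than NumPy types.

```python
import pandas as pd
import yaml  # PyYAML, assumed to be installed

# The same 3x3 frame shown above.
df = pd.DataFrame(
    {"a": [0, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]},
    index=["x", "y", "z"],
)

# Build a {column: {row: value}} mapping of plain Python ints.
nested = {
    col: {row: int(value) for row, value in series.items()}
    for col, series in df.items()
}

# Block style (default_flow_style=False) gives one "key: value" per line.
yaml_text = yaml.safe_dump(nested, default_flow_style=False)
print(yaml_text)

# Reading it back rebuilds the same table.
restored = pd.DataFrame(yaml.safe_load(yaml_text))
```

The round trip shows that nothing is lost going through YAML; the cost is purely in how many lines the table takes up.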

fscottfoti commented 9 years ago

The other limitation is that right now we don't do complicated expressions in our YAML files. We compute variables in Python, and then you can do very simple transformations as supported by the Patsy syntax. Unfortunately, Patsy is pretty limited. In the current CSV files there are a lot of expressions that fall in between what Patsy supports and something that requires writing Python, e.g. things like

(distance-1).clip(0,1).

We could support that in YAML of course, but we would typically do that in Python right now. Remember that the almost magical advantage Python has over Java is that it's interpreted. That means you can write small Python snippets (not an expression language) as text in your CSV, call eval on them, and get access to everything that's in scope at the time you execute them. I went ahead and did that in the first example just to show how it might work. Just make sure not to put "rm -rf *" in your CSV file.
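
To make the eval idea concrete, here is a rough sketch of how it could work. It is not the code from the pull request: `SPEC_CSV`, `evaluate_spec`, the `income` column, the alternative names, and all coefficient values are hypothetical, chosen only to illustrate the mechanism.

```python
import io
import pandas as pd

# Hypothetical spec: one row per expression, one coefficient column per
# alternative. The first expression is quoted because it contains a comma.
SPEC_CSV = '''expression,0_cars,1_car
"(distance - 1).clip(0, 1)",0.5,-0.2
income > 50,0.0,1.5
'''

def evaluate_spec(spec, choosers):
    """Evaluate each expression string and return chooser-by-alternative utilities."""
    # Expose every chooser column to the expressions by name.
    scope = {name: choosers[name] for name in choosers.columns}
    variables = pd.DataFrame(
        {expr: eval(expr, {}, scope) for expr in spec.index},
        index=choosers.index,
    ).astype(float)  # boolean results become 0.0 / 1.0
    # (choosers x expressions) dot (expressions x alternatives)
    return variables.dot(spec)

spec = pd.read_csv(io.StringIO(SPEC_CSV), index_col="expression")
choosers = pd.DataFrame({"distance": [0.5, 1.5, 3.0], "income": [40, 60, 80]})
utilities = evaluate_spec(spec, choosers)
print(utilities)
```

Note that eval runs whatever text is in the file, so this only makes sense for spec files you trust.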

By far the biggest issue is the 5 alternatives x 15 variables table in some of these CSV files, which would be 75 lines in a YAML file. Also, the lines in YAML would be something like (distance-1).clip(0,1) * 0_cars. Notice that the alternative is explicit here, where it's implicit in table form. Of course YAML also doesn't support the multiple worksheets and color coding that are nice in XLS spreadsheets, but all that comes at the expense of being able to version control the text and view it on GitHub like this.
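
The blow-up from table form to row-by-row form is easy to see with a toy example. The sketch below uses a made-up 2 x 3 table (hypothetical expressions, alternatives, and coefficients) and flattens it into one line per expression/alternative pair, in the "expression * alternative" style described above.

```python
import io
import pandas as pd

# Hypothetical 2 expressions x 3 alternatives table.
TABLE_CSV = '''expression,0_cars,1_car,2_cars
"(distance - 1).clip(0, 1)",0.5,-0.2,-0.4
income > 50,0.0,1.5,2.0
'''

spec = pd.read_csv(io.StringIO(TABLE_CSV), index_col="expression")

# One line per (expression, alternative) pair: the alternative that was
# implicit in the column header now has to be spelled out on every line.
rows = [
    f"{expr} * {alt}: {float(coef)}"
    for expr, row in spec.iterrows()
    for alt, coef in row.items()
]

for line in rows:
    print(line)
print(len(rows), "lines for a", spec.shape, "table")
```

With 15 expressions and 5 alternatives the same loop yields the 75 lines mentioned above.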

The main drawback of CSV vs YAML is not having a place to put filters or choosers or segmentation and higher level concepts like that (model configuration as opposed to specification). Right now that's directly expressed in the model.

jiffyclub commented 9 years ago

As to your question about other examples of this, I don't know of anything comparable. In the Python circles I'm familiar with, people mostly keep their configuration and specifications as code in Python files and IPython Notebooks.

fscottfoti commented 9 years ago

We should make a call pretty soon on whether to allow specification of models in CSV or stick with YAML. I thought it would be helpful to be concrete, so here is one of the model specifications converted from CSV to YAML (you can also see the few lines it takes to convert CSV to YAML if you're interested). This is the kind of thing you can expect from a YAML representation: 1) row-by-row specification, 2) repetition of column headers, 3) the need to know YAML syntax (and yes, you can have syntax errors), 4) no GUI to change the formatting of columns, and 5) you will generally edit these things in your favorite text editor rather than Excel. If you want me to send a screenshot of the table in Excel I can do that, but I figured we can imagine what it looks like (let me know otherwise).

http://nbviewer.ipython.org/github/synthicity/activitysim/blob/cdap/notebooks/config_csv_to_yaml.ipynb

fscottfoti commented 9 years ago

I just realized the CSV version of the file is available on GitHub here:

https://github.com/synthicity/activitysim/blob/cdap/example/configs/cdap_2_person.csv

danielsclint commented 9 years ago

I think, based on our call before the break, we were in general agreement about moving forward with CSV until it clearly doesn't fit the mold anymore. If we run into that problem at all, we could come back and re-evaluate whether we convert to YAML or move forward with some type of hybrid approach.

It looks like switching between the two formats is pretty easy on the input file side. How much would the code that reads the data need to change if we started with CSV and for some reason had to switch to YAML for everything?

fscottfoti commented 9 years ago

Sounds like a plan. It will not be hard to switch from CSV to YAML for the specifications. It will be slightly harder to go from Python to YAML if we prefer configuration files for the higher level model configuration, but only if we actually end up doing something non-standard in Python, which is rare.
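
One way to keep that switch cheap, sketched under assumptions: hide the file format behind a single loader that always returns the same expressions-by-alternatives DataFrame. The `load_spec` function, the `expression` column name, and the `{alternative: {expression: coefficient}}` YAML layout are all hypothetical, and PyYAML is assumed to be installed.

```python
import os
import tempfile
import pandas as pd
import yaml  # PyYAML, assumed to be installed

def load_spec(path):
    """Return an expressions x alternatives DataFrame from a CSV or YAML file."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path, index_col="expression")
    if ext in (".yaml", ".yml"):
        with open(path) as f:
            # Assumed layout: {alternative: {expression: coefficient}}
            spec = pd.DataFrame(yaml.safe_load(f))
        spec.index.name = "expression"
        return spec
    raise ValueError(f"unsupported spec format: {ext}")

# Usage: write the same hypothetical spec in both formats and compare.
csv_text = 'expression,0_cars,1_car\n"(distance - 1).clip(0, 1)",0.5,-0.2\nincome > 50,0.0,1.5\n'
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "spec.csv")
    yaml_path = os.path.join(tmp, "spec.yaml")
    with open(csv_path, "w") as f:
        f.write(csv_text)
    from_csv = load_spec(csv_path)
    nested = {
        alt: {expr: float(coef) for expr, coef in column.items()}
        for alt, column in from_csv.items()
    }
    with open(yaml_path, "w") as f:
        yaml.safe_dump(nested, f)
    from_yaml = load_spec(yaml_path)

# Align row and column order before comparing values.
same = from_csv.equals(from_yaml.loc[from_csv.index, from_csv.columns])
print(same)
```

With a loader like this, the rest of the model code only ever sees a DataFrame, so a later change of format would stay confined to one function.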
