UDST / bayarea_urbansim

UrbanSim implementation for the San Francisco Bay Area

version control for outputs (ipython notebook versions?) #35

Closed: tbuckl closed this issue 8 years ago

tbuckl commented 9 years ago

There is an open-ended question about how to version outputs. Currently, we version outputs by default using the ipython notebook output cells. We assume that whatever notebook is more recent is the primary one and we delete the previous one.

However, comparing BLOBs of outputs in JSON is difficult, so it may not be possible to meaningfully compare outputs across notebooks. If we can't meaningfully compare outputs across notebooks, then why would we version the notebook with the outputs in it?

Is the goal to version the output cells, or the input cells? Or both?
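If the goal is to version only the input cells, one option is to strip the outputs before committing, so that diffs show code changes only. Here is a minimal sketch, assuming the nbformat 4 JSON layout (a top-level `cells` list, with code cells carrying `outputs` and `execution_count`). `strip_outputs` is a hypothetical helper, not something IPython provides under that name.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a v4 notebook dict with code-cell outputs removed."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Tiny in-memory notebook standing in for a real .ipynb file (illustrative only)
example = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Simulation"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

clean = strip_outputs(example)
# Stable key order and indentation keep the JSON diff-friendly
print(json.dumps(clean, indent=1, sort_keys=True))
```

For a real notebook, the same function could be applied to `json.load(open(path))` and the result written back out before `git add`.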

Here are a few links to discussion on how other people have thought about versioning for ipython notebooks.

https://ipython.org/ipython-doc/stable/interactive/tips.html#lightweight-version-control

https://github.com/ipython/ipython/issues/8009

tbuckl commented 9 years ago

The importance of versioning notebooks/outputs is not limited to the "working" outputs from them. In fact, it may be more pressing for debugging. What's the best way to share errors in notebooks that need to be debugged? Recently, we committed to master a copy-and-paste of the errors from the Simulation notebook as a text file. https://github.com/MetropolitanTransportationCommission/bayarea_urbansim/commit/4f462d24a1ea56d1e3c17052ce7f93c8bc1e2a8e

This is clearly not a good way to proceed. So what else might we do? Committing those changes on a temporary branch would probably be wiser, but is that the best way to work?

One related issue is that before the Simulation notebook is run, the Estimation notebook might (should?) be run, and this will produce outputs that change YAML files in "configs." This is especially confusing for a new user (especially one that's working with git). It seems that all of the changed YAML files in /configs/ should also be checked in so that Simulation can be debugged with those. However, the user might not know whether or not those configs were relevant to the Simulation bug.

fscottfoti commented 9 years ago

Generally speaking I think Notebooks are terrible for version control for all the obvious reasons. I've begun to just use them for development, and polished outputs come from a straight Python script, e.g.:

https://github.com/synthicity/bayarea_urbansim/blob/master/Simulation.py

As for estimation, I do not think we should be running estimation before simulation every time. Estimation rarely needs to change once we get coefficients we believe in; we just modify simulation inputs and rerun. It does make sense to test estimation on a regular basis to make sure it works, but I would then discard the results. That said, I have often wanted a feature that would check for 1 or 2 decimal place closeness of coefficients and not update the YAML files if they're the same at that degree of precision.
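A rough sketch of what that closeness check could look like, assuming the coefficients have already been loaded from the YAML files into plain dicts. The function name, the coefficient names, and the half-unit tolerance are all hypothetical choices for illustration, not anything UrbanSim does today.

```python
def coefficients_changed(old, new, decimals=2):
    """True if the coefficient names differ, or any value moved by at
    least half a unit in the last kept decimal place."""
    if set(old) != set(new):
        return True
    tolerance = 0.5 * 10 ** -decimals
    return any(abs(old[name] - new[name]) >= tolerance for name in old)


# Made-up coefficients standing in for values read from a config file
old = {"sqft": 0.4512, "income": -1.2033}
rerun = {"sqft": 0.4498, "income": -1.2051}    # only estimation noise
shifted = {"sqft": 0.4700, "income": -1.2033}  # a real change

print(coefficients_changed(old, rerun))
print(coefficients_changed(old, shifted))
```

The estimation step would then rewrite a YAML file only when `coefficients_changed` returns `True`, which keeps `git status` quiet after a routine test run.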

tbuckl commented 9 years ago

Thanks @fscottfoti. So it sounds like the best thing would be for us to share Python scripts when it comes to the input. What do you make of sharing the outputs? Should we just write the standard error and standard output as Simulation.stderr and Simulation.stdout and commit those? Also, should we commit these on a new branch each time? It seems that we don't need to ever merge back in outputs to master, but we would like to keep track of them and share them.

fscottfoti commented 9 years ago

Makes sense. Honestly just making a gist seems like a good idea for some of these things. Saving and sharing the stdout on the outputs makes a lot of sense. I don't think that these are really version controlled though - I mean there's random noise every time you run so you can't really compare them. I mean you just run them and tag them with a date and some git hashes and just save them. I wonder if you could just make the output directory sync with Box and do it that way?
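Something like the following could do the "run, tag with a date and git hash, save" part. It is only a sketch: `run_label`, `current_commit`, and `run_and_save` are hypothetical names, the commit hash and timestamp in the demo are placeholders, and the demo runs a one-line Python command in place of the real Simulation script.

```python
import subprocess
import sys
import tempfile
from datetime import datetime
from pathlib import Path


def run_label(when, commit):
    """Build a label such as 20150601-123000_0123456 from a time and a hash."""
    return f"{when:%Y%m%d-%H%M%S}_{commit[:7]}"


def current_commit(repo_dir="."):
    """Ask git for the hash of HEAD (needs git and a repository)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def run_and_save(command, out_dir, label):
    """Run a command and save its stdout/stderr as <label>.stdout/.stderr."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(command, capture_output=True, text=True)
    (out_dir / f"{label}.stdout").write_text(result.stdout)
    (out_dir / f"{label}.stderr").write_text(result.stderr)
    return result.returncode


# Demo with a placeholder hash and a stand-in command
label = run_label(datetime(2015, 6, 1, 12, 30, 0), "0123456789abcdef")
with tempfile.TemporaryDirectory() as tmp:
    code = run_and_save(
        [sys.executable, "-c", "print('simulation finished')"], tmp, label
    )
    saved_stdout = (Path(tmp) / f"{label}.stdout").read_text()

print(label, code, saved_stdout.strip())
```

In real use the label would come from `run_label(datetime.now(), current_commit())`, and `out_dir` could be whatever directory is synced with Box.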

tbuckl commented 9 years ago

I can't speak to the random noise. @mkreilly any thoughts on that?

tbuckl commented 9 years ago

that said, i will remove MetropolitanTransportationCommission@4f462d2 and put it on a branch with the configs

tbuckl commented 9 years ago

this is how grumpy cat feels about random noise: [image: how grumpy cat feels about noise]