UDST / urbansim_templates

Building blocks for simulation models
https://udst.github.io/urbansim_templates
BSD 3-Clause "New" or "Revised" License
20 stars 13 forks source link

Store additional model fit statistics #114

Open smmaurer opened 5 years ago

smmaurer commented 5 years ago

This is a proposal to update the statistical templates to store additional model fit statistics as individually accessible values, rather than just embedded in the summary table. This would make downstream analysis easier.

How it works now

Using Large MNL as an example, these are the current class attributes related to model fit:

Proposed changes

I'm thinking we should add a class attribute called fit_statistics and use it to store a flexible dictionary of values -- standard errors, t-stats, p-values, and model-level statistics like r^2 or log-likelihood.

For large MNL, we can include everything we have access to from ChoiceModels: mnl.py#L344-L360. We'd build the dict at the end of the fit() method (large_multinomial_logit.py#L490), and include it in the constructor and dict i/o methods alongside summary_table and fitted_parameters.

It looks like we can do this without any changes to choicemodels, because these statistics are part of the results object that's available immediately after fitting the model.

Large MNL is the first priority here -- the other templates could probably wait until we refactor them for issue #54.

Potential questions

  1. Dict vs. individual attributes

Should we store these as top-level attributes of the template class, rather than inside a dictionary? That would be more Pythonic, and would make it more obvious if info was missing or formats changed. But it would make the class definitions a lot longer, and it would vary between model types -- especially once we expand into multi-stage models and machine learning approaches. So i think better to just have a dictionary whose name is the same across models but whose content varies.

  1. Keep fitted parameters separate?

Should we keep fitted_parameters as its own attribute, or include it in the dict? I'd lean toward keeping it separate. It's the most important piece of info because it feeds directly into the simulations. And leaving it where it is means less downstream code to change.

  1. Stop saving summary tables?

Should we stop saving summary tables and generate them on the fly instead? This would make the yaml files tidier. But i'm thinking we should keep them for now -- they come from multiple sources (ChoiceModels, Statsmodels, PyLogit) and we don't have a standard procedure for generating them yet. It would be nice to store them in a tidier way, though, per issue #48.

Tagging @alephcero @cvanoli for feedback

smmaurer commented 5 years ago

A variant of this that's probably nicer is to generate the dictionary inside choicemodels. We'd replace the so-called "raw results" of fitting the model with a tidy dictionary containing all of the estimated parameters, test statistics, and other metadata.

On the UrbanSim side, the template would save the entire dict into its yaml. The advantage here is that a user could then recreate the choicemodels object at a later time in order to do additional analysis. (For example, generate optional output like relative risk ratios, or reprint the summary table without it being saved in the yaml file.)

The interface would look something like this:

m  # choicemodels.MultinomialLogit
results = m.fit()  # choicemodels.MultinomialLogitResults

print(results)  # summary table
results.raw_results  # NEW, this would provide a tidy dict

choicemodels.MultinomialLogitResults(model_expression, results)  # rebuild from dict

(Historical note: The current "raw results" are already a dictionary, but with a confusing key structure that we don't want to retain. It's generated by the core estimation code, and we can't change it directly because other code relies on it. So i think we'd implement the new dict as a translation layer inside choicemodels.MultinomialLogitResults: mnl.py#L228.)

cvanoli commented 5 years ago

Per potential questions:

  1. I agree with you that it is better to store the results as a dictionary. In that way, it will be flexible among different models and the user can look at the dictionary keys to find the estimation information needed.
  2. Also agree with keeping fitted_parameters separate although they are also included in the fit_statistics object we will create.
  3. I would do both. Save the summary table in the yaml file (probably look for a better-looking way to save it) and also save it as a property of the object.
cvanoli commented 5 years ago

Regarding the choicemodels variant, I think would be better. As you mentioned, the results.results would be a tidier dict defined here https://github.com/UDST/choicemodels/blob/master/choicemodels/mnl.py#L262