smmaurer opened this issue 6 years ago
Update, after talking with @janowicz about the "special case" here of subsetting the choosers for MNL estimation performance improvement:
For Large MNL, I think it makes sense to name the sampling parameter `chooser_sample_size`. This works for both the performance-improvement use case and the out-of-sample validation use case, and it's analogous to `alt_sample_size`, which is the parameter for sampling alternatives.
For simpler models (like OLS) we could call the parameter `sample_size` instead.
I'm going to go ahead and implement the basic sampling of choosers for Large MNL, which will be pretty easy. Cross-validation functionality, and implementation for the other model types, are lower priority.
PR #33 implements the Large MNL chooser sampling.
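For reference, here's a rough sketch of what the chooser sampling could look like in use, assuming the `LargeMultinomialLogitStep` template with placeholder table names, column names, and model expression (exact argument names may differ from the merged implementation):

```python
from urbansim_templates.models import LargeMultinomialLogitStep

# Assumes 'households' and 'buildings' tables are already registered with Orca;
# the column names and model expression below are placeholders.
m = LargeMultinomialLogitStep(
    choosers='households',
    alternatives='buildings',
    choice_column='building_id',
    model_expression='residential_price + jobs_within_30_min',
    chooser_sample_size=10000,   # sample the choosers (this proposal)
    alt_sample_size=50,          # existing parameter for sampling alternatives
)
m.fit()
```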
This is a feature proposal to support automated out-of-sample validation of fitted models, across all the statistical templates.
For example, a user could fit a model using 2/3 of the available estimation data and then use the remaining data to calculate out-of-sample goodness-of-fit metrics. This kind of cross-validation helps avoid overfitting and gives a better sense of how well a predictive model will generalize.
To start out, "holdout method" cross-validation seems sufficient. For machine learning templates we might want something more systematic, like k-fold cross-validation.
Tagging @waddell @Arezoo-bz @mxndrwgrdnr for feedback.
Implementation and usage
Add an optional `training_size` parameter that can take a count or proportion. If it is provided, estimate the model using only the training subset.
The standard summary table will report results corresponding to the training data. We will also calculate separate goodness-of-fit metrics using the holdout data, which can be saved as `validation_summary` (or similar) in the model step object.
What should the validation summary include? This may vary for different model types, and we'll need to do the calculations ourselves. Definitely an out-of-sample R^2 or pseudo-R^2. @Arezoo-bz, any other recommendations?
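For concreteness, here is a minimal sketch of those two metrics computed on the holdout records (the function names are just illustrative, not part of the templates API): an out-of-sample R^2 compares squared prediction errors to a mean-only baseline, and McFadden's pseudo-R^2 compares the holdout log-likelihood of the fitted model to that of a null model.

```python
import numpy as np

def out_of_sample_r2(y_true, y_pred):
    """R^2 on holdout observations, e.g. for an OLS template."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def out_of_sample_pseudo_r2(loglik_fitted, loglik_null):
    """McFadden pseudo-R^2 from holdout log-likelihoods, e.g. for MNL."""
    return 1.0 - loglik_fitted / loglik_null
```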
For a particular model step instance, we should always allocate the same records to the training set, for consistency when re-estimating a model. We can do this by setting a random seed and storing it in the saved model step.
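A minimal sketch of that split, assuming the estimation records are in a DataFrame; the helper name is illustrative. The seed would be drawn once, stored with the step, and reused so the same records always end up in the training set.

```python
import numpy as np

def split_estimation_data(df, training_size, seed):
    """Split estimation records into training and holdout subsets.

    `training_size` may be a fraction (0-1) or an absolute row count;
    the stored `seed` makes the split reproducible across re-estimation.
    """
    n = (int(round(training_size * len(df))) if training_size <= 1
         else int(training_size))
    rng = np.random.RandomState(seed)
    train_idx = rng.choice(df.index, size=n, replace=False)
    return df.loc[train_idx], df.drop(train_idx)

# The seed itself could be generated once per step and saved with it, e.g.:
# seed = np.random.randint(2**31 - 1)
```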
Special case: data subsetting for performance improvement
Sometimes we want to subset the estimation data primarily to speed up estimation, e.g. for MNL with sampling of alternatives.
Setting a `training_size` should automatically handle this use case as well.
There could be a performance hit from automatically calculating the out-of-sample validation metrics, though, depending on the model. Maybe we should calculate these only when requested. We could also provide an option to use a small validation sample alongside a small training sample.
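If we go that route (no metrics computed inside fit(), a separate call to produce them on request), the small-validation-sample option could be as simple as a helper like this, with hypothetical names, called from whatever method computes the metrics:

```python
def sampled_holdout(holdout_df, validation_size=None, seed=None):
    """Optionally cap the holdout records used for validation metrics.

    Hypothetical helper: `holdout_df` is a pandas DataFrame of holdout
    records and `validation_size` is a row count (None keeps all rows),
    so out-of-sample metrics stay cheap even when the estimation table
    is large.
    """
    if validation_size is None or validation_size >= len(holdout_df):
        return holdout_df
    return holdout_df.sample(n=validation_size, random_state=seed)
```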