UDST / urbansim_templates

Building blocks for simulation models
https://udst.github.io/urbansim_templates
BSD 3-Clause "New" or "Revised" License

Template for random forest models #43

Open smmaurer opened 5 years ago

smmaurer commented 5 years ago

I'm setting up an issue for the random forest template that @Gitiauxx is working on! Tagging @waddell and @Arezoo-bz for feedback and additional guidance on use cases for the template.

Goals

Create a model step template for random forest regression models. This will be used for tasks like real estate price prediction, along the lines of this notebook: REPM_Random_Forest.ipynb

Tasks

  1. build the template class
  2. docstrings describing usage
  3. demo notebook using some real-world data (could go in udst/public-template-workspace)
  4. unit tests, verifying that the code runs and ideally doing a sanity check of the model output
  5. pull request documenting the update!
Gitiauxx commented 5 years ago

In modelmanager.save_supplemental_object, change

content.to_pickle(filename) to pickle.dump(content, open(filename, 'wb'))

This avoids having to write a to_pickle method for every step that requires pickling.
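A minimal sketch of the proposed change, assuming a simplified signature (the real modelmanager.save_supplemental_object may take different arguments): pickle.dump() handles any Python object, so steps no longer need to expose a to_pickle() method.

```python
import os
import pickle
import tempfile

def save_supplemental_object(content, filename):
    # Generic pickling: works for any object, whether or not it
    # implements a to_pickle() method of its own.
    with open(filename, 'wb') as f:
        pickle.dump(content, f)

# Round-trip check with a plain dict, which has no to_pickle() method
path = os.path.join(tempfile.mkdtemp(), 'forest.pkl')
save_supplemental_object({'n_estimators': 100}, path)
with open(path, 'rb') as f:
    restored = pickle.load(f)
print(restored)  # {'n_estimators': 100}
```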

smmaurer commented 5 years ago

Hmmm, good point. From some quick googling, it looks like pickle.dump is the standard method, but sometimes a library will implement a custom to_pickle method that's a lot more efficient, e.g. for dataframes.

So no objections to switching to pickle.dump, but if the objects are large we might want to do some performance testing and use alternate methods for different use cases.

Gitiauxx commented 5 years ago

An additional potential issue: fit and predict from Scikit-learn have a different structure than the ones in previous steps (OLSRegression, ...). Scikit-learn uses matrices for the input and target data:

model.fit(trainX, trainY)

while

OLSRegression and others use

model.fit(data)

We could:

  1. Live with it. However, it is annoying when we want to define methods that should work across all steps (like cross-validation).
  2. Create a function in utils.py that takes a class and modifies its fit and predict methods to match the current structure.

The challenge is to find a way to do 2 without writing ugly code. Note that class inheritance is not recommended, since the input types differ between the parent class's fit/predict and the modified fit/predict. See what I have done in utils.py
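To make option 2 concrete, here's a hedged sketch (not the actual utils.py code) of a wrapper that attaches dataframe-based fit/predict to a scikit-learn estimator, avoiding inheritance entirely; the function name and target_col argument are illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def convert_to_model(model, target_col):
    # Keep references to the original matrix-based methods
    raw_fit, raw_predict = model.fit, model.predict

    def fit(df):
        # Split the dataframe into the (X, y) matrices sklearn expects
        y = df[target_col].values
        X = df.drop(columns=[target_col]).values
        return raw_fit(X, y)

    def predict(df):
        # Drop the target column if present, then predict from X
        X = df.drop(columns=[target_col], errors='ignore').values
        return raw_predict(X)

    # Shadow the bound methods with dataframe-based versions
    model.fit = fit
    model.predict = predict
    return model

# Usage: the wrapped model now has the same call signature as
# Statsmodels-style steps, i.e. model.fit(df)
df = pd.DataFrame({'sqft': [800, 1200, 1500, 2000],
                   'price': [100, 180, 230, 300]})
m = convert_to_model(RandomForestRegressor(n_estimators=10, random_state=0),
                     'price')
m.fit(df)
preds = m.predict(df)
```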

smmaurer commented 5 years ago

Ah, right, interesting. Here's how I've been thinking about this kind of thing:

The underlying stats libraries we're using need a variety of different data formats, typically either a dataframe, a single numpy array, or separate x and y numpy arrays.

When we run simulations, we need to abstract out the data management so that it's happening somewhere else. The Orca library is the interface between the templates and the data layer -- we request data by table name and column name, and get it as pandas objects. (And the data might be coming from memory, from disk, from a database, etc.)

There are also some extra data conventions we're trying to maintain in the templates, from earlier urbansim implementations: any time you request a table, you can (1) merge other tables onto it on the fly, and (2) also apply a list of filters to it. This functionality is mostly generalized into TemplateStep._get_df().
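The merge-and-filter convention can be sketched like this, assuming plain dataframes instead of Orca tables (this is not the actual TemplateStep._get_df implementation; the function name and argument shapes are illustrative):

```python
import pandas as pd

def get_df(tables, primary, merge_onto=None, filters=None):
    # merge_onto: list of (table_name, join_key) pairs to merge on the fly
    # filters: list of pandas query strings applied after merging
    df = tables[primary]
    for name, key in (merge_onto or []):
        df = df.merge(tables[name], on=key, how='left')
    for expr in (filters or []):
        df = df.query(expr)
    return df

# Usage: request buildings, merge zone attributes onto them, drop
# zero-price rows
buildings = pd.DataFrame({'zone_id': [1, 2, 1], 'price': [100, 0, 250]})
zones = pd.DataFrame({'zone_id': [1, 2], 'density': [5.0, 9.0]})
df = get_df({'buildings': buildings, 'zones': zones}, 'buildings',
            merge_onto=[('zones', 'zone_id')], filters=['price > 0'])
# two rows survive the filter, each with a merged density column
```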

(Using Patsy-format model expressions is another convention, but you're right in the PR #50 discussion that this is probably related to Statsmodels supporting it directly. If it's too hard to map onto Scikit-learn models, then probably ok to drop it for these templates.)

This is a clever solution you've implemented with from_patsy_to_array() and convert_to_model() in utils.py to create wrappers around Scikit-learn models so that they can directly fit and predict from dataframes. I worry a little that this might be over-engineering, and it would be simpler to have a couple of helper functions that we call from a template's fit() or run() method to convert a prepared dataframe to the format the underlying model wants. But absolutely fine to keep this as-is if you think it's better!
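For contrast, the simpler helper-function approach could be as small as this (a hypothetical helper, not existing code; the name and target argument are illustrative):

```python
import pandas as pd

def df_to_xy(df, target):
    # Split a prepared dataframe into the (X, y) arrays sklearn expects.
    # A template's fit() or run() would call this right before handing
    # data to the underlying estimator.
    y = df[target].values
    X = df.drop(columns=[target]).values
    return X, y

# Usage inside a hypothetical template method:
#   X, y = df_to_xy(prepared_df, self.out_column)
#   self.model.fit(X, y)
```

The trade-off is that the conversion logic lives in each template rather than on the model object itself.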

Gitiauxx commented 5 years ago

We could use helper functions in TemplateStep to convert the data into a format compatible with sklearn fit and predict methods.

The advantage of re-engineering those methods, however, is that whenever we call model.fit() or model.predict() we don't have to worry about whether it's a Statsmodels or Scikit-learn method (or something else). The cross_validate_score() helper is an example: it works whether the step is OLSRegression or RandomForest because it relies on a common structure for model.fit() and model.predict().
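A sketch of what such a helper might look like (not the actual utils.py implementation), showing how it only relies on the common model.fit(df) / model.predict(df) structure and so works for any conforming step:

```python
import numpy as np
import pandas as pd

def cross_validate_score(model, df, target, k=3):
    # k-fold cross-validation that assumes nothing about the model
    # beyond fit(df) and predict(df); returns mean squared error
    # averaged across folds.
    positions = np.random.RandomState(0).permutation(len(df))
    folds = np.array_split(positions, k)
    scores = []
    for i in range(k):
        test = df.iloc[folds[i]]
        train = df.iloc[np.concatenate(folds[:i] + folds[i + 1:])]
        model.fit(train)
        predicted = np.asarray(model.predict(test))
        scores.append(np.mean((predicted - test[target].values) ** 2))
    return float(np.mean(scores))
```

Any step with the shared interface, whether it wraps Statsmodels or Scikit-learn underneath, can be passed in unchanged.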