smmaurer opened this issue 6 years ago
In `modelmanager.save_supplemental_object`, change `content.to_pickle(filename)` to `pickle.dump(content, open(filename, 'wb'))`. This avoids having to write a `to_pickle` method for every step that requires pickling.
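For context, a minimal sketch of what the switched function body could look like (the `with` block is my addition so the file handle gets closed; the actual `save_supplemental_object` signature may differ):

```python
import pickle

def save_supplemental_object(content, filename):
    # Works for any picklable object, whether or not it implements
    # a to_pickle method
    with open(filename, 'wb') as f:
        pickle.dump(content, f)
```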
Hmmm, good point. From some quick googling, it looks like `pickle.dump` is the standard method, but sometimes a library will implement a custom `to_pickle` method that's a lot more efficient, e.g. for dataframes.

So no objections to switching to `pickle.dump`, but if the objects are large we might want to do some performance testing and use alternate methods for different use cases.
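One hedged sketch of what "alternate methods for different use cases" could look like (this dispatch logic is illustrative, not what the repo currently does):

```python
import pickle

def save_supplemental_object(content, filename):
    if hasattr(content, 'to_pickle'):
        # Pandas objects provide an optimized to_pickle implementation
        content.to_pickle(filename)
    else:
        # Fall back to the standard library for everything else
        with open(filename, 'wb') as f:
            pickle.dump(content, f)
```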
An additional potential issue: `fit` and `predict` from sklearn have a different structure from the ones in previous steps (OLSRegression, ...). Scikit-learn uses matrices for the input and target data, `model.fit(trainX, trainY)`, while OLSRegression and the others use `model.fit(data)`. We could:

The challenge is to find a way for 2 without writing ugly code. Note that class inheritance is not recommended, since the input types differ between the fit/predict of the parent class and the modified fit/predict. See what I have done in `utils.py`.
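To make the mismatch concrete, a small illustration (the column names and data are made up, and the commented-out call just stands in for the template-style interface):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'sqft': [800, 1200, 1500, 1800],
                   'beds': [1, 2, 3, 3],
                   'price': [300, 450, 520, 610]})

# Scikit-learn convention: separate X matrix and y vector
skl = LinearRegression()
skl.fit(df[['sqft', 'beds']].to_numpy(), df['price'].to_numpy())

# Template convention: a single dataframe, with the model expression
# (e.g. 'price ~ sqft + beds') held by the step itself
# step.fit(data=df)
```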
Ah, right, interesting. Here's how I've been thinking about this kind of thing:
The underlying stats libraries we're using need a variety of different data formats: typically either a dataframe, a numpy array, or separate x and y numpy arrays.
When we run simulations, we need to abstract out the data management so that it's happening somewhere else. The Orca library is the interface between the templates and the data layer -- we request data by table name and column name, and get it as pandas objects. (And the data might be coming from memory, from disk, from a database, etc.)
There are also some extra data conventions we're trying to maintain in the templates, from earlier urbansim implementations: any time you request a table, you can (1) merge other tables onto it on the fly, and (2) apply a list of filters to it. This functionality is mostly generalized into `TemplateStep._get_df()`.
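As a rough sketch of that convention (table and column names are hypothetical, and this is plain pandas rather than the actual `_get_df()` implementation):

```python
import pandas as pd

buildings = pd.DataFrame({'building_id': [1, 2, 3],
                          'parcel_id': [10, 10, 11],
                          'residential_units': [0, 12, 4]})
parcels = pd.DataFrame({'parcel_id': [10, 11],
                        'dist_to_cbd': [2.5, 7.1]})

# (1) merge other tables onto the requested table on the fly
df = buildings.merge(parcels, on='parcel_id')

# (2) apply a list of filters, expressed as query strings
for f in ['residential_units > 0', 'dist_to_cbd < 5']:
    df = df.query(f)
```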
(Using Patsy-format model expressions is another convention, but you're right in the PR #50 discussion that this convention probably stems from Statsmodels supporting Patsy directly. If it's too hard to map onto Scikit-learn models, it's probably ok to drop it for these templates.)
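For reference, this is what the Patsy convention looks like with Statsmodels' built-in formula support (the data and column names here are made up):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'price': [300, 450, 520, 610],
                   'sqft': [800, 1200, 1500, 1800]})

# Statsmodels accepts the Patsy expression and the dataframe directly
results = smf.ols('price ~ sqft', data=df).fit()
print(results.params)
```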
This is a clever solution you've implemented with `from_patsy_to_array()` and `convert_to_model()` in `utils.py` to create wrappers around Scikit-learn models so that they can directly fit and predict from dataframes. I worry a little that this might be over-engineering, and that it would be simpler to have a couple of helper functions that we call from a template's `fit()` or `run()` method to convert a prepared dataframe to the format the underlying model wants. But absolutely fine to keep this as-is if you think it's better!
We could use helper functions in `TemplateStep` to convert the data into a format compatible with sklearn's `fit` and `predict` methods.

The advantage of re-engineering those methods, however, is that whenever we call `model.fit()` or `model.predict()` we do not have to worry about whether it is a statsmodels or sklearn method (or something else). The `cross_validate_score()` helper is an example: it works whether the step is OLSRegression or RandomForest, because it relies on a common structure for `model.fit()` and `model.predict()`.
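To illustrate the benefit, here is a generic k-fold scorer sketched in that style (this is my own illustration, not the actual `cross_validate_score()` implementation; the R² scoring and fold logic are assumptions):

```python
import numpy as np
import pandas as pd

def cross_validate_score(model, data, target_col, k=5):
    """k-fold out-of-sample R^2 for any model exposing
    fit(data) and predict(data) on a single dataframe."""
    folds = np.array_split(data.sample(frac=1, random_state=0), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = pd.concat(folds[:i] + folds[i + 1:])
        model.fit(train)  # same call whether statsmodels- or sklearn-backed
        resid = test[target_col].to_numpy() - np.asarray(model.predict(test))
        ss_res = (resid ** 2).sum()
        ss_tot = ((test[target_col] - test[target_col].mean()) ** 2).sum()
        scores.append(1 - ss_res / ss_tot)  # R^2 on the held-out fold
    return float(np.mean(scores))
```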
I'm setting up an issue for the random forest template that @Gitiauxx is working on! Tagging @waddell and @Arezoo-bz for feedback and additional guidance on use cases for the template.
Goals
Create a model step template for random forest regression models. This will be used for tasks like real estate price prediction, along the lines of this notebook: REPM_Random_Forest.ipynb
Tasks