WinVector / pyvtreat

vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under a BSD-3-Clause license.
https://winvector.github.io/pyvtreat/

vtreat and sklearn pipeline #12

Closed. mglowacki100 closed this issue 4 years ago.

mglowacki100 commented 4 years ago

First of all, a really interesting project that could save a lot of repetitive work and provide a good baseline. I tried to find an example in the docs that uses Pipeline from scikit-learn, but didn't find one, so this is my quick and dirty attempt based on yours:

import pandas as pd
import numpy as np
import numpy.random
import vtreat
import vtreat.util
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

numpy.random.seed(2019)

def make_data(nrows):
    d = pd.DataFrame({'x': 5*numpy.random.normal(size=nrows)})
    d['y'] = numpy.sin(d['x']) + 0.1*numpy.random.normal(size=nrows)
    d.loc[numpy.arange(3, 10), 'x'] = numpy.nan  # introduce missing values in the numeric column
    d['xc'] = ['level_' + str(5*numpy.round(yi/5, 1)) for yi in d['y']]
    d['x2'] = np.random.normal(size=nrows)
    d.loc[d['xc']=='level_-1.0', 'xc'] = numpy.nan  # introduce a nan level
    d['yc'] = d['y']>0.5
    return d

df = make_data(500)

df = df.drop(columns=['y'])

transform = vtreat.BinomialOutcomeTreatment(outcome_target=True)

clf = Pipeline(steps=[
    ('preprocessor', transform),
    ('classifier', LogisticRegression())]
)

X, y = df, df.pop('yc')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)

print("model score: %.3f" % clf.score(X_test, y_test))

In general, it seems to work, but:

JohnMount commented 4 years ago

Thanks,

I haven't finished the Pipeline integration, but you have given me some good pointers on steps to get there. I'll close this after I add some of the suggestions.

JohnMount commented 4 years ago

In version 0.3.4 of vtreat the transforms implement a lot more of the sklearn step interface. Also, we have a really neat new feature that warns if .fit(X, y).transform(X) is ever called (it looks like sklearn's Pipeline calls .fit_transform(X, y), which is what vtreat wants in order to prevent over-fit issues via its cross-frame methodology). I have re-run your example here: https://github.com/WinVector/pyvtreat/blob/master/Examples/Pipeline/Pipeline_Example.md
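To spell out the intended calling pattern (a rough sketch against the example earlier in this issue, not the linked notebook): call .fit_transform() on the training data to get the cross-frame, and reserve .transform() for data that was not used in fitting.

import vtreat

# re-using X_train, X_test, y_train from the example above
transform = vtreat.BinomialOutcomeTreatment(outcome_target=True)

# what sklearn's Pipeline does on fit: the cross-frame path vtreat wants
cross_frame = transform.fit_transform(X_train, y_train)

# .transform() is then reserved for data not seen during fitting
test_frame = transform.transform(X_test)

# by contrast, transform.fit(X_train, y_train).transform(X_train)
# is the pattern that now triggers the warning described above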

Thanks for helping with the package.

mglowacki100 commented 4 years ago

Thanks!

  1. Minor issue: a few classes lack a __repr__ method, e.g. 'cross_validation_plan': <vtreat.cross_plan.KWayCrossPlanYStratified object at 0x10fa81b50>
  2. So, as far as I can see, it would be hard to put vtreat directly into a Pipeline as in my example, but I'll think about it - Pipelines are nice, and GridSearch over vtreat parameters would be cool.
JohnMount commented 4 years ago
  1. Ah, yes, I'll add __repr__() to the support classes (along the lines of the sketch at the end of this comment).
  2. I'm not sure grid-searching over vtreat parameters is that much of a benefit. Would it be better if vtreat hid its parameters from the pipeline?

Also, vtreat isn't using the cross-validation effects to search for hyper-parameter values. It is using them to try to avoid the nested model bias issue seen in non-cross-validated stacked models. So there may be less of a connection to GridSearchCV than it first appears.
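On point 1, a hypothetical sketch of the kind of __repr__ that could be added to the support classes (the mixin name and implementation here are assumptions, not vtreat's actual code):

class ReprMixin:
    # hypothetical sketch only: show the class name plus the instance's
    # attributes instead of the default '<... object at 0x...>' form
    def __repr__(self):
        args = ", ".join(f"{k}={v!r}" for k, v in vars(self).items())
        return f"{self.__class__.__name__}({args})"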

JohnMount commented 4 years ago

I am going to stub out the get/set parameter methods until I have some specific use cases/applications to code them to (are they tuned over during cross-validation, are they used to build new pipelines, are they just for display, are they used to simulate pickling?). I've added some more pretty-printing, but a lot of these objects are too complicated to be re-built from their printable form.

mglowacki100 commented 4 years ago
  1. For get_params and set_params, I see the following use cases: i) when you extend a base class, calling get_params via super in the child's __init__ is often used to reduce boiler-plate code; ii) just for display; iii) just for compatibility with sklearn.base.BaseEstimator, to avoid monkey-patching.

  2. I was thinking about fit(X_tr, y_tr).transform(X_tr) vs fit_transform(X_tr, y_tr), and correct me if I'm wrong: i) the mismatch arises when target-encoding is used for high-cardinality categorical variables, so e.g. zipcode 36104 from the train set could be target-encoded to 0.7 by fit_transform, while the same zipcode in the test set could be target-encoded to 0.8; so basically there are two "mappings". ii) in the fit method there is an internal call to self.fit_transform(X=X, y=y), e.g. line 216 of https://github.com/WinVector/pyvtreat/blob/master/pkg/build/lib/vtreat/vtreat_api.py, so X is transformed anyway but the result is not stored. So, here is an idea (sketched in code after this list):

    • add an attribute .X_fit_tr and store in it the result of the internal .fit_transform called from .fit
    • add an attribute .X_fit and store the input X, or some hash-id of it, to save memory
    • then modify .transform(X) by adding the condition: if X == self.X_fit, return self.X_fit_tr. That way fit(X_tr, y_tr).transform(X_tr) == fit_transform(X_tr, y_tr).
      Alternatively, instead of storing the dataframe in .X_fit_tr, just store the "mapping" needed to reproduce it during transform (if possible). This alternative is more memory efficient, and it also keeps fit separated from transform.
  3. Regarding GridSearchCV, I was thinking about e.g. the indicator_min_fraction parameter and checking the values 0.05, 0.1, 0.2. Within a pipeline this should be completely independent of the issues in point 2.
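Roughly, the caching idea from point 2 could look like this wrapper (a sketch only; the attribute names .X_fit and .X_fit_tr and the hash helper are placeholders, not vtreat's API):

import hashlib
import pandas as pd

def frame_id(X: pd.DataFrame) -> str:
    # cheap hash-id of a dataframe, stored instead of the frame itself
    return hashlib.sha256(pd.util.hash_pandas_object(X, index=True).values.tobytes()).hexdigest()

class CachingTreatment:
    # wraps a treatment so fit(X, y).transform(X) returns the cached cross-frame
    def __init__(self, treatment):
        self.treatment = treatment
        self.X_fit = None      # hash-id of the data used in fit
        self.X_fit_tr = None   # cached result of fit_transform on that data

    def fit(self, X, y):
        self.X_fit = frame_id(X)
        self.X_fit_tr = self.treatment.fit_transform(X, y)
        return self

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.X_fit_tr

    def transform(self, X):
        if self.X_fit is not None and frame_id(X) == self.X_fit:
            return self.X_fit_tr
        return self.treatment.transform(X)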

Thanks for explanations!

JohnMount commented 4 years ago

First, thank you very much for spending so much time to give useful and productive advice. I've tried to incorporate a lot of it into vtreat. It is very important to me that vtreat be Pythonic and sklearn-idiomatic.

Back to your points.

Yes, vtreat uses cross-validated out-of-sample methods in fit_transform() to actually implement fit, and then throws the transformed frame away. The out-of-sample frame is needed to get accurate estimates of out-of-sample performance for the score frame.

I've decided not to cache the result for use in a later .transform() step. My concerns are that this is a reference leak to a large object, and that I really should not paper over the differences between simulated out-of-sample methods and split methods (using different data for .fit() and .transform()). It is indeed not sklearn-like to have .fit_transform(X, y) return a different answer than .fit(X, y).transform(X). However, it is also not safe to hand the user the .fit_transform(X, y) result when they call .fit(X, y).transform(X), as the cross-validation gets rid of the very strong nested model bias in .fit(X, y).transform(X) but exposes a bit of negative nested model bias. So I want to encourage users who want to call .fit(X, y).transform(X) to instead call .fit(X1, y1).transform(X2), where X1, X2 are a random disjoint partition of X (and y1 is the corresponding part of y).
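Roughly, that recommended pattern looks like this (a sketch using the example data from earlier in this issue; the split size is arbitrary):

import vtreat
from sklearn.model_selection import train_test_split

# X, y as in the example above; split into two disjoint pieces
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

split_transform = vtreat.BinomialOutcomeTreatment(outcome_target=True)

# fit the treatment on one piece ...
split_transform.fit(X1, y1)

# ... and apply it only to the other piece, avoiding the nested model bias
X2_treated = split_transform.transform(X2)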

Overall, in the cross-validated mode, not only do the impact_code variables code to something different than through the .transform() method, they are also not functions of the input variable alone even in the .fit_transform() step: they are functions of both the input variable values and the cross-fold ids.
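For example (a sketch re-using transform, cross_frame, and X_train from the earlier sketch; calling .transform() on the fitting data may itself trigger the warning discussed above):

same_data_frame = transform.transform(X_train)
# the derived columns generally differ, because the cross-frame values
# also depend on which cross-validation fold each training row fell into
print(cross_frame.equals(same_data_frame))  # typically False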

I did add warnings based on caching the id of the data used in .fit(), so the issue is now more visible to the user.

I've spent some more time researching sklearn objects and added a lot more methods to make the vtreat steps duck-type to these structures.

Regarding parameters, I am still not exposing them. You correctly identified the most interesting one: indicator_min_fraction. My intent there is that one would set indicator_min_fraction to the smallest value one is willing to work with and then use a later sklearn stage to throw away columns one does not want, or even leave this to the modeling step. I think this is fairly compatible with sklearn; it is a bit more inconvenient, but leaving column filtering to a later step is a good approach.
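A sketch of that arrangement, with sklearn's VarianceThreshold standing in as the later column-filtering stage (the threshold value is only illustrative, and I'm assuming indicator_min_fraction is set through vtreat.vtreat_parameters()):

import vtreat
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression

# assumed: a low indicator_min_fraction so many indicator columns survive vtreat
treatment = vtreat.BinomialOutcomeTreatment(
    outcome_target=True,
    params=vtreat.vtreat_parameters({'indicator_min_fraction': 0.01}),
)

clf = Pipeline(steps=[
    ('preprocessor', treatment),
    # a later stage drops near-constant columns, such as very rare indicators
    ('filter', VarianceThreshold(threshold=0.01)),
    ('classifier', LogisticRegression()),
])

clf.fit(X_train, y_train)  # X_train, y_train as in the example above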

If you strongly disagree, or have new ideas, or I have missed something, please do re-open this issue or file another one. If anything is unclear open an issue and I will be happy to build up more documentation.

JohnMount commented 4 years ago

I've got it: configure which parameters are exposed to the pipeline controls during construction. I am going to work on that a bit.

JohnMount commented 4 years ago

I've worked out an example of vtreat in a pipeline used in a hyper-parameter search, using an adapter: https://github.com/WinVector/pyvtreat/blob/main/Examples/Pipeline/Pipeline_Example.ipynb . Overall I don't find the combination that helpful, so unless I get a specific request with a good example I am not going to integrate further.
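Roughly, the adapter idea is the following (a simplified sketch, not the notebook's exact code; the exposed parameter and the reconstruction inside fit() are illustrative choices):

import vtreat
from sklearn.base import BaseEstimator, TransformerMixin

class VtreatBinomialAdapter(BaseEstimator, TransformerMixin):
    # hypothetical adapter exposing one chosen vtreat parameter to sklearn searches
    def __init__(self, indicator_min_fraction=0.1):
        # stored as a plain attribute so BaseEstimator's get_params()/set_params() see it
        self.indicator_min_fraction = indicator_min_fraction

    def _make_treatment(self):
        return vtreat.BinomialOutcomeTreatment(
            outcome_target=True,
            params=vtreat.vtreat_parameters(
                {'indicator_min_fraction': self.indicator_min_fraction}),
        )

    def fit(self, X, y):
        self.treatment_ = self._make_treatment()
        self.treatment_.fit(X, y)
        return self

    def fit_transform(self, X, y=None):
        # use vtreat's cross-frame for the data the adapter is fit on
        self.treatment_ = self._make_treatment()
        return self.treatment_.fit_transform(X, y)

    def transform(self, X):
        return self.treatment_.transform(X)

A GridSearchCV over Pipeline(steps=[('prep', VtreatBinomialAdapter()), ('clf', LogisticRegression())]) could then tune the exposed parameter with a grid such as {'prep__indicator_min_fraction': [0.05, 0.1, 0.2]}.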

The issues include: