Thanks,
I haven't finished the Pipeline integration, but you have given me some good pointers on steps to get there. I'll close this after I add some of the suggestions.
In version 0.3.4 of vtreat the transforms implement a lot more of the sklearn step interface. Also, we have a really neat new feature that warns if `.fit(X, y).transform(X)` is ever called (it looks like sklearn `Pipeline` calls `.fit_transform(X, y)`, which is what vtreat wants, as it prevents over-fit issues using its cross-frame methodology). I have re-run your example here: https://github.com/WinVector/pyvtreat/blob/master/Examples/Pipeline/Pipeline_Example.md
Thanks for helping with the package.
Thanks!
Some of the support classes could also use a `__repr__` method, e.g.:

`'cross_validation_plan': <vtreat.cross_plan.KWayCrossPlanYStratified object at 0x10fa81b50>`

I don't know if I'd use `vtreat` directly in a `Pipeline` like in my example, but I'll think about it - Pipelines are nice, and `GridSearch` for `vtreat` parameters would be cool.

I've added `__repr__()` to the support classes. I am not sure exposing the `vtreat` parameters is that much of a benefit. Would it be better if `vtreat` hid its parameters from the pipeline? Also, `vtreat` isn't using the cross-validation effects to search for hyper-parameter values. It is using them to try and avoid the nested-model bias issue seen in non-cross-validated stacked models. So there may be less of a connection to `GridSearchCV` than it first appears.
I am going to stub out the get/set parameters until I have some specific use-cases/applications to code them to (are they tuned over during cross-validation, are they used to build new pipelines, are they just for display, are they used to simulate pickling?). I've added some more pretty-printing, but a lot of these objects are too complicated to be re-built from their printable form.
Regarding `get_params` and `set_params`, I see the following cases (see the sketch after this list):

i) when you extend a base class, calling `get_params` via `super` in the child `__init__` is often used to reduce boiler-plate code
ii) just for display
iii) just for compatibility with `sklearn.base.BaseEstimator`, to avoid monkey-patching
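For example, `sklearn.base.BaseEstimator` derives `get_params`/`set_params` by introspecting the `__init__` signature, so inheriting from it covers cases i) and iii) almost for free (the class and parameter names below are made up):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class MyTreatment(BaseEstimator, TransformerMixin):
    # BaseEstimator.get_params() introspects these __init__ arguments
    def __init__(self, indicator_min_fraction=0.1, sparse_indicators=True):
        self.indicator_min_fraction = indicator_min_fraction
        self.sparse_indicators = sparse_indicators

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

mt = MyTreatment(indicator_min_fraction=0.05)
print(mt.get_params())
# {'indicator_min_fraction': 0.05, 'sparse_indicators': True}
mt.set_params(indicator_min_fraction=0.2)  # what GridSearchCV calls under the hood
```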
I was thinking about `fit(X_tr, y_tr).transform(X_tr)` vs `fit_transform(X_tr, y_tr)`, and correct me if I'm wrong:

i) the mismatch arises when target-encoding is used for high-cardinality categorical variables; e.g. zipcode `36104` from train could be target-encoded to 0.7 by `fit_transform`, but the same zipcode in the test set could be target-encoded to 0.8. So basically there are two "mappings".
ii) in the `fit` method there is an internal call to `self.fit_transform(X=X, y=y)`, e.g. line 216 of https://github.com/WinVector/pyvtreat/blob/master/pkg/build/lib/vtreat/vtreat_api.py - as a result X is transformed anyway, but the result is not stored.

So, here is an idea (sketched in code after the list):
- add an attribute `.X_fit_tr` and store the result of the internal `.fit_transform` from `.fit` in it
- add an attribute `.X_fit` and store the input X (or some hash-id of it, to save memory)
- modify `.transform(X)` by adding the condition `if X == self.X_fit: return self.X_fit_tr`

In this way we would have `fit(X_tr, y_tr).transform(X_tr) == fit_transform(X_tr, y_tr)`. Alternatively, instead of storing `.X_fit_tr`, just store the "mapping" and apply it during transform (if possible). This alternative is more memory efficient, and `fit` stays separated from `transform`.
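A rough sketch of that caching idea (the subclass, `_frame_hash`, and the `X_fit_tr_`/`X_fit_hash_` attributes are hypothetical, not part of vtreat):

```python
import hashlib

import pandas as pd
import vtreat

def _frame_hash(X: pd.DataFrame) -> str:
    # cheap content fingerprint, so we don't have to keep a copy of X around
    return hashlib.sha256(
        pd.util.hash_pandas_object(X, index=True).values.tobytes()
    ).hexdigest()

class CachingNumericTreatment(vtreat.NumericOutcomeTreatment):
    def fit(self, X, y=None):
        # fit() computes the cross-frame internally anyway; this time keep it
        self.X_fit_tr_ = super().fit_transform(X, y)
        self.X_fit_hash_ = _frame_hash(X)
        return self

    def transform(self, X):
        # same data as was passed to fit() -> hand back the stored cross-frame
        if getattr(self, "X_fit_hash_", None) == _frame_hash(X):
            return self.X_fit_tr_
        return super().transform(X)
```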
Regarding `GridSearchCV`, I was thinking about e.g. the `indicator_min_fraction` parameter, checking the values 0.05, 0.1, 0.2. Within a pipeline it should be completely independent, aside from the stuff in point ii) above.
Thanks for the explanations!
First, thank you very much for spending so much time to give useful and productive advice. I've tried to incorporate a lot of it into `vtreat`. It is very important to me that `vtreat` be Pythonic and sklearn-idiomatic.
Back to your points.
Yes, `vtreat` uses cross-validated out-of-sample methods in `fit_transform()` to actually implement `fit`, and then throws the transform frame away. The out-of-sample frame is needed to get accurate estimates of out-of-sample performance for the score frame.

I've decided not to cache the result for use in a later `transform()` step. My concerns are that this is a reference leak to a large object, and that I really should not paper over the difference between simulated out-of-sample methods and split methods (using different data for `.fit()` and `.transform()`). It is indeed not sklearn-like for `.fit_transform(X, y)` to not return the same answer as `.fit(X, y).transform(X)`. However, it is also not safe to supply the user with the `.fit_transform(X, y)` result when they call `.fit(X, y).transform(X)`: the cross-validation gets rid of the very strong nested-model bias in `.fit(X, y).transform(X)`, but exposes a bit of negative nested-model bias. So I want to encourage users who want to call `.fit(X, y).transform(X)` to instead call `.fit(X1, y1).transform(X2)`, where `X1` and `X2` are a random disjoint partition of `X` (and `y1` the matching part of `y`), as sketched below.
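Something like this (a sketch; the data and the split ratio are just examples):

```python
import numpy as np
import pandas as pd
import vtreat
from sklearn.model_selection import train_test_split

# illustrative data: a high-cardinality categorical and a numeric outcome
rng = np.random.default_rng(0)
X = pd.DataFrame({'zipcode': rng.choice([str(z) for z in range(36100, 36150)], size=1000)})
y = pd.Series(rng.normal(size=1000))

# random disjoint partition of (X, y)
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

treatment = vtreat.NumericOutcomeTreatment(outcome_name='y')
treatment.fit(X1, y1)                  # fit on one half of the data...
X2_treated = treatment.transform(X2)   # ...apply to the disjoint other half
```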
Overall, in the cross-validated mode not only do the `impact_code` variables code to something different than through the `.transform()` method, they are also not functions of the input variable alone, even in the `.fit_transform()` step - they are functions of both the input variable values and the cross-fold ids.

I did add warnings based on caching the id of the data used in `.fit()`, so I have made the issue more visible to the user.
I've spent some more time researching `sklearn` objects and added a lot more methods to make the `vtreat` steps duck-type to these structures.
Regarding parameters, I am still not exposing them. You correctly identified the most interesting one: `indicator_min_fraction`. My intent there is that one would set `indicator_min_fraction` to the smallest value you are willing to work with, and then use a later `sklearn` stage to throw away the columns you do not want, or even leave this to the modeling step. I think this is fairly compatible with `sklearn` - a bit more inconvenient, but leaving column filtering to a later step is a good approach (a sketch follows).
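A sketch of the "generate generously, filter later" idea; the `VarianceThreshold` stage and the numeric values are illustrative choices, not recommendations:

```python
import vtreat
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    # small indicator_min_fraction: let vtreat emit many indicator columns
    ('preprocessor', vtreat.BinomialOutcomeTreatment(
        outcome_target=True,
        params=vtreat.vtreat_parameters({'indicator_min_fraction': 0.01}),
    )),
    # a later stage throws away the columns we do not want
    ('filter', VarianceThreshold(threshold=1e-3)),
    ('model', LogisticRegression()),
])
```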
If you strongly disagree, or have new ideas, or I have missed something, please do re-open this issue or file another one. If anything is unclear, open an issue and I will be happy to build up more documentation.
I've got it: configure which parameters are exposed to the pipeline controls during construction. I am going to work on that a bit.
I've worked out an example of `vtreat` in a pipeline used in hyper-parameter search, via an adapter: https://github.com/WinVector/pyvtreat/blob/main/Examples/Pipeline/Pipeline_Example.ipynb . Overall I don't find the combination that helpful, so unless I get a specific request with a good example I am not going to integrate it further. The adapter is roughly the shape sketched below.
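A hypothetical sketch of such an adapter (the class name and defaults are made up; the linked notebook is the authoritative version):

```python
import vtreat
from sklearn.base import BaseEstimator, TransformerMixin

class VtreatAdapter(BaseEstimator, TransformerMixin):
    """Expose selected vtreat parameters through sklearn's get_params/set_params."""

    def __init__(self, indicator_min_fraction=0.1):
        self.indicator_min_fraction = indicator_min_fraction

    def fit(self, X, y=None):
        # build a fresh treatment with the current hyper-parameter setting
        self.treatment_ = vtreat.BinomialOutcomeTreatment(
            outcome_target=True,
            params=vtreat.vtreat_parameters(
                {'indicator_min_fraction': self.indicator_min_fraction}),
        )
        self.treatment_.fit(X, y)
        return self

    def transform(self, X):
        return self.treatment_.transform(X)
```

`GridSearchCV` can then tune `indicator_min_fraction` through the adapter's exposed parameter.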
The issues include:
First of all, really interesting project that could save a lot of repetitive work and provide a good baseline. I've tried to find an example in the docs that uses `Pipeline` from `scikit-learn`, but I didn't find one, so here is my quick and dirty attempt based on yours:
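(A minimal sketch; the toy data is illustrative, and the pipeline is named `clf` with a `'preprocessor'` step, matching the notes below.)

```python
import numpy as np
import pandas as pd
import vtreat
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# toy data with a high-cardinality categorical column
rng = np.random.default_rng(0)
d = pd.DataFrame({
    'zipcode': rng.choice([str(z) for z in range(36100, 36150)], size=1000),
    'x_num': rng.normal(size=1000),
})
y = pd.Series(rng.binomial(1, 0.3, size=1000))

clf = Pipeline([
    ('preprocessor', vtreat.BinomialOutcomeTreatment(outcome_target=True)),
    ('model', LogisticRegression()),
])
clf.fit(d, y)           # Pipeline calls preprocessor.fit_transform(d, y)
preds = clf.predict(d)  # predictions on the training data, for illustration
```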
In general, it seems to work, but:

- `__repr__`, `get_params` etc., to have a nice representation in `Pipeline`
- a `get_feature_names` method, to have `clf['preprocessor'].get_feature_names()`
- I needed `cols_to_copy` and to drop `y` manually, to avoid leaking `y`
- `vtreat.cross_plan...` could be replaced by validation schemes from `scikit-learn`, like `GridSearchCV`