Closed lucapalazzi closed 4 years ago
What to do depends on what you are trying to reproduce. The standard use of vtreat is: fit_transform() once on training data, then transform() on all future application and test data. If you want to use a similar set of variables on a later project, you can capture the variable names and reuse them later. You get complete control over which variables are produced if you set filter_to_recommended=False in the vtreat parameters and then decide which columns to retain outside of vtreat.
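The "capture the variable names, then select columns outside of vtreat" workflow described above could be sketched roughly as follows. This is only an illustration, not part of vtreat: the helper names recommended_variables and transform_with_fixed_columns are hypothetical, and the sketch assumes the fitted treatment exposes a score frame with "variable" and "recommended" columns and was configured with filter_to_recommended=False so that transform() returns every derived column.

```python
import pandas as pd


def recommended_variables(score_frame: pd.DataFrame) -> list:
    """Return the names of the variables flagged as recommended.

    Assumes a score frame with a 'variable' column (names) and a boolean
    'recommended' column, as in the score frame of a fitted treatment.
    """
    keep = score_frame.loc[score_frame["recommended"], "variable"]
    return list(keep)


def transform_with_fixed_columns(treatment, new_data: pd.DataFrame,
                                 columns: list) -> pd.DataFrame:
    """Apply an already-fitted treatment and keep only the saved columns.

    'treatment' is any object with a .transform(DataFrame) method, for
    example a vtreat treatment fitted with filter_to_recommended=False.
    It is NOT re-fitted here, so the variable definitions stay fixed.
    """
    prepared = treatment.transform(new_data)
    # Fail loudly if a saved column is no longer produced.
    missing = [c for c in columns if c not in prepared.columns]
    if missing:
        raise KeyError(f"columns not produced by the treatment: {missing}")
    return prepared[columns]
```

In use, one would call fit_transform() once on the training data, save the list returned by recommended_variables() for the treatment's score frame, and then pass that list to transform_with_fixed_columns() for each new dataset. Because only transform() is called on the new data, the same columns come back in the same order.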
The R version of vtreat has more direct control over which variables are produced, as its API differs from the standard scikit-learn API. This is something we could port to the Python version at some point if there is enough interest in such a feature.
Actually, let me re-open that. Your points stayed with me, and I decided to port some of the R features that help explicitly control the variable set to Python. Sorry if I was hasty.
vtreat-0.4.6 is in testing/development now and will have a new method, likely called .set_result_restriction(), which will specify which columns are to be produced. The current filter_to_recommended parameter will change to use this code/data path to control which variables are produced. So overriding it becomes the simple step of explicitly calling .set_result_restriction() after .fit() or .fit_transform() to declare control of the new variables produced.
I've added some ability to directly specify what columns to produce, and some examples using it can be found in the test here: https://github.com/WinVector/pyvtreat/blob/main/pkg/tests/test_result_restriction.py .
Hi, after applying fit_transform and finding the set of useful features, I would like to be able to use the same set on another dataset composed of the same original features and different observations. Is there a specific way to achieve this reproducibility? I suppose that reapplying fit_transform can lead to a different set of features; I tried applying fit and transform separately, but maybe there is a dedicated function (and the UserWarning: possibly called transform on same data used to fit (this causes over-fit, please use fit_transform() instead) tells me that it is probably not the correct approach). Thanks in advance and congratulations on your excellent work.