WinVector / pyvtreat

vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under a BSD-3-Clause license.
https://winvector.github.io/pyvtreat/

Reproducibility #16

Closed. lucapalazzi closed this issue 4 years ago.

lucapalazzi commented 4 years ago

Hi, after applying fit_transform() and finding the set of useful features, I would like to reuse that same feature set on another dataset that has the same original features but different observations. Is there a specific way to achieve this reproducibility? I suppose that reapplying fit_transform() could produce a different set of features. I tried calling fit() and transform() separately, but maybe there is a dedicated function for this (and the "UserWarning: possibly called transform on same data used to fit (this causes over-fit, please use fit_transform() instead)" tells me that this is probably not the correct approach).

Thanks in advance and congratulations on your excellent work.

JohnMount commented 4 years ago

What to do depends on what you are trying to reproduce.

The standard use of vtreat is: fit_transform() once on training data, then transform() on all future application and test data. If you want to use a similar set of variables on a later project, you can capture the variable names and later use them. You get complete control of what variables are produced if you set filter_to_recommended=False in the vtreat parameters and then decide which columns to retain outside of vtreat.

The R version of vtreat has more direct control of which variables are produced, because its API differs from the standard scikit-learn API. This is something we could port to the Python version at some point if there is enough interest in such a feature.

JohnMount commented 4 years ago

Actually, let me re-open that. Your points stayed with me, and I decided to port some of the R features that help explicitly control the variable set to Python. Sorry if I was hasty.

vtreat-0.4.6 is in testing/development now and will have a new method, likely called .set_result_restriction(), which will specify which columns are to be produced. The current filter_to_recommended parameter will change to use this same code/data path to control which variables are produced. Overriding it then becomes the simple step of explicitly calling .set_result_restriction() after .fit() or .fit_transform() to take control of which new variables are produced.

JohnMount commented 4 years ago

I've added some ability to directly specify what columns to produce, and some examples using it can be found in the test here: https://github.com/WinVector/pyvtreat/blob/main/pkg/tests/test_result_restriction.py .