WinVector / pyvtreat

vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under a BSD-3-Clause license.
https://winvector.github.io/pyvtreat/

Indicator Code is False for has_range/example outdated? #7

Closed jtanman closed 5 years ago

jtanman commented 5 years ago

Even when running the example code, my prepared data frame doesn't include any indicator_code variables. When I check transform.score_frame_, the indicator_code variables have has_range set to False, which may be why they weren't created. Is this intended? The example data clearly does have multiple levels that vary. Also, would it help to update the example to the latest behavior? Thank you!

ipdb> !d.head()
     x         y          xc        x2  x3
0  0.0 -0.111698  level_-0.0 -0.098463   1
1  0.1  0.270348   level_0.5  0.370653   1
2  0.2 -0.057853  level_-0.0  0.111180   1
3  0.3  0.412467   level_0.5  1.305242   1
4  0.4  0.469221   level_0.5  0.490332   1

ipdb> d_prepared.columns
Index(['y', 'xc_is_bad', 'x', 'x2', 'xc_prevalence_code'], dtype='object')

ipdb> transform.score_frame_
              variable orig_variable          treatment  y_aware  has_range  PearsonR  significance  recommended  vcount
0            xc_is_bad            xc  missing_indicator    False       True       NaN           NaN         True     1.0
1                    x             x         clean_copy    False       True       NaN           NaN         True     2.0
2                   x2            x2         clean_copy    False       True       NaN           NaN         True     2.0
3   xc_prevalence_code            xc    prevalence_code    False       True       NaN           NaN         True     1.0
4    xc_lev_level_-0.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
5     xc_lev_level_1.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
6     xc_lev_level_0.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
7    xc_lev_level_-0.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
8          xc_lev__NA_            xc     indicator_code    False      False       NaN           NaN        False     7.0
9     xc_lev_level_1.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
10    xc_lev_level_0.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
JohnMount commented 5 years ago

The indicators should code, and should have range, unless they don't vary ("move"), are rarer than a user-settable threshold, or are prohibited by a user control. It looks like you were running a variation of https://github.com/WinVector/pyvtreat/blob/master/Examples/Unsupervised/Unsupervised.md . I just re-ran that with vtreat version 0.3.0 (the latest on PyPI) and the levels had range and coded properly for me ( https://github.com/WinVector/pyvtreat/blob/master/Examples/Unsupervised/Unsupervised.ipynb ). I would suggest re-installing vtreat to make sure you are on the latest version and re-running that example in its entirety. If you still have problems, please re-open this with a complete failing example and I will see if I can look into it.
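As a hypothetical illustration of the rarity rule described above (this is not vtreat's actual code, and min_fraction is just a stand-in name for the user-settable threshold), the decision could be sketched as:

```python
import pandas as pd

# Illustrative sketch only: a level's indicator is worth coding when the
# level both varies ("moves") and is not rarer than a threshold.
xc = pd.Series(['a'] * 9 + ['b'])
min_fraction = 0.05  # stand-in for the user-settable rarity threshold

prevalence = xc.value_counts(normalize=True)
codeable = sorted(lev for lev, frac in prevalence.items() if frac >= min_fraction)
# both levels clear the 5% threshold here, so both would get indicator columns
```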

jtanman commented 5 years ago

So I uninstalled and reinstalled again, am using version 0.3.0, and am getting the same results. Here is my entire file and outputs.

import pkg_resources
import pandas
import numpy
import numpy.random
import seaborn
import matplotlib.pyplot as plt
import vtreat
import vtreat.util
import wvpy.util

import ipdb

def make_data(nrows):
    d = pandas.DataFrame({'x': [0.1*i for i in range(nrows)]})  # use nrows, not a hard-coded 500
    d['y'] = numpy.sin(d['x']) + 0.01*d['x'] + 0.1*numpy.random.normal(size=d.shape[0])
    d['xc'] = ['level_' + str(5*numpy.round(yi/5, 1)) for yi in d['y']]
    d['x2'] = numpy.random.normal(size=d.shape[0])
    d['x3'] = 1
    d.loc[d['xc'] == 'level_-1.0', 'xc'] = numpy.nan  # introduce a nan level
    return d

d = make_data(500)

d.head()

d['xc'].unique()
d['xc'].value_counts(dropna=False)

transform = vtreat.UnsupervisedTreatment(
     cols_to_copy = ['y'],          # columns to "carry along" but not treat as input variables
)

d_prepared = transform.fit_transform(d)

ipdb.set_trace()

Outputs

ipdb> !d.head()
     x         y          xc        x2  x3
0  0.0  0.126101   level_0.0 -2.257337   1
1  0.1 -0.179864  level_-0.0 -0.764967   1
2  0.2  0.393881   level_0.5  1.112595   1
3  0.3  0.262937   level_0.5 -0.149371   1
4  0.4  0.484316   level_0.5  1.068538   1

ipdb> transform.score_frame_
              variable orig_variable          treatment  y_aware  has_range  PearsonR  significance  recommended  vcount
0            xc_is_bad            xc  missing_indicator    False       True       NaN           NaN         True     1.0
1                    x             x         clean_copy    False       True       NaN           NaN         True     2.0
2                   x2            x2         clean_copy    False       True       NaN           NaN         True     2.0
3   xc_prevalence_code            xc    prevalence_code    False       True       NaN           NaN         True     1.0
4     xc_lev_level_1.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
5    xc_lev_level_-0.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
6     xc_lev_level_0.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
7    xc_lev_level_-0.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
8     xc_lev_level_0.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
9          xc_lev__NA_            xc     indicator_code    False      False       NaN           NaN        False     7.0
10    xc_lev_level_1.5            xc     indicator_code    False      False       NaN           NaN        False     7.0

ipdb> d_prepared.head()
          y  xc_is_bad    x        x2  xc_prevalence_code
0  0.126101        0.0  0.0 -2.257337               0.076
1 -0.179864        0.0  0.1 -0.764967               0.106
2  0.393881        0.0  0.2  1.112595               0.172
3  0.262937        0.0  0.3 -0.149371               0.172
4  0.484316        0.0  0.4  1.068538               0.172
jtanman commented 5 years ago

I also get the same result when running the IPython notebook, by the way. Here's the relevant part of the output:

d.head()
     x         y         xc        x2  x3
0  0.0  0.046681  level_0.0 -1.484431   1
1  0.1  0.128256  level_0.0  0.675008   1
2  0.2  0.251777  level_0.5 -1.002255   1
3  0.3  0.339893  level_0.5  0.490650   1
4  0.4  0.328069  level_0.5 -0.738032   1
Some quick data exploration
Check how many levels xc has, and their distribution (including NaN)

d['xc'].unique()
array(['level_0.0', 'level_0.5', 'level_1.0', 'level_-0.0', 'level_-0.5',
       nan, 'level_1.5'], dtype=object)
d['xc'].value_counts(dropna=False)
level_-0.5    131
level_1.0     125
level_0.5      83
level_0.0      45
level_-0.0     44
level_1.5      39
NaN            33
Name: xc, dtype: int64
Build a transform appropriate for unsupervised (or non-y-aware) problems.
The vtreat package is primarily intended for data treatment prior to supervised learning, as detailed in the Classification and Regression examples. In these situations, vtreat specifically uses the relationship between the inputs and the outcomes in the training data to create certain types of synthetic variables. We call these more complex synthetic variables y-aware variables.

However, you may also want to use vtreat for basic data treatment for unsupervised problems, when there is no outcome variable. Or, you may not want to create any y-aware variables when preparing the data for supervised modeling. For these applications, vtreat is a convenient alternative to pandas.get_dummies() or sklearn.preprocessing.OneHotEncoder().
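For comparison, a minimal sketch of the pandas.get_dummies() route (the column and level names here are just illustrative, echoing this example's xc variable):

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'xc': ['level_0.0', 'level_0.5', np.nan, 'level_0.5']})
# dummy_na=True adds an explicit indicator column for NaN entries,
# loosely analogous to vtreat's _NA_ indicator
dummies = pd.get_dummies(d['xc'], prefix='xc', dummy_na=True)
# yields columns xc_level_0.0, xc_level_0.5, and xc_nan
```

Note that unlike vtreat, get_dummies() learns nothing at fit time, so novel levels seen at application time silently produce different columns.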

In any case, we still want training data where all the input variables are numeric and have no missing values or NaNs.

First create the data treatment transform object, in this case a treatment for an unsupervised problem.

transform = vtreat.UnsupervisedTreatment(
     cols_to_copy = ['y'],          # columns to "carry along" but not treat as input variables
)  
Use the training data d to fit the transform and return a treated training set: completely numeric, with no missing values.

d_prepared = transform.fit_transform(d)
Now examine the score frame, which gives information about each new variable, including its type and which original variable it is derived from. Some of the columns of the score frame (y_aware, PearsonR, significance and recommended) are not relevant to the unsupervised case; those columns are used by the Regression and Classification transforms.

transform.score_frame_
              variable orig_variable          treatment  y_aware  has_range  PearsonR  significance  recommended  vcount
0            xc_is_bad            xc  missing_indicator    False       True       NaN           NaN         True     1.0
1                    x             x         clean_copy    False       True       NaN           NaN         True     2.0
2                   x2            x2         clean_copy    False       True       NaN           NaN         True     2.0
3   xc_prevalence_code            xc    prevalence_code    False       True       NaN           NaN         True     1.0
4    xc_lev_level_-0.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
5     xc_lev_level_1.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
6     xc_lev_level_0.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
7     xc_lev_level_0.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
8    xc_lev_level_-0.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
9     xc_lev_level_1.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
10         xc_lev__NA_            xc     indicator_code    False      False       NaN           NaN        False     7.0
jtanman commented 5 years ago

So I did some of my own digging and isolated the problem to the numpy.asarray() function from this part of the code:

def has_range(x):
    x = numpy.asarray(x)
    return numpy.max(x) > numpy.min(x)

score_frame["has_range"] = [
    has_range(cross_frame[c]) for c in score_frame["variable"]
]

The numpy.asarray() call here collapses the sparse column down to only its stored (non-fill) values, so the resulting array contains a single distinct value. Let cf be the variable cross_frame[c]; this is what numpy.asarray() does:

ipdb> cf
0     0.0
1     0.0
2     0.0
3     0.0
4     0.0
5     0.0
6     0.0
7     0.0
8     0.0
9     0.0
10    1.0
11    1.0
12    1.0
13    1.0
14    1.0
15    1.0
16    1.0
17    1.0
18    1.0
19    1.0
20    1.0
21    1.0
22    1.0
23    0.0
24    0.0
Name: xc_lev_level_1.0, dtype: float64
IntIndex
Indices: array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], dtype=int32)

ipdb> cf.max()
1.0
ipdb> cf.min()
0.0
ipdb> x = numpy.asarray(cf)
ipdb> x
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
ipdb> x.min()
1.0
ipdb> x.max()
1.0

I don't understand the entire codebase, so I don't necessarily know what the right fix is or what the purpose of the asarray() call is, but this is definitely the issue here. By the way, I'm using the latest version of numpy (1.17.3).

jtanman commented 5 years ago

I've been able to reproduce this issue with numpy and will submit an issue on their repo as well.

import numpy as np
import pandas as pd

arr = np.random.randint(2, size=10)
sparr = pd.SparseArray(arr, fill_value=0)
np_arr = np.asarray(sparr)

ipdb> arr
array([1, 0, 1, 1, 1, 0, 1, 1, 0, 0])
ipdb> sparr
[1, 0, 1, 1, 1, 0, 1, 1, 0, 0]
Fill: 0
IntIndex
Indices: array([0, 2, 3, 4, 6, 7], dtype=int32)

ipdb> np_arr
array([1, 1, 1, 1, 1, 1])
jtanman commented 5 years ago

My suggestion is just to change the function from

def has_range(x):
    x = numpy.asarray(x)
    return numpy.max(x) > numpy.min(x)

to

def has_range(x):
    x = numpy.asarray(pandas.Series(x))
    return numpy.max(x) > numpy.min(x)
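A quick check of this proposed fix (written here with pd.arrays.SparseArray, the non-deprecated spelling of pandas.SparseArray in newer pandas versions) shows that routing through pandas.Series densifies the sparse column, so min/max see the fill values too:

```python
import numpy as np
import pandas as pd

def has_range(x):
    # densify via pandas.Series before converting to a numpy array, so a
    # sparse column reports its true min/max (including the fill value)
    x = np.asarray(pd.Series(x))
    return np.max(x) > np.min(x)

sparse_col = pd.arrays.SparseArray([0.0, 0.0, 1.0, 1.0, 0.0], fill_value=0.0)
assert has_range(sparse_col)  # True: the column takes both 0.0 and 1.0
```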

I tried creating a branch and submitting a pull request, but I'm not very familiar with the process and may not have access.

JohnMount commented 5 years ago

Wow, thanks for running that down. I'll see what I can do, and see about accepting your pull request.

JohnMount commented 5 years ago

Thanks for mentioning the numpy version. It turns out I am using numpy 1.16.4 and pandas 0.25.0, which do not seem to show the issue you detected (which is why I didn't see it in vtreat on my end). Here is my re-run: https://github.com/WinVector/pyvtreat/blob/master/Examples/Bugs/asarray_issue.md . I am looking into taking your fix; the build files are derived, so I'll rebuild those.

JohnMount commented 5 years ago

I am testing a fix (based on your idea, but now a bit different from your pull request): https://github.com/WinVector/pyvtreat/tree/master/pkg/dist

jtanman commented 5 years ago

Ah perfect, thank you for being so prompt! And thanks for developing vtreat in the first place, it's saved me so much time!