The indicators should code and have range unless they don't move, are rarer than a user setting, or are prohibited by the user controls. It looks like you were running a variation of https://github.com/WinVector/pyvtreat/blob/master/Examples/Unsupervised/Unsupervised.md . I just re-ran that with vtreat version 0.3.0 (the latest on PyPI) and the levels had range and coded properly for me ( https://github.com/WinVector/pyvtreat/blob/master/Examples/Unsupervised/Unsupervised.ipynb ). I would suggest re-installing vtreat to make sure you are on the latest version and re-running that example in its entirety. If you still have problems, please re-open the issue with a complete failing example and I can see if I can look into it.
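(Aside: a minimal way to confirm which vtreat version the interpreter is actually picking up, assuming a standard setuptools-based install:)

import pkg_resources

# Should print '0.3.0' if the reinstall took effect in this environment.
print(pkg_resources.get_distribution('vtreat').version)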
So I uninstalled and reinstalled again, am using version 0.3.0, and am getting the same results. Here is my entire file and outputs.
import pkg_resources
import pandas
import numpy
import numpy.random
import seaborn
import matplotlib.pyplot as plt
import vtreat
import vtreat.util
import wvpy.util
import ipdb
def make_data(nrows):
    d = pandas.DataFrame({'x': [0.1*i for i in range(nrows)]})  # use nrows (was hard-coded to 500)
    d['y'] = numpy.sin(d['x']) + 0.01*d['x'] + 0.1*numpy.random.normal(size=d.shape[0])
    d['xc'] = ['level_' + str(5*numpy.round(yi/5, 1)) for yi in d['y']]  # bin y into 0.5-wide levels
    d['x2'] = numpy.random.normal(size=d.shape[0])
    d['x3'] = 1  # constant column, has no range
    d.loc[d['xc'] == 'level_-1.0', 'xc'] = numpy.nan  # introduce a nan level
    return d
d = make_data(500)
d.head()
d['xc'].unique()
d['xc'].value_counts(dropna=False)
transform = vtreat.UnsupervisedTreatment(
    cols_to_copy = ['y'],  # columns to "carry along" but not treat as input variables
)
d_prepared = transform.fit_transform(d)
ipdb.set_trace()
Outputs
ipdb> !d.head()
x y xc x2 x3
0 0.0 0.126101 level_0.0 -2.257337 1
1 0.1 -0.179864 level_-0.0 -0.764967 1
2 0.2 0.393881 level_0.5 1.112595 1
3 0.3 0.262937 level_0.5 -0.149371 1
4 0.4 0.484316 level_0.5 1.068538 1
ipdb> transform.score_frame_
variable orig_variable treatment y_aware has_range PearsonR significance recommended vcount
0 xc_is_bad xc missing_indicator False True NaN NaN True 1.0
1 x x clean_copy False True NaN NaN True 2.0
2 x2 x2 clean_copy False True NaN NaN True 2.0
3 xc_prevalence_code xc prevalence_code False True NaN NaN True 1.0
4 xc_lev_level_1.0 xc indicator_code False False NaN NaN False 7.0
5 xc_lev_level_-0.5 xc indicator_code False False NaN NaN False 7.0
6 xc_lev_level_0.5 xc indicator_code False False NaN NaN False 7.0
7 xc_lev_level_-0.0 xc indicator_code False False NaN NaN False 7.0
8 xc_lev_level_0.0 xc indicator_code False False NaN NaN False 7.0
9 xc_lev__NA_ xc indicator_code False False NaN NaN False 7.0
10 xc_lev_level_1.5 xc indicator_code False False NaN NaN False 7.0
ipdb> d_prepared.head()
y xc_is_bad x x2 xc_prevalence_code
0 0.126101 0.0 0.0 -2.257337 0.076
1 -0.179864 0.0 0.1 -0.764967 0.106
2 0.393881 0.0 0.2 1.112595 0.172
3 0.262937 0.0 0.3 -0.149371 0.172
4 0.484316 0.0 0.4 1.068538 0.172
I also get the same result when running the IPython notebook, by the way. Here's the relevant part of the output:
d.head()
x y xc x2 x3
0 0.0 0.046681 level_0.0 -1.484431 1
1 0.1 0.128256 level_0.0 0.675008 1
2 0.2 0.251777 level_0.5 -1.002255 1
3 0.3 0.339893 level_0.5 0.490650 1
4 0.4 0.328069 level_0.5 -0.738032 1
Some quick data exploration
Check how many levels xc has, and their distribution (including NaN)
d['xc'].unique()
array(['level_0.0', 'level_0.5', 'level_1.0', 'level_-0.0', 'level_-0.5',
nan, 'level_1.5'], dtype=object)
d['xc'].value_counts(dropna=False)
level_-0.5 131
level_1.0 125
level_0.5 83
level_0.0 45
level_-0.0 44
level_1.5 39
NaN 33
Name: xc, dtype: int64
Build a transform appropriate for unsupervised (or non-y-aware) problems.
The vtreat package is primarily intended for data treatment prior to supervised learning, as detailed in the Classification and Regression examples. In these situations, vtreat specifically uses the relationship between the inputs and the outcomes in the training data to create certain types of synthetic variables. We call these more complex synthetic variables y-aware variables.
However, you may also want to use vtreat for basic data treatment for unsupervised problems, when there is no outcome variable. Or, you may not want to create any y-aware variables when preparing the data for supervised modeling. For these applications, vtreat is a convenient alternative to pandas.get_dummies() or sklearn.preprocessing.OneHotEncoder().
In any case, we still want training data where all the input variables are numeric and have no missing values or NaNs.
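(As an aside for comparison, a minimal sketch of the plain-pandas route on the d built above; dummy_na is a standard pandas option, and the prefix is chosen only to echo vtreat's naming. It yields just the one-hot columns, without vtreat's missing-value indicator or prevalence code.)

# One-hot encode xc directly with pandas; dummy_na=True adds a column
# for the NaN level, loosely analogous to vtreat's xc_is_bad.
dummies = pandas.get_dummies(d['xc'], prefix='xc_lev', dummy_na=True)
print(dummies.head())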
First create the data treatment transform object, in this case a treatment for an unsupervised problem.
transform = vtreat.UnsupervisedTreatment(
    cols_to_copy = ['y'],  # columns to "carry along" but not treat as input variables
)
Use the training data d to fit the transform and then return a treated training set: completely numeric, with no missing values.
d_prepared = transform.fit_transform(d)
Now examine the score frame, which gives information about each new variable, including its type and which original variable it is derived from. Some of the columns of the score frame (y_aware, PearsonR, significance and recommended) are not relevant to the unsupervised case; those columns are used by the Regression and Classification transforms.
transform.score_frame_
variable orig_variable treatment y_aware has_range PearsonR significance recommended vcount
0 xc_is_bad xc missing_indicator False True NaN NaN True 1.0
1 x x clean_copy False True NaN NaN True 2.0
2 x2 x2 clean_copy False True NaN NaN True 2.0
3 xc_prevalence_code xc prevalence_code False True NaN NaN True 1.0
4 xc_lev_level_-0.5 xc indicator_code False False NaN NaN False 7.0
5 xc_lev_level_1.0 xc indicator_code False False NaN NaN False 7.0
6 xc_lev_level_0.5 xc indicator_code False False NaN NaN False 7.0
7 xc_lev_level_0.0 xc indicator_code False False NaN NaN False 7.0
8 xc_lev_level_-0.0 xc indicator_code False False NaN NaN False 7.0
9 xc_lev_level_1.5 xc indicator_code False False NaN NaN False 7.0
10 xc_lev__NA_ xc indicator_code False False NaN NaN False 7.0
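(A small sketch of a common way to use the score frame, reusing the transform fitted above: select the variables marked recommended. In the unsupervised case, as the table shows, recommended tracks has_range rather than significance.)

sf = transform.score_frame_
# Pick out the variables vtreat marked as recommended; with the bug
# described below, the indicator_code variables wrongly drop out here.
good_vars = sf.loc[sf['recommended'], 'variable'].tolist()
print(good_vars)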
So I did some digging of my own and isolated the problem to the numpy.asarray() call in this part of the code:
def has_range(x):
    x = numpy.asarray(x)
    return numpy.max(x) > numpy.min(x)

score_frame["has_range"] = [
    has_range(cross_frame[c]) for c in score_frame["variable"]
]
The numpy.asarray() call here collapses an array with multiple distinct values into an array with a single value: for a sparse column it keeps only the entries stored at the sparse indices and drops the fill values. Let cf be the variable cross_frame[c]; this is what numpy.asarray() does to it:
ipdb> cf
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 0.0
10 1.0
11 1.0
12 1.0
13 1.0
14 1.0
15 1.0
16 1.0
17 1.0
18 1.0
19 1.0
20 1.0
21 1.0
22 1.0
23 0.0
24 0.0
Name: xc_lev_level_1.0, dtype: float64
IntIndex
Indices: array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], dtype=int32)
ipdb> cf.max()
1.0
ipdb> cf.min()
0.0
ipdb> x = numpy.asarray(cf)
ipdb> x
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
ipdb> x.min()
1.0
ipdb> x.max()
1.0
I don't understand the entire codebase, so I don't necessarily know what the right fix is or what the purpose of the asarray() call is, but this is definitely the issue here. By the way, I'm using the latest version of numpy, 1.17.3.
I've been able to reproduce this issue with numpy and will submit an issue on their repo as well.
import numpy as np
import pandas as pd

arr = np.random.randint(2, size=10)
sparr = pd.SparseArray(arr, fill_value=0)
np_arr = np.asarray(sparr)
ipdb> arr
array([1, 0, 1, 1, 1, 0, 1, 1, 0, 0])
ipdb> sparr
[1, 0, 1, 1, 1, 0, 1, 1, 0, 0]
Fill: 0
IntIndex
Indices: array([0, 2, 3, 4, 6, 7], dtype=int32)
ipdb> np_arr
array([1, 1, 1, 1, 1, 1])
My suggestion is just to change the function from

def has_range(x):
    x = numpy.asarray(x)
    return numpy.max(x) > numpy.min(x)

to

def has_range(x):
    x = numpy.asarray(pandas.Series(x))
    return numpy.max(x) > numpy.min(x)
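(To sanity-check the suggested change, here is a minimal run against the sparse repro above, reusing the np/pd aliases, with a fixed array so the result is deterministic; has_range_fixed is just a local name for the patched function:)

import numpy as np
import pandas as pd

def has_range_fixed(x):
    # Going through a Series densifies sparse input, restoring the
    # fill values that numpy.asarray drops when applied directly.
    x = np.asarray(pd.Series(x))
    return np.max(x) > np.min(x)

sparr = pd.SparseArray(np.array([1, 0, 1, 1, 0]), fill_value=0)
print(has_range_fixed(sparr))  # True; the unpatched version returns False here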
I tried creating a branch and submitting a pull request, but I'm not super familiar with the process and might not have access.
Wow, thanks for running that down. I'll see what I can do, and see about accepting your pull request.
Thanks for mentioning the numpy version. It turns out I am using numpy 1.16.4 and pandas 0.25.0, which do not seem to show the issue you detected (which is why I didn't see it in vtreat on my end). Here is my re-run: https://github.com/WinVector/pyvtreat/blob/master/Examples/Bugs/asarray_issue.md . I am looking into taking your fix; the build files are derived, so I'll rebuild those.
I am testing a fix (based on your idea, but now a bit different from your pull request): https://github.com/WinVector/pyvtreat/tree/master/pkg/dist
Ah perfect, thank you for being so prompt! And thanks for developing vtreat in the first place, it's saved me so much time!
Even running the example code, my prepared data frame doesn't include the indicator_code variables. When I check transform.score_frame_ I see that the indicator_code variables have has_range set to False, which might be why they weren't created. Is this intended, given that the example data obviously does have multiple levels and varies? And would it perhaps help to update the example to the latest behavior? Thank you!
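(For anyone hitting the same symptom: a quick check, using the names from the example above, of whether has_range is what filtered the indicators out:)

sf = transform.score_frame_
# If the asarray bug above is in play, this lists exactly the indicator
# columns that are missing from the prepared frame.
print(sf.loc[~sf['has_range'], 'variable'].tolist())
print(sorted(set(sf['variable']) - set(d_prepared.columns)))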