WinVector / pyvtreat

vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under a BSD-3-Clause license.
https://winvector.github.io/pyvtreat/

error on DataFrame #2

Closed ntr34g closed 5 years ago

ntr34g commented 5 years ago

When I tried to call fit_transform on a DataFrame, I got this:


  File "<ipython-input-34-4749a25525c1>", line 2, in <module>
    train_labeled_vtreat.append(plan.fit_transform(pd.DataFrame(train_labeled[column]).all(), target))

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\__init__.py", line 297, in fit_transform
    raise Exception("X should be a Pandas DataFrame")

Exception: X should be a Pandas DataFrame

or this:

cross_frame = plan.fit_transform(train_labeled, target)
Traceback (most recent call last):

  File "<ipython-input-35-e8d2bce0ab6b>", line 1, in <module>
    cross_frame = plan.fit_transform(train_labeled, target)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\__init__.py", line 315, in fit_transform
    params=self.params_,

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 460, in fit_multinomial_outcome_treatment
    params=params,

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 176, in fit_binomial_impact_code
    sf = vtreat.util.grouped_by_x_statistics(x, y)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\util.py", line 31, in grouped_by_x_statistics
    if n != len(y):

TypeError: len() of unsized object

JohnMount commented 5 years ago

For fit_transform() the first argument is supposed to be a pandas.DataFrame and the second is supposed to be a vector-like object whose length equals the number of rows in the DataFrame (and both should have trivial indexing).

In your first example, pd.DataFrame(train_labeled[column]).all() is going to be a logical result (one boolean per column), not the required DataFrame. In your second example, target is likely a single value, not a column. Assuming you are working on a classification problem, there is a worked example here: https://github.com/WinVector/pyvtreat/blob/master/Examples/Classification/Classification.ipynb
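
The two input requirements above can be checked before calling fit_transform(). The following is a minimal sketch using only pandas; the toy data and the helper name check_fit_inputs are made up for illustration and are not part of vtreat.

```python
import pandas as pd

# Toy data; the column names are invented for this example.
train = pd.DataFrame({"x": ["a", "b", None, "a"], "n": [1.0, 2.0, 3.0, 4.0]})
target = pd.Series([True, False, True, False])

# DataFrame.all() reduces each column to a single boolean,
# so the result is a Series, not a DataFrame.
reduced = pd.DataFrame(train["n"]).all()
print(type(reduced).__name__)  # Series


def check_fit_inputs(X, y):
    """Hypothetical pre-flight check; not part of vtreat."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("X should be a pandas DataFrame")
    try:
        n = len(y)
    except TypeError:
        raise TypeError("y should be vector-like, not a scalar") from None
    if n != X.shape[0]:
        raise ValueError(f"y has {n} entries but X has {X.shape[0]} rows")
    # "Trivial indexing" is taken here to mean rows numbered 0..n-1 in order.
    X = X.reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    return X, y


X, y = check_fit_inputs(train, target)
```

Passing a scalar as y, or the boolean Series from .all() as X, makes the helper raise immediately instead of failing deep inside the fitting code.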

ntr34g commented 5 years ago

All right, so I added the target column to the first-argument DataFrame, and it now gives a 'Categorical' error:



cross_frame = plan.fit_transform(train_labeled, train_labeled['target'])
Traceback (most recent call last):

  File "<ipython-input-59-9df623fdbd02>", line 1, in <module>
    cross_frame = plan.fit_transform(train_labeled, train_labeled['target'])

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\__init__.py", line 233, in fit_transform
    res = vtreat_impl.perform_transform(x=X, transform=self, params=self.params_)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 546, in perform_transform
    new_frames = [xfi.transform(x) for xfi in plan["xforms"]]

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 546, in <listcomp>
    new_frames = [xfi.transform(x) for xfi in plan["xforms"]]

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 71, in transform
    sf.loc[na_posns, incoming_column_name] = "_NA_"

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\indexing.py", line 190, in __setitem__
    self._setitem_with_indexer(indexer, value)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\indexing.py", line 656, in _setitem_with_indexer
    value=value)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\managers.py", line 510, in setitem
    return self.apply('setitem', **kwargs)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\managers.py", line 395, in apply
    applied = getattr(b, f)(**kwargs)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\blocks.py", line 1752, in setitem
    self.values[indexer] = value

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\arrays\categorical.py", line 2096, in __setitem__
    raise ValueError("Cannot setitem on a Categorical with a new "

ValueError: Cannot setitem on a Categorical with a new category, set the categories first

ntr34g commented 5 years ago

OK, so it works with smaller chunks of the large dataset (72 columns). Maybe there is some limit related to the number of columns or to RAM, or something like that. I have mixed float and string categorical data in one of my columns, and it causes this error.

ntr34g commented 5 years ago

Yes, something is corrupted in the dataset.

JohnMount commented 5 years ago

Ah, thanks for the notes. They are helpful. I myself have been surprised that types can be mixed in Pandas data frames. I am going to add some checks to get better error messages on that.
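
One way to spot columns that mix value types before handing a frame to vtreat is to look at the Python types of the non-missing entries. This is a rough sketch with a hypothetical helper name (mixed_type_columns) and invented data; it is not the check that vtreat itself performs.

```python
import pandas as pd


def mixed_type_columns(df):
    """Map each column holding more than one value type to its set of type names.

    Missing values are ignored. Hypothetical helper, not part of vtreat.
    """
    report = {}
    for col in df.columns:
        kinds = {type(v).__name__ for v in df[col].dropna()}
        if len(kinds) > 1:
            report[col] = kinds
    return report


frame = pd.DataFrame({
    "mixed": ["a", 1.5, "b", None],  # strings and a float in one column
    "clean": ["u", "v", "w", "x"],
    "num": [1.0, 2.0, 3.0, 4.0],
})
print(mixed_type_columns(frame))
```

Only the column named "mixed" is reported here, since it holds both str and float values.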

ntr34g commented 5 years ago

Yes, my dataset has 'object' dtype columns. Solved by preprocessing.

ntr34g commented 5 years ago

I hope this information will help with the development of this library.

ntr34g commented 5 years ago

Here is the error trace I got when I used this loop:

for column in train_labeled:
    dataframe = pd.DataFrame(train_labeled[column]).reset_index()
    dataframe['target'] = target.values
    train_labeled_vtreat.append(plan.fit_transform(dataframe.iloc[:,:(len(dataframe.columns)-1)], dataframe['target']))

It worked well until it reached a column of "category" dtype:

ValueError: Cannot setitem on a Categorical with a new category, set the categories first


So the error is on the "category" dtype.

ntr34g commented 5 years ago

So I converted the categorical columns with ".astype(str)", and the error is gone. Sorry for the mess, I'm not very experienced in pandas myself.
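
The workaround described above can be applied to every "category" column at once. The following is a minimal sketch with a hypothetical helper name (categoricals_to_str) and toy data; it works on a copy and leaves other columns untouched.

```python
import pandas as pd


def categoricals_to_str(df):
    """Return a copy of df with every 'category' column converted via astype(str)."""
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].astype(str)
    return out


frame = pd.DataFrame({
    "c": pd.Categorical(["a", "b", "a"]),
    "n": [1.0, 2.0, 3.0],
})
converted = categoricals_to_str(frame)
print(converted.dtypes)
```

It is worth checking how missing values in those columns look after the conversion, since the string form of a missing value may vary with the pandas version.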

JohnMount commented 5 years ago

No problem. I need to add support for categorical columns.

JohnMount commented 5 years ago

Yerne, I really appreciate you taking the time to try this and apologize for any rough edges you are encountering. I am experimenting with more aggressive pre-processing of the DataFrame in the next version of vtreat (.astype(str) as part of the vtreat built-in conversions). If you are interested, you can preview it with pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.2.3.tar.gz (you may need to pip uninstall vtreat first to get the new version to install).

ntr34g commented 5 years ago

Yeah, I will try it when I get a spare moment.

ntr34g commented 5 years ago

OK, I uninstalled the old version, installed the new one, and disabled this check:

if train_labeled[column].values.dtype.name =="category":
    train_labeled[column]= train_labeled[column].astype(str)

Then I ran transform on the plan after fit_transform:

plan = vtreat.BinomialOutcomeTreatment(outcome_target=False)

train_labeled_vtreat = dict()
train_labeled_vtreat_t = dict()
for column, col_t in zip(train_labeled, test_labeled):
#    if train_labeled[column].values.dtype.name =="category":
#        train_labeled[column]= train_labeled[column].astype(str)
#    else:
#       pass
#  if test_labeled[col_t].values.dtype.name =="category":
#        test_labeled[col_t]= test_labeled[column].astype(str)
#   else:
#      pass
    dataframe = pd.DataFrame(train_labeled[column]).reset_index()
    dataframe_t = pd.DataFrame(test_labeled[col_t]).reset_index()
    del dataframe['TransactionID']
    del dataframe_t['TransactionID']
    dataframe['target'] = target.values
    print(column)
    train_labeled_vtreat[column] = plan.fit_transform(dataframe.iloc[:,:1], dataframe['target'])
    train_labeled_vtreat_t[col_t] = plan.transform(dataframe_t)

It gave this error:

Traceback (most recent call last):

  File "<ipython-input-3-d03451cc70c1>", line 23, in <module>
    train_labeled_vtreat_t[col_t] = plan.transform(dataframe_t)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\__init__.py", line 207, in transform
    res = vtreat_impl.perform_transform(x=X, transform=self, params=self.params_)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 562, in perform_transform
    new_frames = [xfi.transform(x) for xfi in plan["xforms"]]

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 562, in <listcomp>
    new_frames = [xfi.transform(x) for xfi in plan["xforms"]]

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 71, in transform
    sf.loc[na_posns, incoming_column_name] = "_NA_"

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\indexing.py", line 190, in __setitem__
    self._setitem_with_indexer(indexer, value)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\indexing.py", line 656, in _setitem_with_indexer
    value=value)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\managers.py", line 510, in setitem
    return self.apply('setitem', **kwargs)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\managers.py", line 395, in apply
    applied = getattr(b, f)(**kwargs)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\blocks.py", line 1752, in setitem
    self.values[indexer] = value

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\arrays\categorical.py", line 2096, in __setitem__
    raise ValueError("Cannot setitem on a Categorical with a new "

ValueError: Cannot setitem on a Categorical with a new category, set the categories first

JohnMount commented 5 years ago

The current Python vtreat isn't prepared to leave categorical columns as categorical (though that may make sense to add as a feature). The line you disabled isn't so much a check as the code that converts the categorical column to strings. The error you see triggered later occurs when vtreat then tries to replace a missing value with a new sentinel level "_NA_". In vtreat all transform result columns are numeric, so a string or categorical column is not going to be directly in the result frame anyway (unless it is one of the "columns to copy", which are copied but not transformed). The Python version of vtreat is a new package, but it follows some of the ideas and semantics of the original R version of the package, and "all transformed columns are numeric" is a big part of the package intent.
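
The sentinel failure can be reproduced with pandas alone. The sketch below uses toy data and is not vtreat's code: it shows that a Categorical rejects a value outside its declared categories, and that declaring the level first (as the error message suggests) lets the fill succeed.

```python
import pandas as pd

# A categorical with a missing entry; only "a" and "b" are declared levels.
cat = pd.Categorical(["a", None, "b"])

# Writing an undeclared level is rejected. The exception type has varied
# across pandas versions, so both candidates are caught here.
try:
    cat[1] = "_NA_"
    rejected = False
except (TypeError, ValueError) as err:
    rejected = True
    print(type(err).__name__, err)

# Declaring the sentinel level first makes the replacement legal.
s = pd.Series(["a", None, "b"], dtype="category")
filled = s.cat.add_categories(["_NA_"]).fillna("_NA_")
print(list(filled))  # ['a', '_NA_', 'b']
```

Converting the column to strings, as described above, avoids the problem in a different way: a plain string column has no fixed set of levels, so any sentinel can be written into it.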

I've tried to improve the README to make this more evident.

ntr34g commented 5 years ago

Thanks, your library and explanations are helpful for learning about and building machine learning pipelines.