Closed ntr34g closed 5 years ago
For `fit_transform()` the first argument is supposed to be a `pandas.DataFrame`, and the second is supposed to be a vector-like object whose length equals the number of rows in the DataFrame (and both should have trivial indexing). In your first example `pd.DataFrame(train_labeled[column]).all()` is going to be a logical (boolean) result, not the required DataFrame. Likely in your second example `target` is a value, not a column. Assuming you are trying a classification problem, there is a worked example here: https://github.com/WinVector/pyvtreat/blob/master/Examples/Classification/Classification.ipynb
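To make the shape requirement concrete, here is a minimal pandas-only sketch. The frame and its column names (`x1`, `x2`, `target`) are made up for illustration; it only shows why `.all()` does not yield a DataFrame and one way to build a frame and an outcome vector with matching lengths and trivial (0..n-1) indexing.

```python
import pandas as pd

# Hypothetical stand-in for train_labeled; column names are assumptions.
train_labeled = pd.DataFrame({
    "x1": ["a", "b", None, "a"],
    "x2": [1.0, 0.0, 2.5, 3.0],
    "target": [True, False, True, False],
})

# .all() reduces each column to a single boolean, so the result is a
# Series of booleans rather than a DataFrame of explanatory variables.
reduced = pd.DataFrame(train_labeled["x2"]).all()
print(type(reduced).__name__)

# A DataFrame of inputs and a same-length outcome column, both re-indexed
# to a plain 0..n-1 range.
X = train_labeled.drop(columns=["target"]).reset_index(drop=True)
y = train_labeled["target"].reset_index(drop=True)
print(X.shape, len(y))
```

A pair like `X` and `y` has the shape described above for `fit_transform()`.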
All right, so I added the target column to the first-argument DataFrame, and it now gives a 'Categorical' error:
cross_frame = plan.fit_transform(train_labeled, train_labeled['target'])
Traceback (most recent call last):
File "<ipython-input-59-9df623fdbd02>", line 1, in <module>
cross_frame = plan.fit_transform(train_labeled, train_labeled['target'])
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\__init__.py", line 233, in fit_transform
res = vtreat_impl.perform_transform(x=X, transform=self, params=self.params_)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 546, in perform_transform
new_frames = [xfi.transform(x) for xfi in plan["xforms"]]
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 546, in <listcomp>
new_frames = [xfi.transform(x) for xfi in plan["xforms"]]
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 71, in transform
sf.loc[na_posns, incoming_column_name] = "_NA_"
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\indexing.py", line 190, in __setitem__
self._setitem_with_indexer(indexer, value)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\indexing.py", line 656, in _setitem_with_indexer
value=value)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\managers.py", line 510, in setitem
return self.apply('setitem', **kwargs)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\managers.py", line 395, in apply
applied = getattr(b, f)(**kwargs)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\blocks.py", line 1752, in setitem
self.values[indexer] = value
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\arrays\categorical.py", line 2096, in __setitem__
raise ValueError("Cannot setitem on a Categorical with a new "
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
OK, so it works with smaller chunks of the large dataset (72 columns); maybe there are some limitations on column size relative to RAM or something.
I have mixed float and string categorical data in one of my columns, and it causes this error.
Yes, something is corrupted in the dataset.
Ah, thanks for the notes. They are helpful. I myself have been surprised that types can be mixed in Pandas data frames. I am going to add some checks to get better error messages on that.
Yes, my dataset has 'object' dtype columns.
Solved by preprocessing.
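For readers hitting the same situation, one way to spot mixed-type columns before calling vtreat is sketched below. This is not part of vtreat; `mixed_type_columns` is a hypothetical helper and the example frame is invented.

```python
import pandas as pd

def mixed_type_columns(frame):
    """Return {column: set of type names} for object columns that hold
    more than one Python type among their non-missing values."""
    report = {}
    for name in frame.columns:
        col = frame[name]
        if col.dtype != object:
            continue  # only 'object' columns can mix str and float values
        kinds = {type(v).__name__ for v in col.dropna()}
        if len(kinds) > 1:
            report[name] = kinds
    return report

# Invented example: one column mixes strings and floats.
example = pd.DataFrame({
    "mixed": pd.Series(["a", 1.5, "b", 2.0], dtype=object),
    "clean": pd.Series(["x", "y", "z", "w"], dtype=object),
    "num": [1.0, 2.0, 3.0, 4.0],
})
print(mixed_type_columns(example))
```

Columns reported here are candidates for the kind of preprocessing mentioned above, for example converting every value to a string.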
I hope this information helps with development of this library.
Here is the trace of the error when I used this loop:
for column in train_labeled:
    dataframe = pd.DataFrame(train_labeled[column]).reset_index()
    dataframe['target'] = target.values
    train_labeled_vtreat.append(plan.fit_transform(dataframe.iloc[:,:(len(dataframe.columns)-1)], dataframe['target']))
It worked well until it parsed a column of "category" dtype:
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
So the error is on the "category" dtype.
So I changed the categorical columns with `.astype(str)`, and the error is gone. Sorry for the mess, I'm not very experienced in pandas myself.
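A sketch of that workaround applied to a whole frame follows. `decategorize` and the column names are hypothetical. It uses `.astype(object)` rather than the `.astype(str)` from the thread, on the assumption that you want missing values to stay missing; depending on the pandas version, `.astype(str)` can turn them into the literal string "nan".

```python
import pandas as pd

def decategorize(frame):
    """Return a copy with every 'category' column converted to plain
    object dtype, leaving other columns untouched."""
    out = frame.copy()
    for name in out.select_dtypes(include="category").columns:
        out[name] = out[name].astype(object)
    return out

# Invented example frame with one categorical column containing a missing value.
train_labeled = pd.DataFrame({
    "card_type": pd.Series(["debit", None, "credit"], dtype="category"),
    "amount": [10.0, 25.5, 3.2],
})
cleaned = decategorize(train_labeled)
print(cleaned.dtypes)
```

The original frame is left unchanged, so the conversion can be applied to train and test frames separately.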
No problem. I need to add support for categorical columns.
Yerne, I really appreciate you taking the time to try this and apologize for any rough edges you are encountering. I am experimenting with more aggressive pre-processing of the DataFrame in the next version of vtreat (`.astype(str)` as part of the `vtreat` built-in conversions). If you are interested you can preview it with `pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.2.3.tar.gz` (you may need to `pip uninstall vtreat` first to get the new version to install).
Yeah, I will try it when I get a spare moment.
OK, I uninstalled the old version, installed the new one, and disabled this check:
if train_labeled[column].values.dtype.name =="category":
    train_labeled[column]= train_labeled[column].astype(str)
Then I ran transform on the fitted plan:
plan = vtreat.BinomialOutcomeTreatment(outcome_target=False)
train_labeled_vtreat = dict()
train_labeled_vtreat_t = dict()
for column, col_t in zip(train_labeled, test_labeled):
    # if train_labeled[column].values.dtype.name =="category":
    #     train_labeled[column]= train_labeled[column].astype(str)
    # else:
    #     pass
    # if test_labeled[col_t].values.dtype.name =="category":
    #     test_labeled[col_t]= test_labeled[column].astype(str)
    # else:
    #     pass
    dataframe = pd.DataFrame(train_labeled[column]).reset_index()
    dataframe_t = pd.DataFrame(test_labeled[col_t]).reset_index()
    del dataframe['TransactionID']
    del dataframe_t['TransactionID']
    dataframe['target'] = target.values
    print(column)
    train_labeled_vtreat[column] = plan.fit_transform(dataframe.iloc[:,:1], dataframe['target'])
    train_labeled_vtreat_t[col_t] = plan.transform(dataframe_t)
It gave this error:
Traceback (most recent call last):
File "<ipython-input-3-d03451cc70c1>", line 23, in <module>
train_labeled_vtreat_t[col_t] = plan.transform(dataframe_t)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\__init__.py", line 207, in transform
res = vtreat_impl.perform_transform(x=X, transform=self, params=self.params_)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 562, in perform_transform
new_frames = [xfi.transform(x) for xfi in plan["xforms"]]
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 562, in <listcomp>
new_frames = [xfi.transform(x) for xfi in plan["xforms"]]
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 71, in transform
sf.loc[na_posns, incoming_column_name] = "_NA_"
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\indexing.py", line 190, in __setitem__
self._setitem_with_indexer(indexer, value)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\indexing.py", line 656, in _setitem_with_indexer
value=value)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\managers.py", line 510, in setitem
return self.apply('setitem', **kwargs)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\managers.py", line 395, in apply
applied = getattr(b, f)(**kwargs)
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\internals\blocks.py", line 1752, in setitem
self.values[indexer] = value
File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\pandas\core\arrays\categorical.py", line 2096, in __setitem__
raise ValueError("Cannot setitem on a Categorical with a new "
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
The current Python `vtreat` isn't prepared to leave categorical columns as categorical (though that may make sense to add as a feature). The line you disabled isn't so much a check as the code converting the categorical column to strings. The error you see triggered later is when `vtreat` then tries to replace a missing value with a new sentinel level `"_NA_"`. In `vtreat` all transform result columns are numeric, so a string or categorical column is not going to be directly in the result frame anyway (unless it is one of the "columns to copy", which are copied but not transformed). The Python version of `vtreat` is a new package, but it follows some of the ideas and semantics of the original `R` version of the package, and having all transformed columns be numeric is a big part of the package intent.
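The failure mode described above can be reproduced with pandas alone. The sketch below is not vtreat code; it only shows that writing a level that is not among a Categorical's categories is rejected, and that registering the level first (or converting the column to strings or objects beforehand) avoids the error. The exception class is assumed to vary by pandas version (`ValueError` in the trace above, `TypeError` in newer releases), so both are caught.

```python
import pandas as pd

# Writing an unknown level into a Categorical is refused by pandas.
cat = pd.Categorical(["a", None, "b"])
try:
    cat[1] = "_NA_"
    raised = False
except (TypeError, ValueError) as exc:
    raised = True
    print(type(exc).__name__, exc)

# Registering the sentinel level first makes the replacement legal.
col = pd.Series(["a", None, "b"], dtype="category")
filled = col.cat.add_categories("_NA_").fillna("_NA_")
print(filled.tolist())
```

Either approach gives the missing-value sentinel somewhere to go; converting to strings is the route the thread settled on.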
I've tried to improve the README to make this more evident.
Thanks, your library and explanations are helpful to learn and make machine learning pipelines.
When I tried to fit_transform on a DataFrame I got this:
or this: