awslabs / datawig

Imputation of missing values in tables.
Apache License 2.0

ValueError: fill value must be in categories #145

Closed rcruzgar closed 3 years ago

rcruzgar commented 3 years ago

Hi,

I am trying to impute numeric values from one specific column (it's called 'Comercializadora_encoded', and it is now a numeric column because I previously encoded the original object-type column with LabelEncoder() from sklearn).

These are the columns I would like to use as inputs:

--> Provincia 166203 non-null float64
--> Consumo 166203 non-null float64
--> Potencia max 166203 non-null float64

And this is the column to impute:

--> Comercializadora_encoded 163937 non-null object

This is my code:

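import datawig

# df_copy is the DataFrame holding the columns listed above
df_train, df_test = datawig.utils.random_split(df_copy)

imputer = datawig.SimpleImputer(
    input_columns=['Provincia', 'Consumo', 'Potencia max'],
    output_column='Comercializadora_encoded',
    output_path='imputer_model'
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)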

And this is the error message I am getting:

2020-11-30 09:57:37,860 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 47 occurrences of value 16.0
2020-11-30 09:57:37,860 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 40 occurrences of value 7.0
2020-11-30 09:57:37,860 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 27 occurrences of value 44.0
2020-11-30 09:57:37,865 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 23 occurrences of value 66.0
2020-11-30 09:57:37,866 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 19 occurrences of value 29.0
2020-11-30 09:57:37,868 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 18 occurrences of value 28.0
2020-11-30 09:57:37,869 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 17 occurrences of value 56.0
2020-11-30 09:57:37,870 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 17 occurrences of value 21.0
2020-11-30 09:57:37,871 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 16 occurrences of value 81.0
2020-11-30 09:57:37,872 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 16 occurrences of value 34.0
2020-11-30 09:57:37,873 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 16 occurrences of value 74.0
2020-11-30 09:57:37,874 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 13 occurrences of value 43.0
2020-11-30 09:57:37,875 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 12 occurrences of value 1.0
2020-11-30 09:57:37,876 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 9 occurrences of value 52.0
2020-11-30 09:57:37,877 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 9 occurrences of value 38.0
2020-11-30 09:57:37,878 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 9 occurrences of value 9.0
2020-11-30 09:57:37,880 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 8 occurrences of value 12.0
2020-11-30 09:57:37,881 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 8 occurrences of value 25.0
2020-11-30 09:57:37,882 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 7 occurrences of value 69.0
2020-11-30 09:57:37,884 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 7 occurrences of value 79.0
2020-11-30 09:57:37,885 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 7 occurrences of value 63.0
2020-11-30 09:57:37,886 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 7 occurrences of value 6.0
2020-11-30 09:57:37,887 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 7 occurrences of value 76.0
2020-11-30 09:57:37,888 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 6 occurrences of value 67.0
2020-11-30 09:57:37,888 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 6 occurrences of value 54.0
2020-11-30 09:57:37,889 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 5 occurrences of value 26.0
2020-11-30 09:57:37,890 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 5 occurrences of value 20.0
2020-11-30 09:57:37,890 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 5 occurrences of value 48.0
2020-11-30 09:57:37,891 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 5 occurrences of value 49.0
2020-11-30 09:57:37,892 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 5 occurrences of value 10.0
2020-11-30 09:57:37,893 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 4 occurrences of value 23.0
2020-11-30 09:57:37,894 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 4 occurrences of value 53.0
2020-11-30 09:57:37,896 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 4 occurrences of value 5.0
2020-11-30 09:57:37,897 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 4 occurrences of value 36.0
2020-11-30 09:57:37,899 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 3 occurrences of value 57.0
2020-11-30 09:57:37,900 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 3 occurrences of value 27.0
2020-11-30 09:57:37,902 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 3 occurrences of value 0.0
2020-11-30 09:57:37,903 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 3 occurrences of value 17.0
2020-11-30 09:57:37,904 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 3 occurrences of value 2.0
2020-11-30 09:57:37,906 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 45.0
2020-11-30 09:57:37,907 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 71.0
2020-11-30 09:57:37,908 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 46.0
2020-11-30 09:57:37,909 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 4.0
2020-11-30 09:57:37,910 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 50.0
2020-11-30 09:57:37,911 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 14.0
2020-11-30 09:57:37,912 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 68.0
2020-11-30 09:57:37,913 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 2 occurrences of value 22.0
2020-11-30 09:57:37,914 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 59.0
2020-11-30 09:57:37,916 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 65.0
2020-11-30 09:57:37,917 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 42.0
2020-11-30 09:57:37,919 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 72.0
2020-11-30 09:57:37,920 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 77.0
2020-11-30 09:57:37,921 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 60.0
2020-11-30 09:57:37,922 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 8.0
2020-11-30 09:57:37,923 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 3.0
2020-11-30 09:57:37,924 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 82.0
2020-11-30 09:57:37,925 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 13.0
2020-11-30 09:57:37,926 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 33.0
2020-11-30 09:57:37,927 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 15.0
2020-11-30 09:57:37,928 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 37.0
2020-11-30 09:57:37,930 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 62.0
2020-11-30 09:57:37,931 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 75.0
2020-11-30 09:57:37,932 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 40.0
2020-11-30 09:57:37,933 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 41.0
2020-11-30 09:57:37,934 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 30.0
2020-11-30 09:57:37,935 [INFO]  CategoricalEncoder for column Comercializadora_encoded                                found only 1 occurrences of value 39.0
C:\Users\rcruz\Anaconda3\lib\site-packages\pandas\core\frame.py:3509: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-55b90ff782c9> in <module>
     10 
     11 ## Fit an imputer model on the train data
---> 12 imputer.fit(train_df=df_train, num_epochs=50)
     13 
     14 ## Impute missing values and return original dataframe with predictions

~\AppData\Roaming\Python\Python38\site-packages\datawig\simple_imputer.py in fit(self, train_df, test_df, ctx, learning_rate, num_epochs, patience, test_split, weight_decay, batch_size, final_fc_hidden_units, calibrate, class_weights, instance_weights)
    384         self.output_path = self.imputer.output_path
    385 
--> 386         self.imputer = self.imputer.fit(train_df, test_df, ctx, learning_rate, num_epochs, patience,
    387                                         test_split,
    388                                         weight_decay, batch_size,

~\AppData\Roaming\Python\Python38\site-packages\datawig\imputer.py in fit(self, train_df, test_df, ctx, learning_rate, num_epochs, patience, test_split, weight_decay, batch_size, final_fc_hidden_units, calibrate)
    261             train_df, test_df = random_split(train_df, [1.0 - test_split, test_split])
    262 
--> 263         iter_train, iter_test = self.__build_iterators(train_df, test_df, test_split)
    264 
    265         self.__check_data(test_df)

~\AppData\Roaming\Python\Python38\site-packages\datawig\imputer.py in __build_iterators(self, train_df, test_df, test_split)
    590 
    591         logger.debug("Building Train Iterator with {} elements".format(len(train_df)))
--> 592         iter_train = ImputerIterDf(
    593             data_frame=train_df,
    594             data_columns=self.data_encoders,

~\AppData\Roaming\Python\Python38\site-packages\datawig\iterators.py in __init__(self, data_frame, data_columns, label_columns, batch_size)
    221         numerical_columns = [c for c in data_frame.columns if is_numeric_dtype(data_frame[c])]
    222         string_columns = list(set(data_frame.columns) - set(numerical_columns))
--> 223         data_frame = data_frame.fillna(value={x: "" for x in string_columns})
    224         data_frame = data_frame.fillna(value={x: np.nan for x in numerical_columns})
    225 

~\Anaconda3\lib\site-packages\pandas\core\frame.py in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
   4250         **kwargs
   4251     ):
-> 4252         return super().fillna(
   4253             value=value,
   4254             method=method,

~\Anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6272                         continue
   6273                     obj = result[k]
-> 6274                     obj.fillna(v, limit=limit, inplace=True, downcast=downcast)
   6275                 return result if not inplace else None
   6276 

~\Anaconda3\lib\site-packages\pandas\core\series.py in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
   4339         **kwargs
   4340     ):
-> 4341         return super().fillna(
   4342             value=value,
   4343             method=method,

~\Anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6255                     )
   6256 
-> 6257                 new_data = self._data.fillna(
   6258                     value=value, limit=limit, inplace=inplace, downcast=downcast
   6259                 )

~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in fillna(self, **kwargs)
    573 
    574     def fillna(self, **kwargs):
--> 575         return self.apply("fillna", **kwargs)
    576 
    577     def downcast(self, **kwargs):

~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 

~\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in fillna(self, value, limit, inplace, downcast)
   1950     def fillna(self, value, limit=None, inplace=False, downcast=None):
   1951         values = self.values if inplace else self.values.copy()
-> 1952         values = values.fillna(value=value, limit=limit)
   1953         return [
   1954             self.make_block_same_class(

~\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    206                 else:
    207                     kwargs[new_arg_name] = new_arg_value
--> 208             return func(*args, **kwargs)
    209 
    210         return wrapper

~\Anaconda3\lib\site-packages\pandas\core\arrays\categorical.py in fillna(self, value, method, limit)
   1871             elif is_hashable(value):
   1872                 if not isna(value) and value not in self.categories:
-> 1873                     raise ValueError("fill value must be in categories")
   1874 
   1875                 mask = codes == -1

ValueError: fill value must be in categories

I've also tried to use categorical columns as input columns, and to convert the output column into a category. Am I missing something?

Thank you very much. Regards, Rubén.

felixbiessmann commented 3 years ago

Hi Rubén,

Hm, this could be related to the output column being stored as a pandas categorical column. A categorical column raises an error if you or datawig try to set a row to a value that is not among its allowed categories (because it was not previously observed). In the traceback, datawig calls fillna with an empty string on the non-numeric columns, and "" is not one of the categories.
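A minimal sketch of that failure outside datawig, assuming a toy Series (the values are illustrative, not the data from this issue). Depending on the pandas version the exception may be a ValueError, as in the traceback above, or a TypeError, so the sketch catches both.

```python
import pandas as pd

# Toy stand-in for the target column, stored as a pandas categorical
cat = pd.Series(["16.0", "7.0", None], dtype="category")

try:
    # "" is not one of the categories, so pandas refuses the fill value
    cat.fillna("")
    error = None
except (ValueError, TypeError) as exc:
    error = exc

print(type(error).__name__, error)

# The same fill succeeds on a plain object column
obj = pd.Series(["16.0", "7.0", None], dtype="object")
filled = obj.fillna("")
print(filled.tolist())
```

The second half shows why moving away from the categorical dtype sidesteps the error: an object column accepts any fill value.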

A simple fix could be to force all your columns to be string columns instead of pandas-categorical ones, like:

for col in ['Provincia', 'Consumo', 'Potencia max', 'Comercializadora_encoded']:
    df_copy[col] = df_copy[col].astype(str)

and then train the imputer.

Let me know if that works.

Best, Felix

felixbiessmann commented 3 years ago

closing for now, feel free to reopen

Vishuvrm commented 3 years ago

Really helpful, Thanks.

ioakeim-h commented 2 years ago

Just to let people know, df[col].astype(str) will convert any np.nan values to "nan" which are not recognised as missing and will thus not be imputed.

You may resolve this by converting "nan" values back to np.nan:

import numpy as np

for col in df.columns:
    # Replace only exact "nan" strings; values that merely contain "nan" are kept
    df[col] = df[col].replace("nan", np.nan)
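The two suggestions in this thread can also be combined in one step. The sketch below is one possible way to do it, not part of datawig: `to_str_keep_missing` is a hypothetical helper and the toy frame only borrows the column names from the issue. It records which cells were missing before the cast and restores them afterwards, so a genuine string value "nan" in the data would not be mistaken for a missing one.

```python
import numpy as np
import pandas as pd


def to_str_keep_missing(df, columns):
    """Cast columns to str while leaving originally missing cells as NaN.

    Hypothetical helper for illustration; not a datawig or pandas function.
    """
    out = df.copy()
    for col in columns:
        missing = df[col].isna()
        # astype(str) can turn NaN into the literal "nan"; mask restores real NaN
        out[col] = df[col].astype(str).mask(missing, np.nan)
    return out


# Toy data: one float column and one categorical column, each with a gap
toy = pd.DataFrame({
    "Consumo": [1.5, np.nan, 3.0],
    "Comercializadora_encoded": pd.Series(["16.0", None, "7.0"], dtype="category"),
})

converted = to_str_keep_missing(toy, ["Consumo", "Comercializadora_encoded"])
print(converted.isna().sum())
```

The result can then be split and passed to the imputer as in the original post; the rows that were missing before the cast are still missing afterwards.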