AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License
353 stars · 31 forks

Using mean_match_candidates different from zero with categorical variables generates an error #49

Closed: KaikeWesleyReis closed this issue 2 years ago

KaikeWesleyReis commented 2 years ago

Hi,

Another issue I found with categorical variable imputation (category dtype): setting mean_match_candidates != 0 in the kernel definition causes an error during .mice().

Error message:

/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:377: RuntimeWarning: divide by zero encountered in true_divide
  odds_ratio = probability / (1 - probability)
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:378: RuntimeWarning: divide by zero encountered in log
  log_odds = np.log(odds_ratio)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-106-6f8a47b02d4b> in <module>
      9                                    mean_match_candidates=2)
     10 # Run the MICE algorithm for X iterations - 12:09 - 13:22
---> 11 a.mice(iterations=2, verbose=True, n_jobs=-1)

/opt/conda/lib/python3.7/site-packages/miceforest/ImputationKernel.py in mice(self, iterations, verbose, variable_parameters, compile_candidates, **kwlgb)
   1193                                 random_state=self._random_state,
   1194                                 hashed_seeds=None,
-> 1195                                 candidate_preds=candidate_preds,
   1196                             )
   1197                         )

/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in mean_match_function_kdtree_cat(mmc, model, bachelor_features, candidate_values, random_state, hashed_seeds, candidate_preds)
    361                 candidate_values,
    362                 random_state,
--> 363                 hashed_seeds,
    364             )
    365 

/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in _mean_match_multiclass_accurate(mmc, bachelor_preds, candidate_preds, candidate_values, random_state, hashed_seeds)
    119 
    120     index_choice = knn_indices[np.arange(knn_indices.shape[0]), ind]
--> 121     imp_values = candidate_values[index_choice]
    122 
    123     return imp_values

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
    964             return self._get_values(key)
    965 
--> 966         return self._get_with(key)
    967 
    968     def _get_with(self, key):

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in _get_with(self, key)
    999             #  (i.e. self.iloc) or label-based (i.e. self.loc)
   1000             if not self.index._should_fallback_to_positional():
-> 1001                 return self.loc[key]
   1002             else:
   1003                 return self.iloc[key]

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1151                     raise ValueError("Cannot index with multidimensional key")
   1152 
-> 1153                 return self._getitem_iterable(key, axis=axis)
   1154 
   1155             # nested tuple slicing

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1091 
   1092         # A collection of keys
-> 1093         keyarr, indexer = self._get_listlike_indexer(key, axis)
   1094         return self.obj._reindex_with_indexers(
   1095             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377             raise KeyError(f"{not_found} not in index")
   1378 
   1379 

KeyError: '[129846] not in index'

What I found is that the default mean_match_scheme was causing the issue (because it runs mean matching on categorical features), and that using miceforest.mean_match_schemes.mean_match_scheme_fast_cat works around it.

AnotherSamWilson commented 2 years ago

Can you show a reproducible example? There are tests that use mmc == 0 and mmc != 0 for both of the mean matching schemes, for categorical and numeric variables.

KaikeWesleyReis commented 2 years ago

I tried to reproduce the error with the iris dataset, but without success. Here is the example:

import pandas as pd
import miceforest as mf
from sklearn.datasets import load_iris
# Load data and introduce missing values
iris = pd.concat(load_iris(as_frame=True, return_X_y=True), axis=1)
iris.rename({"target": "species"}, inplace=True, axis=1)
# Create a dummy column
iris['species2'] = iris['species'].copy()
iris.loc[len(iris)-1, 'species2'] = 'rare cat'
iris.loc[len(iris)-2, 'species2'] = 'rare cat'
iris.loc[len(iris)-3, 'species2'] = 'rare cat another'
# Convert to category type
iris['species'] = iris['species'].astype('category')
iris['species2'] = iris['species2'].astype('category')
iris_amp = mf.ampute_data(iris,perc=0.35,random_state=1991)
# Create kernel. 
kds = mf.ImputationKernel(data=iris_amp,
                          datasets=5,
                          save_models=2,
                          random_state=1206,
                          train_nonmissing=True,
                          save_all_iterations=True,
                          save_loggers=True,
                          #mean_match_scheme=mf.mean_match_schemes.mean_match_scheme_fast_cat,
                          mean_match_candidates=5)
# Run the MICE algorithm for 2 iterations
kds.mice(2, verbose=True, n_jobs=-1)

Side note - @AnotherSamWilson, does n_jobs=-1 in mice() still work as intended (using all CPU cores to execute the imputation procedure)?

The example above works just fine, so it is possibly related to the categorical column in my own dataset. Here is its value_counts(); I believe that, given the rare categories and the number of categories I have here:

[screenshot: value_counts() of the problematic column]

an error is generated by the multiclass prediction logic of the function _mean_match_multiclass_accurate.

Anyway, thanks for your reply here.


AnotherSamWilson commented 2 years ago

I actually found a very rare error that occurs for category columns when mmc = 0 if there is a lot of data. Because mmc predictions are stored as float16, it can cause the predictions to add up to 0.9995 instead of 1.0. If a random number is then chosen above 0.9995, but below 1.0, it fails because it couldn't find the index.

In your real code that is failing, if you specify the prediction_dtypes parameter to be float32 for the failing column, does it fix the problem?
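The float16 gap described above is easy to see in isolation. This is a sketch, not miceforest's actual code path: float16 has a spacing of 2**-11 just below 1.0, so a class-probability CDF stored in float16 can top out at 0.9995, and a uniform draw landing in the gap maps past the last class.

```python
import numpy as np

# Largest float16 value a CDF can round down to just below 1.0
top = np.float16(0.9995)
print(float(top))                           # 0.99951171875
print(1.0 - float(top))                     # ~0.000488 gap below 1.0

# A uniform draw inside that gap maps past the last class (made-up CDF):
cdf = np.array([0.3, 0.6, float(top)])
u = 0.9999                                  # lands in the (0.9995, 1.0) gap
idx = np.searchsorted(cdf, u)
print(idx)                                  # 3 == len(cdf): no valid class
```

An index equal to the array length then fails the downstream lookup, which is consistent with the `KeyError: '[...] not in index'` in the traceback.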

KaikeWesleyReis commented 2 years ago

> I actually found a very rare error that occurs for category columns when mmc = 0 if there is a lot of data. Because mmc predictions are stored as float16, it can cause the predictions to add up to 0.9995 instead of 1.0. If a random number is then chosen above 0.9995, but below 1.0, it fails because it couldn't find the index.
>
> In your real code that is failing, if you specify the prediction_dtypes parameter to be float32 for the failing column, does it fix the problem?

I tried as you recommended, but without success:

    # Define Imputer
    aux = mf.ImputationKernel(data=x_train_raw,
                              datasets=5,
                              prediction_dtypes=pred_dtypes,
                              save_models=2,
                              random_state=1206,
                              train_nonmissing=True,
                              save_all_iterations=True,
                              save_loggers=True,
                              #mean_match_scheme=mf.mean_match_schemes.mean_match_scheme_fast_cat,
                              mean_match_candidates=5)
    # Run the MICE algorithm for X iterations
    aux.mice(iterations=5, verbose=True, n_jobs=-1)

Where pred_dtypes is defined as {'F1': 'float32', 'F2': 'float32'}.

From my perspective it is an issue related to multiclass prediction on a lot of data (my case) with rare categories. Before the failure, this happens:

/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:377: RuntimeWarning: divide by zero encountered in true_divide
  odds_ratio = probability / (1 - probability)
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:378: RuntimeWarning: divide by zero encountered in log
  log_odds = np.log(odds_ratio)

So from my perspective, the rare categories are causing several divisions by zero.
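The two warnings quoted above are straightforward to reproduce in isolation. This sketch (not miceforest's code) shows how degenerate probabilities produce them: a predicted probability of exactly 1.0 makes the odds ratio infinite, and a probability of exactly 0.0 makes log(0) negative infinity.

```python
import numpy as np

probability = np.array([0.2, 1.0, 0.0])
with np.errstate(divide="ignore"):              # same warnings as utils.py
    odds_ratio = probability / (1 - probability)  # inf where p == 1.0
    log_odds = np.log(odds_ratio)                 # -inf where p == 0.0
print(odds_ratio)   # finite only for the 0.2 entry
print(log_odds)
```

So any class that lightgbm scores at exactly 0% or 100% triggers those RuntimeWarnings before the eventual KeyError.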

EDIT - I just removed the problematic column (with rare categories) and mice fails on another column with only 3 categories, so I withdraw my previous suspicion.

AnotherSamWilson commented 2 years ago

Hmm, that means that lightgbm output 1.0 and 0.0 probabilities somewhere. As a sanity check, can you confirm that your non-missing categorical data has at least 1 recognized value for each category that is passed? I'm wondering if lightgbm will say a probability is 0% if there is an associated class in the CategoricalDtype, but no actual values in the data.

Can you also confirm which version of miceforest you are running?
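A quick way to run that sanity check (a sketch; the column and values here are made up): for a categorical Series, value_counts() includes declared-but-unobserved levels with a count of zero, so empty levels are easy to list.

```python
import pandas as pd

# Category "c" is declared in the dtype but never appears in the data
s = pd.Series(pd.Categorical(["a", "b", None, "a"], categories=["a", "b", "c"]))
counts = s.value_counts()                     # zero-count levels included
empty_levels = counts[counts == 0].index.tolist()
print(empty_levels)                           # ['c']
```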

KaikeWesleyReis commented 2 years ago

> Hmm, that means that lightgbm output 1.0 and 0.0 probabilities somewhere. As a sanity check, can you confirm that your non-missing categorical data has at least 1 recognized value for each category that is passed? I'm wondering if lightgbm will say a probability is 0% if there is an associated class in the CategoricalDtype, but no actual values in the data.
>
> Can you also confirm which version of miceforest you are running?

I think I just found the issue: rare categories. I executed the code after removing two categorical columns that have rare categories (one column has 3 values and the other 12 values across more than 180k records) and it works just fine (no need for prediction_dtypes either).

My version is 5.5.4

This raises some questions for me: when you construct the classifiers for multiclass columns, is there any class_weight to avoid imbalanced scenarios? And why does a rare category generate this problem?

AnotherSamWilson commented 2 years ago

There is no weight included by default, but you can specify a positive and negative weight by passing them as lightgbm parameters. Keep in mind, if mean matching candidates == 0, mean matching will generate imputation values based on the lightgbm class probabilities, so you may be over-sampling an imbalanced class if you do this.

Unless you have data_subset set to something other than 1.0, I don't know why it would be failing for rare categories. I need to do some experimenting. The only thing I can think of is that lightgbm starts those categories off at such a low value, that it gets rounded down to 0. Thanks for bringing this up though.
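To illustrate the over-sampling point above, here is a sketch (not miceforest's exact code; the class names and probabilities are made up): with mean matching candidates == 0 a categorical imputation can be drawn from the predicted class probabilities, whereas an argmax rule always returns the majority class.

```python
import numpy as np

rng = np.random.default_rng(42)
probs = np.array([0.90, 0.07, 0.03])          # hypothetical imbalanced prediction
classes = np.array(["common", "rare", "rarer"])

drawn = classes[rng.choice(len(probs), size=10, p=probs)]
print(classes[np.argmax(probs)])              # always "common"
print(drawn)                                  # mostly "common", sometimes rare
```

If class weights inflate the rare classes' probabilities, sampling this way would impute the rare classes more often than they occur in the data.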

KaikeWesleyReis commented 2 years ago

> There is no weight included by default, but you can specify a positive and negative weight by passing them as lightgbm parameters. Keep in mind, if mean matching candidates == 0, mean matching will generate imputation values based on the lightgbm class probabilities, so you may be over-sampling an imbalanced class if you do this.
>
> Unless you have data_subset set to something other than 1.0, I don't know why it would be failing for rare categories. I need to do some experimenting. The only thing I can think of is that lightgbm starts those categories off at such a low value, that it gets rounded down to 0. Thanks for bringing this up though.

Glad to help. Two questions, though:

* Does `n_jobs` work as defined in `mice()`?

* When we have categorical columns, as in my case, and we predict a numerical column, how are those categorical columns transformed into model features?

AnotherSamWilson commented 2 years ago

> Glad to help. Two questions, though:
>
> * Does `n_jobs` work as defined in `mice()`?
>
> * When we have categorical columns, as in my case, and we predict a numerical column, how are those categorical columns transformed into model features?

By default lightgbm uses all threads. You can set n_jobs to change the thread count; setting n_jobs=-1 will not do anything. Lightgbm does not transform categorical features, it treats them as categories. You can read about it [here](https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features).

KaikeWesleyReis commented 2 years ago

But I did set n_jobs=-1. Is that the default?

AnotherSamWilson commented 2 years ago

Yes I believe that is the default in lightgbm. There are other things that will affect thread usage, like environment variables and how lightgbm was built. But typically, either passing no kwargs to mice or passing n_jobs=-1 will utilize all threads.

KaikeWesleyReis commented 2 years ago

> Unless you have data_subset set to something other than 1.0 ...

What do you mean by data_subset? Where can I find more about this argument?

AnotherSamWilson commented 2 years ago

See the ImputationKernel docs: https://miceforest.readthedocs.io/en/latest/ik/miceforest.ImputationKernel.html#miceforest.ImputationKernel

KaikeWesleyReis commented 2 years ago

> See the ImputationKernel docs: https://miceforest.readthedocs.io/en/latest/ik/miceforest.ImputationKernel.html#miceforest.ImputationKernel

I changed this parameter to an integer >= mean_match_candidates and it executes correctly:

    # Define Imputer
    mice_imputer = mf.ImputationKernel(data=x_train_raw,
                                       datasets=5,
                                       data_subset=5,
                                       save_models=2,                
                                       random_state=1206,
                                       train_nonmissing=True,
                                       save_all_iterations=True,
                                       save_loggers=True,
                                       #mean_match_scheme=mf.mean_match_schemes.mean_match_scheme_fast_cat,
                                       mean_match_candidates=5)
    # Run the MICE algorithm for X iterations
    mice_imputer.mice(iterations=5, verbose=True, n_jobs=-1)

I will investigate further to understand this parameter as well.

PS - data_subset = 1.0 fails the code

AnotherSamWilson commented 2 years ago

> PS - data_subset = 1.0 fails the code

Can you show me the error? FYI the data_subset will subset the training data too, so setting it to 5 is probably way too low.

KaikeWesleyReis commented 2 years ago

> PS - data_subset = 1.0 fails the code
>
> Can you show me the error? FYI the data_subset will subset the training data too, so setting it to 5 is probably way too low.

/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:377: RuntimeWarning: divide by zero encountered in true_divide
  odds_ratio = probability / (1 - probability)
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:378: RuntimeWarning: divide by zero encountered in log
  log_odds = np.log(odds_ratio)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-274-0a5b5f4d895a> in <module>
     12                                    mean_match_candidates=5)
     13 # Run the MICE algorithm for X iterations
---> 14 mice_imputer.mice(iterations=5, verbose=True, n_jobs=-1)

/opt/conda/lib/python3.7/site-packages/miceforest/ImputationKernel.py in mice(self, iterations, verbose, variable_parameters, compile_candidates, **kwlgb)
   1193                                 random_state=self._random_state,
   1194                                 hashed_seeds=None,
-> 1195                                 candidate_preds=candidate_preds,
   1196                             )
   1197                         )

/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in mean_match_function_kdtree_cat(mmc, model, bachelor_features, candidate_values, random_state, hashed_seeds, candidate_preds)
    361                 candidate_values,
    362                 random_state,
--> 363                 hashed_seeds,
    364             )
    365 

/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in _mean_match_multiclass_accurate(mmc, bachelor_preds, candidate_preds, candidate_values, random_state, hashed_seeds)
    119 
    120     index_choice = knn_indices[np.arange(knn_indices.shape[0]), ind]
--> 121     imp_values = candidate_values[index_choice]
    122 
    123     return imp_values

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
    964             return self._get_values(key)
    965 
--> 966         return self._get_with(key)
    967 
    968     def _get_with(self, key):

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in _get_with(self, key)
    999             #  (i.e. self.iloc) or label-based (i.e. self.loc)
   1000             if not self.index._should_fallback_to_positional():
-> 1001                 return self.loc[key]
   1002             else:
   1003                 return self.iloc[key]

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1151                     raise ValueError("Cannot index with multidimensional key")
   1152 
-> 1153                 return self._getitem_iterable(key, axis=axis)
   1154 
   1155             # nested tuple slicing

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1091 
   1092         # A collection of keys
-> 1093         keyarr, indexer = self._get_listlike_indexer(key, axis)
   1094         return self.obj._reindex_with_indexers(
   1095             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377             raise KeyError(f"{not_found} not in index")
   1378 
   1379 

KeyError: '[129846] not in index'

And the code:

    # Define Imputer
    mice_imputer = mf.ImputationKernel(data=x_train_raw,
                                       datasets=5,
                                       data_subset=1.0,
                                       prediction_dtypes=pred_dtypes,
                                       save_models=2,                
                                       random_state=1206,
                                       train_nonmissing=True,
                                       save_all_iterations=True,
                                       save_loggers=True,
                                       #mean_match_scheme=mf.mean_match_schemes.mean_match_scheme_fast_cat,
                                       mean_match_candidates=5)
    # Run the MICE algorithm for X iterations
    mice_imputer.mice(iterations=5, verbose=True, n_jobs=-1)

Even when prediction_dtypes is 'float32' for all columns.

AnotherSamWilson commented 2 years ago

@KaikeWesleyReis If you wouldn't mind, can you pull the MeanMatchScheme branch of this repo and build the package locally, and test for your use case?

To do this, the following commands should work in the terminal (Windows):

git clone https://github.com/AnotherSamWilson/miceforest.git
cd miceforest
git checkout MeanMatchScheme

python setup.py sdist
pip install dist/miceforest-5.6.0.tar.gz

On Linux, the commands should be similar.

EDIT - I should also note that this version changes how mean matching is controlled. There is now a MeanMatchScheme class, which you probably won't need to mess with. The README for this branch has updated examples of how to work with the new structure. Otherwise, things are pretty much the same as they were.

KaikeWesleyReis commented 2 years ago

> @KaikeWesleyReis If you wouldn't mind, can you pull the MeanMatchScheme branch of this repo and build the package locally, and test for your use case?

Hi, I had problems installing this version in a SageMaker notebook. Is it possible to install it with a pip command in the notebook?

AnotherSamWilson commented 2 years ago

I just did it successfully with this:

pip install git+https://github.com/AnotherSamWilson/miceforest.git@MeanMatchScheme

KaikeWesleyReis commented 2 years ago

> I just did it successfully with this:
>
> pip install git+https://github.com/AnotherSamWilson/miceforest.git@MeanMatchScheme

Hahah, I discovered this (again today) :D

I executed a test and got an index error:

[screenshot: index error traceback]

AnotherSamWilson commented 2 years ago

The only way I can reproduce that error is if somehow the candidate_values is shorter than the candidate_preds. But I don't see how that is possible, both are returned from _make_features_label and should be the same size. Can you show me the script that threw that error?

Can you also show me the result of data.isnull().sum(0)?

KaikeWesleyReis commented 2 years ago

> The only way I can reproduce that error is if somehow the candidate_values is shorter than the candidate_preds. But I don't see how that is possible, both are returned from _make_features_label and should be the same size. Can you show me the script that threw that error?

Hi @AnotherSamWilson, I can try to execute that. But is it possible to insert a print statement in that branch to check the return of _make_features_label?

From my perspective, what could be happening is a mismatch between the categories declared on my column (for example X1, X2 and X3) and the values actually present in the training data (for example only X1 and X2), given that I create the category dtype BEFORE the train/test split.
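That mismatch is easy to demonstrate with made-up data: declaring the category dtype before a train/test split leaves the training slice with declared levels that have no actual values.

```python
import pandas as pd

# Category dtype created BEFORE the split
s = pd.Series(["X1", "X2", "X1", "X3"]).astype("category")
train = s.iloc[:3]                        # hypothetical split; "X3" held out
print(train.cat.categories.tolist())      # ['X1', 'X2', 'X3'] still declared
print(train.value_counts().to_dict())     # {'X1': 2, 'X2': 1, 'X3': 0}
```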

The code that I'm using:

# Define Imputer
mice_imputer = mf.ImputationKernel(data=x_train_raw,
                                       datasets=5,
                                       data_subset=1.0,
                                       save_models=2,                
                                       random_state=1206,
                                       train_nonmissing=True,
                                       save_all_iterations=True,
                                       save_loggers=True)
                                       #mean_match_scheme=mms.mean_match_scheme_fast_cat,
                                       #mean_match_candidates=5)
# Run the MICE algorithm for X iterations
mice_imputer.mice(iterations=5, verbose=True, n_jobs=-1)

EDIT - @AnotherSamWilson what is the idea behind data.isnull().sum(0)?

EDIT 2 - As you requested, an example of a categorical column that fails the imputation with the index error:

[screenshot: the failing categorical column]

Some points to have in mind:

That feature HAS rare categories. The feature that got the error has the following distribution over 3 classes: [screenshot: class distribution of the failing feature]

It's important to note that the imputer didn't fail on two other categorical features. One of them has the following distribution over 3 classes: [screenshot: class distribution of a non-failing feature]

My suspicion is: the classifier was unable to predict an imbalanced category, generating a zero probability; this zero affects the odds_ratio and a warning message is emitted. Because of this, lightgbm cannot generate a prediction for that class, and the index error appears because there is no prediction for it.

The reason the imputer works without mean matching on categoricals is that it just takes the highest probability, so the results for the other categories don't matter.

AnotherSamWilson commented 2 years ago

Okay I finally got around to torturing lightgbm with imbalanced classes. I can get it to output a probability of 1.0 when the lowest class is around 0.05% of the total samples.

I see no other option than to throw a warning when categorical level counts are below some small threshold of the total. I don't think I can keep lightgbm from outputting a 1.0 or 0.0 probability when a certain level is only 0.05% of the total categories.

A good solution for you, I think, is to set the min_data_in_leaf parameter so that lightgbm can't single out those 3 values and give them a 0.0 probability. If you set min_data_in_leaf to 10 or something, does the error go away? Call this:

mice_imputer.mice(iterations=5, verbose=True, n_jobs=-1, min_data_in_leaf=10)

AnotherSamWilson commented 2 years ago

The problem with lightgbm outputting a 1.0 or 0.0 has been fixed as well as it can be. If you run into any different problems, please open a new issue. If this problem persists, my recommendation is to alter lightgbm parameters so that the model cannot output a 1.0 or 0.0 probability for any classes.

KaikeWesleyReis commented 2 years ago

Thanks for your reply, @AnotherSamWilson. I will try this approach next week. Is it possible to pass any class_weight or sample_weight to the lightgbm model?

And you are right: it's a shortcoming of my data itself, given the rarity of some categories.

AnotherSamWilson commented 2 years ago

It should be possible to use the scale_pos_weight parameter. You would need to pass it to variable_parameters specifically for the problem column.

And now that I think about it, I'm not convinced that this would cause much of a problem. If your predictions have been altered for the bachelors and the candidates, then the distribution of imputations might still be similar to the original distribution. I'll have to experiment with this next week, it could be a good solution to a problem like yours.