Closed KaikeWesleyReis closed 2 years ago
Can you show a reproducible example? There are tests that use mmc == 0 and mmc != 0 for both of the mean matching schemes, for categorical and numeric variables.
I was trying to generate an error example with the iris dataset, but without success. Here is the example:
import pandas as pd
import miceforest as mf
from sklearn.datasets import load_iris
# Load data and introduce missing values
iris = pd.concat(load_iris(as_frame=True, return_X_y=True), axis=1)
iris.rename({"target": "species"}, inplace=True, axis=1)
# Create a dummy column
iris['species2'] = iris['species'].copy()
iris.loc[len(iris)-1, 'species2'] = 'rare cat'
iris.loc[len(iris)-2, 'species2'] = 'rare cat'
iris.loc[len(iris)-3, 'species2'] = 'rare cat another'
# Convert to category type
iris['species'] = iris['species'].astype('category')
iris['species2'] = iris['species2'].astype('category')
iris_amp = mf.ampute_data(iris,perc=0.35,random_state=1991)
# Create kernel.
kds = mf.ImputationKernel(data=iris_amp,
                          datasets=5,
                          save_models=2,
                          random_state=1206,
                          train_nonmissing=True,
                          save_all_iterations=True,
                          save_loggers=True,
                          #mean_match_scheme=mf.mean_match_schemes.mean_match_scheme_fast_cat,
                          mean_match_candidates=5)
# Run the MICE algorithm for 2 iterations
kds.mice(2, verbose=True, n_jobs=-1)
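For reference, once mice() finishes, the completed data can be pulled back out of the kernel with complete_data(); a minimal sketch:
# Retrieve the first completed dataset and confirm nothing is left missing
iris_completed = kds.complete_data(dataset=0)
print(iris_completed.isnull().sum())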
Side note - @AnotherSamWilson, does using n_jobs=-1 in MICE still work as intended (using all CPU power to execute the imputation procedure)?
The previous example works just fine. The problem is possibly related to a categorical column in my own dataset. Here is its value_counts(), and I believe the cause is the rare categories and the quantity of categories that I have here:
An error is generated given the multiclass prediction nature of the function _mean_match_multiclass_accurate.
Anyway, thanks for your reply here.
@AnotherSamWilson
I actually found a very rare error that occurs for category columns when mmc = 0 if there is a lot of data. Because mmc predictions are stored as float16, it can cause the predictions to add up to 0.9995 instead of 1.0. If a random number is then chosen above 0.9995, but below 1.0, it fails because it couldn't find the index.
In your real code that is failing, if you specify the prediction_dtypes parameter to be float32 for the failing column, does it fix the problem?
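For illustration only (this is not miceforest's actual code), a minimal numpy sketch of how float16 rounding can leave the cumulative probabilities short of 1.0, so a uniform draw above the last value has no matching index:
import numpy as np

probs = np.full(1000, 1 / 1000, dtype=np.float16)  # each entry rounds slightly
cumsum = np.cumsum(probs)                          # accumulated in float16
print(cumsum[-1])                                  # lands below 1.0
draw = 0.9998                                      # uniform draw above the total
idx = np.searchsorted(cumsum, draw)
print(idx >= len(probs))                           # True -> index lookup fails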
I have tried as you recommended, but without success:
# Define Imputer
aux = mf.ImputationKernel(data=x_train_raw,
                          datasets=5,
                          prediction_dtypes=pred_dtypes,
                          save_models=2,
                          random_state=1206,
                          train_nonmissing=True,
                          save_all_iterations=True,
                          save_loggers=True,
                          #mean_match_scheme=mf.mean_match_schemes.mean_match_scheme_fast_cat,
                          mean_match_candidates=5)
# Run the MICE algorithm for X iterations
aux.mice(iterations=5, verbose=True, n_jobs=-1)
Where pred_dtypes is defined as: {'F1': 'float32', 'F2': 'float32'}
From my perspective it is an issue related to multiclass + a lot of data (my case) with rare categories. Before the failing error, this happens:
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:377: RuntimeWarning: divide by zero encountered in true_divide
odds_ratio = probability / (1 - probability)
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:378: RuntimeWarning: divide by zero encountered in log
log_odds = np.log(odds_ratio)
So from my perspective, I'm getting several zero divisions caused by rare categories.
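For what it's worth, both warnings are reproducible in isolation; a minimal numpy sketch mirroring the two utils.py lines quoted above, assuming the model emits a hard 0.0 and 1.0:
import numpy as np

probability = np.array([0.0, 0.5, 1.0])
odds_ratio = probability / (1 - probability)  # 1.0 -> divide by zero -> inf
log_odds = np.log(odds_ratio)                 # 0.0 -> odds of 0 -> log(0) -> -inf
print(odds_ratio, log_odds)                   # [0. 1. inf] [-inf 0. inf]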
EDIT - I just removed the problematic column (with rare categories) and mice fails on another column with only 3 categories, so I retract my previous suspicion.
Hmm, that means that lightgbm output 1.0 and 0.0 probabilities somewhere. As a sanity check, can you confirm that your non-missing categorical data has at least 1 recognized value for each category that is passed? I'm wondering if lightgbm will say a probability is 0% if there is an associated class in the CategoricalDtype, but no actual values in the data.
Can you also confirm which version of miceforest you are running.
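One way to run that sanity check, assuming pandas categoricals (Series.value_counts() reports declared-but-unobserved categories with a count of 0):
import pandas as pd

s = pd.Series(['X1', 'X2', 'X1'],
              dtype=pd.CategoricalDtype(['X1', 'X2', 'X3']))
counts = s.value_counts()                  # X3 shows up with count 0
print(counts[counts == 0].index.tolist())  # ['X3'] -> level with no data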
I think that I just found the issue: rare categories. I just executed the code after removing two categorical columns that have rare categories (one of the columns has 3 values and the other 12 values, for more than 180k records) and it works just fine (no need for prediction_dtypes either).
My version is 5.5.4
This raises some questions for me: when you construct the classifiers for multi-category columns, is there any class_weight to avoid imbalanced scenarios? And why is a rare category generating this problem?
There is no weight included by default, but you can specify a positive and negative weight by passing them as lightgbm parameters. Keep in mind, if mean matching candidates == 0, mean matching will generate imputation values based on the lightgbm class probabilities, so you may be over-sampling an imbalanced class if you do this.
Unless you have data_subset set to something other than 1.0, I don't know why it would be failing for rare categories. I need to do some experimenting. The only thing I can think of is that lightgbm starts those categories off at such a low value, that it gets rounded down to 0. Thanks for bringing this up though.
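To make the over-sampling point concrete, a rough numpy sketch (not miceforest's internals) of drawing imputation values from class probabilities, assuming one prediction row per missing value:
import numpy as np

rng = np.random.default_rng(0)
classes = np.array(['A', 'B', 'C'])
bachelor_preds = np.array([[0.90, 0.08, 0.02],   # one probability row
                           [0.10, 0.85, 0.05]])  # per missing value
imputed = [rng.choice(classes, p=row) for row in bachelor_preds]
print(imputed)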
Glad to help. Two questions before:
* Does n_jobs work as defined in mice()?
* When we have categorical columns, as in my case, and we predict a numerical column, how are those categorical columns transformed to become model features?
By default lightgbm uses all threads. You can set n_jobs to change the thread number. Setting n_jobs=-1 will not do anything.
Lightgbm does not transform categorical features; it treats them as categories. You can read about it here: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features
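For reference, a minimal standalone lightgbm sketch of that behavior: columns with pandas category dtype are used for categorical splits directly, with no one-hot encoding.
import lightgbm as lgb
import numpy as np
import pandas as pd

X = pd.DataFrame({
    'num': np.random.rand(100),
    'cat': pd.Series(np.random.choice(['a', 'b', 'c'], 100), dtype='category'),
})
y = np.random.rand(100)
# category-dtype columns are picked up as categorical features natively
model = lgb.LGBMRegressor(n_estimators=10).fit(X, y)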
But I set n_jobs=-1. Is this the standard?
Yes, I believe that is the default in lightgbm. There are other things that will affect thread usage, like environment variables and how lightgbm was built. But typically, either passing no kwargs to mice or passing n_jobs=-1 will utilize all threads.
Unless you have data_subset set to something other than 1.0 ...
What do you mean by data_subset? Where can I find more about this argument?
See the ImputationKernel docs: https://miceforest.readthedocs.io/en/latest/ik/miceforest.ImputationKernel.html#miceforest.ImputationKernel
I changed this parameter to an integer >= mean_match_candidates and it executed correctly:
# Define Imputer
mice_imputer = mf.ImputationKernel(data=x_train_raw,
                                   datasets=5,
                                   data_subset=5,
                                   save_models=2,
                                   random_state=1206,
                                   train_nonmissing=True,
                                   save_all_iterations=True,
                                   save_loggers=True,
                                   #mean_match_scheme=mf.mean_match_schemes.mean_match_scheme_fast_cat,
                                   mean_match_candidates=5)
# Run the MICE algorithm for X iterations
mice_imputer.mice(iterations=5, verbose=True, n_jobs=-1)
I will investigate further and try to understand this parameter as well.
PS - data_subset = 1.0 fails the code
Can you show me the error? FYI the data_subset will subset the training data too, so setting it to 5 is probably way too low.
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:377: RuntimeWarning: divide by zero encountered in true_divide
odds_ratio = probability / (1 - probability)
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:378: RuntimeWarning: divide by zero encountered in log
log_odds = np.log(odds_ratio)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-274-0a5b5f4d895a> in <module>
12 mean_match_candidates=5)
13 # Run the MICE algorithm for X iterations
---> 14 mice_imputer.mice(iterations=5, verbose=True, n_jobs=-1)
/opt/conda/lib/python3.7/site-packages/miceforest/ImputationKernel.py in mice(self, iterations, verbose, variable_parameters, compile_candidates, **kwlgb)
1193 random_state=self._random_state,
1194 hashed_seeds=None,
-> 1195 candidate_preds=candidate_preds,
1196 )
1197 )
/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in mean_match_function_kdtree_cat(mmc, model, bachelor_features, candidate_values, random_state, hashed_seeds, candidate_preds)
361 candidate_values,
362 random_state,
--> 363 hashed_seeds,
364 )
365
/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in _mean_match_multiclass_accurate(mmc, bachelor_preds, candidate_preds, candidate_values, random_state, hashed_seeds)
119
120 index_choice = knn_indices[np.arange(knn_indices.shape[0]), ind]
--> 121 imp_values = candidate_values[index_choice]
122
123 return imp_values
/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
964 return self._get_values(key)
965
--> 966 return self._get_with(key)
967
968 def _get_with(self, key):
/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in _get_with(self, key)
999 # (i.e. self.iloc) or label-based (i.e. self.loc)
1000 if not self.index._should_fallback_to_positional():
-> 1001 return self.loc[key]
1002 else:
1003 return self.iloc[key]
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
929
930 maybe_callable = com.apply_if_callable(key, self.obj)
--> 931 return self._getitem_axis(maybe_callable, axis=axis)
932
933 def _is_scalar_access(self, key: tuple):
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1151 raise ValueError("Cannot index with multidimensional key")
1152
-> 1153 return self._getitem_iterable(key, axis=axis)
1154
1155 # nested tuple slicing
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1091
1092 # A collection of keys
-> 1093 keyarr, indexer = self._get_listlike_indexer(key, axis)
1094 return self.obj._reindex_with_indexers(
1095 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377 raise KeyError(f"{not_found} not in index")
1378
1379
KeyError: '[129846] not in index'
And the code:
# Define Imputer
mice_imputer = mf.ImputationKernel(data=x_train_raw,
                                   datasets=5,
                                   data_subset=1.0,
                                   prediction_dtypes=pred_dtypes,
                                   save_models=2,
                                   random_state=1206,
                                   train_nonmissing=True,
                                   save_all_iterations=True,
                                   save_loggers=True,
                                   #mean_match_scheme=mf.mean_match_schemes.mean_match_scheme_fast_cat,
                                   mean_match_candidates=5)
# Run the MICE algorithm for X iterations
mice_imputer.mice(iterations=5, verbose=True, n_jobs=-1)
Even if prediction_dtypes is 'float32' for all columns.
@KaikeWesleyReis If you wouldn't mind, can you pull the MeanMatchScheme branch of this repo and build the package locally, and test for your use case?
To do this, the following commands should work in the terminal (Windows):
git clone https://github.com/AnotherSamWilson/miceforest.git
cd miceforest
git checkout MeanMatchScheme
python setup.py sdist
pip install dist/miceforest-5.6.0.tar.gz
If you have linux, the commands should be similar.
EDIT - I should also note that this version changes how mean matching is controlled. There is now a MeanMatchScheme class, which you probably won't need to mess with. The README for this branch has updated examples of how to work with the new structure. Otherwise, things are pretty much the same as they were.
Hi, I had problems installing this version in a SageMaker notebook. Is it possible for me to install it with a pip command in the notebook?
I just did it successfully with this:
pip install git+https://github.com/AnotherSamWilson/miceforest.git@MeanMatchScheme
Hahah, I discovered this (again today) :D
I executed a test and got an index error:
The only way I can reproduce that error is if somehow the candidate_values is shorter than the candidate_preds. But I don't see how that is possible; both are returned from _make_features_label and should be the same size. Can you show me the script that threw that error?
Can you also show me the result of data.isnull().sum(0).
Hi @AnotherSamWilson, I can try to execute it. But is it possible to insert a "print statement" in that branch to check the return of _make_features_label?
From my perspective, what could be happening is a difference between the categories that my column has (for example X1, X2 and X3) and the actual values in the training data (X1 and X2, for example), given the fact that I created the category dtype BEFORE the train/test split.
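If that is the cause, a possible workaround (just a sketch, assuming the split has already happened) is to drop declared-but-absent levels before building the kernel:
# Drop category levels with no observed values in the training split
for col in x_train_raw.select_dtypes('category').columns:
    x_train_raw[col] = x_train_raw[col].cat.remove_unused_categories()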
The code that I'm using:
# Define Imputer
mice_imputer = mf.ImputationKernel(data=x_train_raw,
                                   datasets=5,
                                   data_subset=1.0,
                                   save_models=2,
                                   random_state=1206,
                                   train_nonmissing=True,
                                   save_all_iterations=True,
                                   save_loggers=True)
                                   #mean_match_scheme=mms.mean_match_scheme_fast_cat,
                                   #mean_match_candidates=5)
# Run the MICE algorithm for X iterations
mice_imputer.mice(iterations=5, verbose=True, n_jobs=-1)
EDIT - @AnotherSamWilson, what is the idea behind data.isnull().sum(0)?
EDIT 2 - As you requested, an example of a categorical column that fails the imputation with the index error:
Some points to keep in mind:
That feature HAS rare categories. The feature that actually got an error has the following distribution for 3 classes:
It's important to note that the imputer didn't fail on two other categorical features. One of them has the following distribution for 3 classes:
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:377: RuntimeWarning: divide by zero encountered in true_divide
odds_ratio = probability / (1 - probability)
My suspicion is: the classifier was unable to predict an imbalanced category, generating a zero; this zero impacts the odds_ratio and a warning message is given. Due to this error, lightgbm is not able to generate a prediction for that class, and so the index error appears, because there is no prediction for it.
The reason the imputer works without mean matching on categorical columns is that it takes the highest probability, so the result for the other categories doesn't matter.
Okay I finally got around to torturing lightgbm with imbalanced classes. I can get it to output a probability of 1.0 when the lowest class is around 0.05% of the total samples.
I see no other option than to throw a warning when categorical level counts are below some small threshold of the total. I don't think I can keep lightgbm from outputting a 1.0 or 0.0 probability when a certain level is only 0.05% of the total categories.
A good solution for you, I think, is to set the min_data_in_leaf parameter so that lightgbm can't single out those 3 values and give them a 0.0 probability. If you set min_data_in_leaf to 10 or something, does the error go away? Call this:
mice_imputer.mice(iterations=5, verbose=True, n_jobs=-1, min_data_in_leaf=10)
The problem with lightgbm outputting a 1.0 or 0.0 has been fixed as well as it can be. If you run into any different problems, please open a new issue. If this problem persists, my recommendation is to alter lightgbm parameters so that the model cannot output a 1.0 or 0.0 probability for any classes.
Thanks for your reply @AnotherSamWilson. I will try this approach next week. Is it possible to pass any class_weight or sample_weight to the lightgbm model?
And you are right: it's a failure of my data itself, given the lack of categories.
It should be possible to use the scale_pos_weight parameter. You would need to pass it to variable_parameters specifically for the problem column.
And now that I think about it, I'm not convinced that this would cause much of a problem. If your predictions have been altered for the bachelors and the candidates, then the distribution of imputations might still be similar to the original distribution. I'll have to experiment with this next week; it could be a good solution to a problem like yours.
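A sketch of what that might look like; the column name 'F1' and the weight value are placeholders:
# Pass lightgbm parameters for one column via variable_parameters
mice_imputer.mice(iterations=5,
                  verbose=True,
                  variable_parameters={'F1': {'scale_pos_weight': 10}})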
Hi,
So another issue that I found using categorical variable imputation (category dtype) is that defining mean_match_candidates != 0 in the kernel definition generates an issue during .mice().
Error message:
What I found is that the default of the mean_match_scheme parameter was causing the issue (because it will evaluate mean matching for categorical features) and that using miceforest.mean_match_schemes.mean_match_scheme_fast_cat worked around it.