AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License

Index error for multi class in mean match function #64

Closed Ynliu1017 closed 1 year ago

Ynliu1017 commented 2 years ago

[image: screenshot not preserved]

I found this index error. I think it should be shape[0] - 1.

AnotherSamWilson commented 2 years ago

Hmm, np.arange() returns 0..(shape-1), so that should be the correct size. Can you post the data or a reproducible example? The only thing I can think of is that the data being imputed is a pandas DataFrame with a shuffled index or something similar.
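To illustrate the np.arange() point, here is a minimal sketch of the row-selection pattern that appears in the traceback. The arrays are made-up toy values, not miceforest's data; the sketch only shows that the pattern stays in bounds as long as every entry of knn_indices is a valid position in candidate_values.

```python
import numpy as np

# Toy stand-ins (assumed values, not taken from miceforest):
# each row holds the candidate positions of one missing value's nearest neighbors.
knn_indices = np.array([[4, 1, 3],
                        [0, 2, 4]])
# One chosen neighbor column per row.
ind = np.array([2, 0])

rows = np.arange(knn_indices.shape[0])  # 0..shape[0]-1, one entry per row
index_choice = knn_indices[rows, ind]   # picks knn_indices[0, 2] and knn_indices[1, 0]

candidate_values = np.array([10, 11, 12, 13, 14])
imp_values = candidate_values[index_choice]
print(rows.tolist(), index_choice.tolist(), imp_values.tolist())
```

So the np.arange() call itself cannot run past the end of the array; an out-of-bounds index has to come from the values stored in knn_indices.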

jcytam commented 2 years ago

I encountered this issue as well, and managed to reproduce it using the aids dataset from sksurv which has 1151 samples.

Code:

from sksurv.datasets import load_aids
from miceforest import ampute_data, ImputationKernel

# set random state
random=1234

# load the AIDS dataset from sksurv (features only)
data = load_aids()[0]

# ampute dataset
data = ampute_data(
    data,
    perc=0.3,
    random_state=random,
)

imp = ImputationKernel(
    data,
    datasets=5,
    random_state=random,
)

imp.mice(iterations=5)

Error message:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In [34], line 23
     11 data = ampute_data(
     12     data,
     13     perc=0.3,
     14     random_state=random,
     15     )
     17 imp = ImputationKernel(
     18     data,
     19     datasets=5,
     20     random_state=random,
     21 )
---> 23 imp.mice(iterations=5)

File *\lib\site-packages\miceforest\ImputationKernel.py:1164, in ImputationKernel.mice(self, iterations, verbose, variable_parameters, compile_candidates, **kwlgb)
   1161     mm_kwargs["hashed_seeds"] = None
   1163 logger.set_start_time()
-> 1164 imp_values = self.mean_match_scheme._mean_match(
   1165     variable, objective, **mm_kwargs
   1166 )
   1167 logger.record_time(timed_event="mean_matching", **log_context)
   1169 assert imp_values.shape == (
   1170     self.na_counts[variable],
   1171 ), f"{variable} mean matching returned malformed array"
...
    215         index_choice = knn_indices[np.arange(knn_indices.shape[0]), ind]
--> 217     imp_values = np.array(candidate_values)[index_choice]
    219 return imp_values

IndexError: index 806 is out of bounds for axis 0 with size 806

The original dataset I was working on had ~700 samples, and from playing around, this problem seems to show up most often with larger sample sizes.

AnotherSamWilson commented 2 years ago

This appears to actually be a bug in scipy.spatial.KDTree. We'll have to see what they say.
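One documented KDTree behavior is consistent with the traceback above, where the failing index (806) equals the array size (806): query() reports a missing neighbor with the index n, the number of points in the tree, together with an infinite distance. The sketch below only reproduces that sentinel by asking for more neighbors than the tree holds. The toy points and candidate_values are assumptions, and it does not claim to follow the exact path miceforest hits.

```python
import numpy as np
from scipy.spatial import KDTree

# Toy data (assumed): three 1-D points.
points = np.array([[0.0], [1.0], [2.0]])
tree = KDTree(points)

# Ask for 5 neighbors although the tree holds only 3 points.
dist, idx = tree.query(np.array([[0.1]]), k=5)

n = points.shape[0]
missing = idx[0] == n  # True where KDTree found no neighbor
print(idx[0].tolist(), missing.tolist())

# Using the sentinel as a position fails the same way as in the traceback.
candidate_values = np.array([10.0, 11.0, 12.0])
try:
    candidate_values[idx[0]]
    raised = False
except IndexError:
    raised = True
print(raised)
```

Checking the returned indices against n before indexing is one way to surface this case early.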

AnotherSamWilson commented 2 years ago

Okay, this is fixable on my end. I should have known this had to do with lightgbm outputting 0.0 probabilities for rare categories. The fix I'm probably going to implement is to set the log-odds to 10 if the probability is 1.0, and to -10 if the probability is 0.0.
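A minimal sketch of that idea, not the actual miceforest implementation: the function name clipped_logodds and the cap parameter are hypothetical. Unclipped log-odds are infinite at probabilities of exactly 0.0 and 1.0, so the sketch substitutes -10 and 10 there and leaves every other probability untouched.

```python
import math

def clipped_logodds(p, cap=10.0):
    """Log-odds of p, with exact 0.0/1.0 probabilities mapped to -cap/+cap.

    Hypothetical helper for illustration; cap=10 follows the comment above.
    """
    if p >= 1.0:
        return cap
    if p <= 0.0:
        return -cap
    return math.log(p / (1.0 - p))

for p in (0.0, 0.5, 1.0):
    print(p, clipped_logodds(p))
```

With finite values everywhere, a nearest-neighbor search over the log-odds no longer has to deal with infinite coordinates.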

For now, the best way to prevent this error is to not have any super rare categorical levels - anything less than 0.5% (1 500th) of the total count tends to trigger this.
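As a rough way to spot such levels before imputing, the sketch below counts category shares and flags anything under a threshold. The helper rare_levels and its min_share parameter are hypothetical, and the 0.5% default is just the rule of thumb quoted above, not a guaranteed cutoff.

```python
from collections import Counter

def rare_levels(values, min_share=0.005):
    """Return {level: count} for levels whose share of non-missing values
    is below min_share. None stands in for a missing value here."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: n for level, n in counts.items() if n / total < min_share}

# Toy column (assumed data): level "c" makes up 0.2% of 1000 values.
column = ["a"] * 600 + ["b"] * 398 + ["c"] * 2
print(rare_levels(column))
```

Levels flagged this way could be merged into a broader category before running the imputation.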

Ynliu1017 commented 2 years ago

Thanks for fixing the error with lightgbm. I experienced the same error when not customizing to lightgbm and instead using the default random forest. Would you also be able to fix that on your end?

AnotherSamWilson commented 2 years ago

Yes, I plan on fixing this error this week. It should fix all occurrences of this error caused by lightgbm outputting 0.0/1.0 probabilities. Random forests should work as well.

Ynliu1017 commented 2 years ago

Sounds perfect! Appreciate that.

AnotherSamWilson commented 1 year ago

This has been implemented in 5.6.3.