Closed Ynliu1017 closed 1 year ago
Hmm, np.arange() will return 0..(shape-1), so that should be the correct size. Can you post the data / a reproducible example? The only thing I can think of is if the data being imputed is a pandas dataframe that has a shuffled index or something.
I encountered this issue as well, and managed to reproduce it using the aids dataset from sksurv which has 1151 samples.
Code:
from sksurv.datasets import load_aids
from miceforest import ampute_data, ImputationKernel
# set random state
random = 1234
# load the AIDS dataset (features only)
data = load_aids()[0]
# ampute dataset
data = ampute_data(
    data,
    perc=0.3,
    random_state=random,
)
imp = ImputationKernel(
    data,
    datasets=5,
    random_state=random,
)
imp.mice(iterations=5)
Error message:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In [34], line 23
11 data = ampute_data(
12 data,
13 perc=0.3,
14 random_state=random,
15 )
17 imp = ImputationKernel(
18 data,
19 datasets=5,
20 random_state=random,
21 )
---> 23 imp.mice(iterations=5)
File *\lib\site-packages\miceforest\ImputationKernel.py:1164, in ImputationKernel.mice(self, iterations, verbose, variable_parameters, compile_candidates, **kwlgb)
1161 mm_kwargs["hashed_seeds"] = None
1163 logger.set_start_time()
-> 1164 imp_values = self.mean_match_scheme._mean_match(
1165 variable, objective, **mm_kwargs
1166 )
1167 logger.record_time(timed_event="mean_matching", **log_context)
1169 assert imp_values.shape == (
1170 self.na_counts[variable],
1171 ), f"{variable} mean matching returned malformed array"
...
215 index_choice = knn_indices[np.arange(knn_indices.shape[0]), ind]
--> 217 imp_values = np.array(candidate_values)[index_choice]
219 return imp_values
IndexError: index 806 is out of bounds for axis 0 with size 806
The original dataset I was working on had ~700 samples in it, and from playing around, this problem seems to rear its head most often with larger sample sizes.
This actually appears to be a bug in scipy.spatial.KDTree. We'll have to see what they say.
Okay, this is fixable on my end. I should have known this had to do with lightgbm outputting 0.0 probabilities for rare categories. The fix I'm probably going to implement is to set the log-odds to 10 if the probability is 1.0, and to -10 if the probability is 0.0.
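For anyone following along, here is a minimal sketch of that idea in plain Python. It is not the code in miceforest: the function name `bounded_logodds` and the `bound` parameter are hypothetical, and only the ±10 values come from the comment above. It shows why a probability of exactly 0.0 or 1.0 is a problem (the log-odds would be infinite) and how substituting a fixed value keeps everything finite.

```python
import math


def bounded_logodds(p, bound=10.0):
    """Convert a probability to log-odds, substituting +/-bound at the extremes.

    Hypothetical helper for illustration only; not part of miceforest.
    """
    if p <= 0.0:
        # log(0 / 1) would be -inf, so use the fixed lower value instead
        return -bound
    if p >= 1.0:
        # log(1 / 0) would be +inf, so use the fixed upper value instead
        return bound
    return math.log(p / (1.0 - p))


# Predicted probabilities for one rare category, including exact 0.0 and 1.0
probs = [0.0, 0.25, 0.5, 1.0]
logodds = [bounded_logodds(p) for p in probs]
```

Every value in `logodds` is finite, so a nearest-neighbour search over them has no infinite coordinates to deal with.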
For now, the best way to prevent this error is to not have any super rare categorical levels. Anything less than 0.5% (1 in 200) of the total count tends to trigger this.
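A rough way to check a column for such levels before imputing is sketched below. The helper name `rare_levels` and the `min_share` parameter are made up for this example, and it assumes missing values are represented as `None`. The 0.5% threshold is the one mentioned above.

```python
from collections import Counter


def rare_levels(values, min_share=0.005):
    """Return {level: share} for levels below min_share of the non-missing values.

    Hypothetical helper for illustration; assumes missing values are None.
    """
    observed = [v for v in values if v is not None]
    total = len(observed)
    counts = Counter(observed)
    return {
        level: n / total
        for level, n in counts.items()
        if n / total < min_share
    }


# Toy categorical column: level "c" makes up only 0.2% of the observed values
column = ["a"] * 600 + ["b"] * 398 + ["c"] * 2 + [None] * 50
flagged = rare_levels(column)
```

Here `flagged` contains only the level `"c"`. With a pandas Series, `value_counts(normalize=True)` gives the same shares directly; flagged levels could then be merged into an "other" level before building the kernel.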
Thanks for fixing the error with lightgbm. I experienced the same error when using the default random forest instead of customizing lightgbm. Would you also be able to fix that on your end?
Yes, I plan on fixing this error this week. It should fix all occurrences of this error caused by lightgbm outputting 0.0/1.0 probabilities. Random forests should work as well.
Sounds perfect! Appreciate that.
This has been implemented in 5.6.3.
I found this index error. Think it should be shape[0] - 1.