AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License
350 stars, 31 forks

plot_imputed_distributions() fails with LinAlgError: Singular Matrix #10

Closed raj-shr-git closed 2 months ago

raj-shr-git commented 3 years ago

Hello,

Hope you are doing well!!

I was working with MiceForest on a toy dataset to understand how it works, and in the process I came across "LinAlgError: singular matrix" while plotting the imputed distributions.

Can you please refer to the details below and let me know whether I'm doing something wrong?

Step-1 - Create multiple kernel datasets

    kernel = mf.MultipleImputedKernel(
        data=iris_amp,
        save_all_iterations=True,
        datasets=10,
        mean_match_candidates=5,
        save_models=True,
        random_state=41
    )

Step-2 - Run the MICE algorithm for 5 iterations on each dataset

    kernel.mice(5, verbose=True, max_depth=4)

Step-3 - Impute the new dataset

    new_data_imputed = kernel.impute_new_data(new_data)

Step-4 - Return the completed dataset at index 9

    new_completed_data = new_data_imputed.complete_data(9)

Step-5 - Plot the imputed distributions

    new_data_imputed.plot_imputed_distributions(wspace=0.5, hspace=0.8)

This results in the error below. However, step-5 works fine if I re-run steps 3 and 4 and then try to plot the imputed distributions again.

    LinAlgError                               Traceback (most recent call last)
    in
    ----> 1 new_data_imputed.plot_imputed_distributions(wspace=0.5,hspace=0.8)

    c:\users\appdata\local\programs\python\python36\lib\site-packages\miceforest\MultipleImputedDataSet.py in plot_imputed_distributions(self, variables, iteration, **adj_args)
        348 )
        349 for imparray in iteration_level_imputations.values():
    --> 350     ax[axr, axc] = sns.kdeplot(imparray, color="black", linewidth=1)
        351
        352 plt.subplots_adjust(**adj_args)

    c:\users\appdata\local\programs\python\python36\lib\site-packages\seaborn\_decorators.py in inner_f(*args, **kwargs)
         44 )
         45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
    ---> 46 return f(**kwargs)
         47 return inner_f
         48

    c:\users\appdata\local\programs\python\python36\lib\site-packages\seaborn\distributions.py in kdeplot(x, y, shade, vertical, kernel, bw, gridsize, cut, clip, legend, cumulative, shade_lowest, cbar, cbar_ax, cbar_kws, ax, weights, hue, palette, hue_order, hue_norm, multiple, common_norm, common_grid, levels, thresh, bw_method, bw_adjust, log_scale, color, fill, data, data2, **kwargs)
       1733 legend=legend,
       1734 estimate_kws=estimate_kws,
    -> 1735 **plot_kws,
       1736 )
       1737

    c:\users\appdata\local\programs\python\python36\lib\site-packages\seaborn\distributions.py in plot_univariate_density(self, multiple, common_norm, common_grid, fill, legend, estimate_kws, **plot_kws)
        914 common_grid,
        915 estimate_kws,
    --> 916 log_scale,
        917 )
        918

    c:\users\appdata\local\programs\python\python36\lib\site-packages\seaborn\distributions.py in _compute_univariate_density(self, data_variable, common_norm, common_grid, estimate_kws, log_scale)
        314
        315 # Estimate the density of observations at this level
    --> 316 density, support = estimator(observations, weights=weights)
        317
        318 if log_scale:

    c:\users\appdata\local\programs\python\python36\lib\site-packages\seaborn\_statistics.py in __call__(self, x1, x2, weights)
        185 """Fit and evaluate on univariate or bivariate data."""
        186 if x2 is None:
    --> 187     return self._eval_univariate(x1, weights)
        188 else:
        189     return self._eval_bivariate(x1, x2, weights)

    c:\users\appdata\local\programs\python\python36\lib\site-packages\seaborn\_statistics.py in _eval_univariate(self, x, weights)
        144 support = self.support
        145 if support is None:
    --> 146     support = self.define_support(x, cache=False)
        147
        148 kde = self._fit(x, weights)

    c:\users\appdata\local\programs\python\python36\lib\site-packages\seaborn\_statistics.py in define_support(self, x1, x2, weights, cache)
        117 """Create the evaluation grid for a given data set."""
        118 if x2 is None:
    --> 119     support = self._define_support_univariate(x1, weights)
        120 else:
        121     support = self._define_support_bivariate(x1, x2, weights)

    c:\users\appdata\local\programs\python\python36\lib\site-packages\seaborn\_statistics.py in _define_support_univariate(self, x, weights)
         89 def _define_support_univariate(self, x, weights):
         90     """Create a 1D grid of evaluation points."""
    ---> 91     kde = self._fit(x, weights)
         92     bw = np.sqrt(kde.covariance.squeeze())
         93     grid = self._define_support_grid(

    c:\users\appdata\local\programs\python\python36\lib\site-packages\seaborn\_statistics.py in _fit(self, fit_data, weights)
        135 fit_kws["weights"] = weights
        136
    --> 137 kde = stats.gaussian_kde(fit_data, **fit_kws)
        138 kde.set_bandwidth(kde.factor * self.bw_adjust)
        139

    c:\users\appdata\local\programs\python\python36\lib\site-packages\scipy\stats\kde.py in __init__(self, dataset, bw_method, weights)
        204 self._neff = 1/sum(self._weights**2)
        205
    --> 206 self.set_bandwidth(bw_method=bw_method)
        207
        208 def evaluate(self, points):

    c:\users\appdata\local\programs\python\python36\lib\site-packages\scipy\stats\kde.py in set_bandwidth(self, bw_method)
        554 raise ValueError(msg)
        555
    --> 556 self._compute_covariance()
        557
        558 def _compute_covariance(self):

    c:\users\appdata\local\programs\python\python36\lib\site-packages\scipy\stats\kde.py in _compute_covariance(self)
        566 bias=False,
        567 aweights=self.weights))
    --> 568 self._data_inv_cov = linalg.inv(self._data_covariance)
        569
        570 self.covariance = self._data_covariance * self.factor**2

    c:\users\appdata\local\programs\python\python36\lib\site-packages\scipy\linalg\basic.py in inv(a, overwrite_a, check_finite)
        975 inv_a, info = getri(lu, piv, lwork=lwork, overwrite_lu=1)
        976 if info > 0:
    --> 977     raise LinAlgError("singular matrix")
        978 if info < 0:
        979     raise ValueError('illegal value in %d-th argument of internal '

    LinAlgError: singular matrix

![LinAlgError](https://user-images.githubusercontent.com/51537572/124480873-55a36700-ddc5-11eb-97be-ffd12e0d3c11.jpg)

**Below are python and other packages info:**

- python -- 3.6.2
- Numpy -- 1.19.5
- Pandas -- 1.1.2
- Sklearn -- 0.23.2
- Matplotlib -- 3.3.0
- Seaborn -- 0.11.1
- MiceForest -- 2.0.5

AnotherSamWilson commented 3 years ago

Hmmm are the imputed values for that second variable (top right in the chart) some constant value? I could see this error coming from trying to get the kernel smoothing parameter for a set of numbers with 0 variance.
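For illustration, here is a minimal sketch of why identical imputed values would trigger this error. It mirrors the two steps at the bottom of the traceback: compute the data covariance, then invert it. It is not the scipy code itself. `np.linalg.inv` stands in for `scipy.linalg.inv`, and the helper name `kde_covariance_is_invertible` and the sample values are made up for the example.

```python
import numpy as np


def kde_covariance_is_invertible(values):
    """Return True if the 1x1 data covariance a Gaussian KDE needs can be inverted.

    Hypothetical helper for illustration; not part of miceforest, seaborn or scipy.
    """
    data = np.asarray(values, dtype=float)
    # For 1-D data the covariance "matrix" is just the sample variance, shaped (1, 1).
    covariance = np.atleast_2d(np.cov(data, rowvar=True, bias=False))
    try:
        np.linalg.inv(covariance)
    except np.linalg.LinAlgError:
        # Zero variance gives the matrix [[0.]], which has no inverse.
        return False
    return True


identical = [3.0, 3.0, 3.0]  # three imputed values that happen to be the same
varied = [3.0, 3.2, 2.9]     # imputed values with some spread

print(kde_covariance_is_invertible(identical))
print(kde_covariance_is_invertible(varied))
```

With identical values the covariance is exactly zero, so the inversion fails in the same way as the last frame of the traceback.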

raj-shr-git commented 3 years ago

I believe you are right; kindly refer to the screenshot below:

[Screenshot: PW_0_Var_Imp_Data]

I think the error is caused by the imputed values highlighted in yellow.

It seems these 3 records were treated as similar in this iteration and so ended up with the same imputed value. In a bigger dataset this might not happen, because at least some observations would vary. I used a dataset of only 50 records, with only 3 missing observations for this variable, which might have led to this error. Right?

If that's the case, shouldn't it show something like a Dirac delta function instead of raising an error?

AnotherSamWilson commented 3 years ago

Thanks, well that's definitely a bug - I'll have to make it skip any imputations with 0 variance.
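A rough sketch of what such a skip could look like, assuming the imputations arrive as a dict that maps a dataset index to its imputed values, as the loop over `iteration_level_imputations.values()` in the traceback suggests. This is not the actual miceforest fix; the function name `split_by_variance` and the sample data are hypothetical.

```python
from statistics import pvariance


def split_by_variance(imputations):
    """Separate imputed arrays that can be KDE-smoothed from constant ones.

    Hypothetical helper for illustration; not part of miceforest.
    """
    plottable, constant = {}, {}
    for dataset, values in imputations.items():
        values = list(values)
        # Fewer than two values, or all values identical, means zero variance,
        # so a kernel density estimate cannot be fitted.
        if len(values) < 2 or pvariance(values) == 0:
            constant[dataset] = values
        else:
            plottable[dataset] = values
    return plottable, constant


imputations = {
    0: [3.0, 3.0, 3.0],  # identical imputed values, would raise in the KDE
    1: [3.0, 3.2, 2.9],  # has spread, safe to smooth
}
plottable, constant = split_by_variance(imputations)
print(sorted(plottable), sorted(constant))
```

Only the arrays in `plottable` would be passed to `sns.kdeplot`. The ones in `constant` could either be skipped or drawn as a single vertical line at the repeated value, which is close to the Dirac delta idea raised above.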

AnotherSamWilson commented 2 months ago

Should be fine in plotnine