AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License

Plot Correlation - NoneType object is not iterable. #72

Closed Nirvana2211 closed 2 months ago

Nirvana2211 commented 1 year ago

I am using miceforest version 5.6.2 on Windows. I am trying to replicate the iris example.

import miceforest as mf
from sklearn.datasets import load_iris
import pandas as pd

iris = pd.concat(load_iris(as_frame = True, return_X_y = True), axis = 1)
iris.rename(columns = {'target' :'species'}, inplace = True)
iris['species'] = iris['species'].astype('category')

iris_amp = mf.ampute_data(iris, perc = 0.25, random_state = 1991)

kernel = mf.ImputationKernel(
    data=iris_amp,
    datasets=5,
    save_all_iterations=True,
    random_state=1991
)

kernel.mice(3, verbose = True)

kernel.plot_correlations() gives the following error:


TypeError                                 Traceback (most recent call last)
in
----> 1 kernel.plot_correlations()

C:\Anaconda3\envs\base_small\lib\site-packages\miceforest\ImputedData.py in plot_correlations(self, datasets, variables, **adj_args)
    727         else:
    728             datasets = _ensure_iterable(datasets)
--> 729         var_indx = self._get_var_ind_from_list(variables)
    730         num_vars = self._get_num_vars(var_indx)
    731         plots, plotrows, plotcols = self._prep_multi_plot(num_vars)

C:\Anaconda3\envs\base_small\lib\site-packages\miceforest\ImputedData.py in _get_var_ind_from_list(self, variable_list)
    316         ret = [
    317             int(self.column_names.index(x)) if isinstance(x, str) else int(x)
--> 318             for x in variable_list
    319         ]
    320

TypeError: 'NoneType' object is not iterable

IanWord commented 1 year ago

Experiencing the same problem, have not found a solution yet.

AnotherSamWilson commented 1 year ago

Sorry for the late response, I'll look into this later tonight.

IanWord commented 1 year ago

Hi AnotherSamWilson, thank you! I am having problems getting plots out in general. I have fitted a kernel to a dataset with some 140 features, of which we impute only 43 (I passed a list as variable_schema to indicate this). Plotting all of them is obviously not very pretty, but I notice that if I do, it plots 49 figures, 6 of them obviously empty. So I tried a for loop:

step = 5  # plot 5 variables at a time
for i in range(0, len(imputable), step):
    kernel.plot_mean_convergence(variables=imputable[i:i+step], wspace=1.6, hspace=1.8)

And weirdly enough, each plot has at least one empty figure, like the one below:

[Image: Figure 2023-03-19 171638]

Not sure what to do about it.

In addition, how do I get feature names on the above mean_convergence plot?

Thank you for your time.

AnotherSamWilson commented 1 year ago

When I last looked into these multiplots, I couldn't figure out how to prevent those empty plots from showing up... I'm not sure if it's possible. Either way, I want to get away from raw matplotlib in most plots, it's too much of a hassle.
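For reference, in plain matplotlib the leftover panels of a subplot grid can usually be hidden with `Axes.set_visible(False)`, and each panel can be labeled with `Axes.set_title`. The sketch below is generic and does not use miceforest's plotting code: `plot_grid` and the demo data are made up for illustration, and whether the same approach fits inside the library's own multiplot helper is not confirmed in this thread.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt


def plot_grid(series_by_name, ncols=3):
    """Plot one line chart per named series and hide any leftover panels."""
    names = list(series_by_name)
    nrows = math.ceil(len(names) / ncols)
    # squeeze=False keeps `axes` two-dimensional even for a single row
    fig, axes = plt.subplots(nrows, ncols, squeeze=False)
    flat_axes = axes.ravel()
    for ax, name in zip(flat_axes, names):
        ax.plot(series_by_name[name])
        ax.set_title(name)  # label each panel with its variable name
    # Panels beyond the number of variables would otherwise render empty
    for ax in flat_axes[len(names):]:
        ax.set_visible(False)
    return fig


# Five made-up series on a 2 x 3 grid leave exactly one panel unused
demo = {f"var_{i}": [i, i + 1, i + 0.5] for i in range(5)}
fig = plot_grid(demo)
```

Hiding the axes keeps the grid geometry intact, so the visible panels do not shift; `fig.delaxes(ax)` is an alternative if the unused axes should be removed from the figure entirely.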

IanWord commented 1 year ago

Okay, fair enough!

Do you have any suggestions for .save_kernel() when it returns an error like this:

raise ValueError("%s cannot be larger than %d bytes" %
ValueError: bytesobj cannot be larger than 2147483631 bytes

At first, it stated that I needed an optional dependency such as pyarrow or fastparquet; I went with pyarrow.
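The 2147483631-byte limit is just under 2 GiB, which suggests (an assumption, not confirmed in this thread) that a step which handles the serialized kernel as one in-memory bytes object refuses inputs of that size. One possible workaround is to bypass that step and stream the pickle into a compressed file using only the standard library. In the sketch below, `dump_streaming` and `load_streaming` are hypothetical helpers, not part of miceforest, and the payload is a stand-in; this has not been tested on an actual kernel, which may also need a different pickler.

```python
import gzip
import os
import pickle
import tempfile


def dump_streaming(obj, path):
    """Pickle `obj` straight into a gzip file rather than via one bytes object."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_streaming(path):
    """Read back an object written by dump_streaming."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in payload; a real kernel would take its place
payload = {"values": list(range(1000)), "name": "demo"}
path = os.path.join(tempfile.mkdtemp(), "payload.pkl.gz")
dump_streaming(payload, path)
restored = load_streaming(path)
```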

gorj-tessella commented 1 year ago

I believe the problem here is that variables=None by default, but then that is passed to _get_var_ind_from_list which requires an actual list. Presumably it should first sanitize and generate the full list of variables as is done in plot_imputed_distributions.

https://github.com/AnotherSamWilson/miceforest/blob/d9359a89204e3b5f10cc02e7e621a22c213e5453/miceforest/ImputedData.py#L596
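The fix described above can be sketched as follows. This is a minimal illustration, not the library's code: `ImputedDataSketch`, `imputed_indices`, and `resolve_variables` are hypothetical names, and only `_get_var_ind_from_list` is modeled on the comprehension shown in the traceback.

```python
from typing import List, Sequence


class ImputedDataSketch:
    """Hypothetical stand-in for the real class; only the lookup logic is modeled."""

    def __init__(self, column_names: Sequence[str], imputed_indices: Sequence[int]):
        self.column_names = list(column_names)
        self.imputed_indices = list(imputed_indices)  # columns that were imputed

    def _get_var_ind_from_list(self, variable_list) -> List[int]:
        # Mirrors the comprehension in the traceback: names become indices
        return [
            int(self.column_names.index(x)) if isinstance(x, str) else int(x)
            for x in variable_list
        ]

    def resolve_variables(self, variables=None) -> List[int]:
        # Fall back to every imputed column before the lookup, so None never
        # reaches the list comprehension
        if variables is None:
            variables = self.imputed_indices
        return self._get_var_ind_from_list(variables)


data = ImputedDataSketch(
    column_names=["sepal length", "sepal width", "species"],
    imputed_indices=[0, 2],
)
default_selection = data.resolve_variables()
named_selection = data.resolve_variables(["species", 1])
```

With the default handled up front, calling the method without arguments selects all imputed columns, while explicit names or indices are still converted as before.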

AnotherSamWilson commented 2 months ago

This shouldn't be a problem in major version 6. The plotting functionality doesn't exist yet, but it should be much easier to implement with plotnine.