AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License

Could you add skip columns? #55

Closed by keonho-kim 2 years ago

keonho-kim commented 2 years ago

Hi :) I'm a big fan of this library!

I'm trying to change the source code to make it run the way I want, but it's not that easy :(

I'm currently working on a Kaggle competition with a huge dataset that includes missing values.

Although this library and LightGBM are pretty fast, I only want to fill the columns that have missing values, not all columns.

My suggestion is to add a list to the mice method that specifies which columns to impute.

instead of

for variable in self.variable_training_order:

loop over a new, user-specified list of variables:

for var in self.variable_people_select:

I'm also wondering about just adding one parameter to the mice method, like this:

    def mice(
        self,
        iterations=2,
        verbose=False,
        variable_parameters=None,
        compile_candidates=False,
        user_specified_columns=None,  # <- proposed new parameter
        **kwlgb,
    ):

Thank you for your awesome library ;)

AnotherSamWilson commented 2 years ago

By default, the process should only be imputing columns that have missing values. If you're seeing models being run for any columns that don't have missing values (and train_nonmissing is set to False), then can you post a reproducible example?

As an aside, you can specify which variables get imputed by using the variable_schema parameter. You can also get more complicated with it if you want to build models on non-missing data: set train_nonmissing to True and combine the imputation_order and variable_schema parameters.
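As a rough illustration of the variable_schema idea, here is a minimal sketch in plain Python that builds a dict mapping each column to impute onto the columns used to predict it. The helper name build_variable_schema and the column names are hypothetical, and it assumes variable_schema accepts a dict whose keys are the variables to impute and whose values are their predictors, as in the version of the library discussed in this thread. The final comment shows where the result would be passed; that call is not executed here.

```python
def build_variable_schema(all_columns, columns_to_impute):
    """Map each column to impute onto the columns used to predict it.

    Hypothetical helper, not part of miceforest.
    """
    unknown = [c for c in columns_to_impute if c not in all_columns]
    if unknown:
        raise ValueError(f"Unknown columns: {unknown}")
    # Each target is predicted from every other column in the dataset.
    return {
        target: [c for c in all_columns if c != target]
        for target in columns_to_impute
    }


# Illustrative column names only.
columns = ["age", "income", "city", "score"]
schema = build_variable_schema(columns, ["income", "score"])
print(schema)

# Assumed usage, based on the parameter named above:
# kernel = mf.ImputationKernel(data=df, variable_schema=schema)
```

Restricting the keys of the dict is what limits which columns get a model; the values could also be narrowed if only some predictors should be used.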

AnotherSamWilson commented 2 years ago

See this example:

# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older scikit-learn.
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import miceforest as mf

random_state = np.random.RandomState(5)
# Load the feature matrix and give the columns string names ("0", "1", ...)
boston = pd.DataFrame(load_boston(return_X_y=True)[0])
boston.columns = [str(i) for i in boston.columns]
# Turn two columns into categoricals
boston["3"] = boston["3"].map({0: 'a', 1: 'b'}).astype('category')
boston["8"] = boston["8"].astype("category")
# Introduce missing values into column "4" only
boston_amp = mf.ampute_data(boston, perc=0.25, random_state=random_state, variables=["4"])

kernel = mf.ImputationKernel(
    data=boston_amp,
    datasets=2,
    save_models=1
)
kernel.mice(iterations=2, compile_candidates=True, verbose=True)

Only variable 4 gets a model run because only that variable had missing values.
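To make the default behavior described above concrete, here is a small standalone sketch that finds which columns contain at least one missing value, which are the only columns that would need a model. It is plain Python for illustration and is not miceforest's actual implementation; the function names and the sample rows are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def columns_with_missing(rows):
    """Return the names of columns that have at least one missing value.

    rows is a list of dicts mapping column name to value.
    Columns are returned in the order they are first seen as missing.
    """
    found = []
    for row in rows:
        for name, value in row.items():
            if is_missing(value) and name not in found:
                found.append(name)
    return found


# Illustrative data: only column "4" has missing entries.
rows = [
    {"3": "a", "4": 0.538, "8": 1},
    {"3": "b", "4": None, "8": 2},
    {"3": "a", "4": float("nan"), "8": 3},
]
to_model = columns_with_missing(rows)
print(to_model)
```

Columns "3" and "8" are complete here, so under this sketch they would be skipped as imputation targets while still being available as predictors.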

keonho-kim commented 2 years ago

@AnotherSamWilson

Oooh.. I didn't set train_nonmissing to False :)

I missed the variable_schema parameter!

Problem solved! Thank you for the quick response!