Closed keonho-kim closed 2 years ago
By default, the process should only be imputing columns that have missing values. If you're seeing models being run for any columns that don't have missing values (and train_nonmissing is set to False), then can you post a reproducible example?
As an aside, you can specify which variables get imputed by using the variable_schema parameter. You can also get more complicated with it if you want to build models on non-missing data, by setting train_nonmissing to True and setting a combination of the imputation_order and variable_schema parameters.
See this example:
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older version.
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import miceforest as mf
random_state = np.random.RandomState(5)
boston = pd.DataFrame(load_boston(return_X_y=True)[0])
boston.columns = [str(i) for i in boston.columns]
boston["3"] = boston["3"].map({0: 'a', 1: 'b'}).astype('category')
boston["8"] = boston["8"].astype("category")
boston_amp = mf.ampute_data(boston, perc=0.25, random_state=random_state, variables=["4"])
kernel = mf.ImputationKernel(
    data=boston_amp,
    datasets=2,
    save_models=1
)
kernel.mice(iterations=2, compile_candidates=True, verbose=True)
Only variable 4 gets a model run because only that variable had missing values.
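For the variable_schema route mentioned above, here is a minimal sketch of restricting imputation to a chosen list of columns. The helper names `columns_with_missing` and `impute_selected` and the small frame are made up for illustration. The sketch assumes variable_schema accepts a plain list of column names and that the installed miceforest version takes the keyword arguments shown, so check them against your version.

```python
import numpy as np
import pandas as pd


def columns_with_missing(df: pd.DataFrame) -> list:
    """Return the names of columns that contain at least one missing value."""
    return [col for col in df.columns if df[col].isna().any()]


def impute_selected(df: pd.DataFrame, variables: list, iterations: int = 2):
    """Hypothetical helper: impute only `variables` by passing them as variable_schema.

    Assumes variable_schema accepts a list of column names.
    """
    import miceforest as mf  # imported lazily so the helper above works without it

    kernel = mf.ImputationKernel(data=df, variable_schema=variables, random_state=5)
    kernel.mice(iterations=iterations)
    return kernel.complete_data()


# Small illustrative frame (made-up values): only column "b" has a gap.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.5, np.nan, 1.5, 2.0],
    "c": [10.0, 20.0, 30.0, 40.0],
})
to_impute = columns_with_missing(df)
print(to_impute)
```

On a real dataset you would then call something like `impute_selected(data, to_impute)`. Passing the list explicitly mostly matters when train_nonmissing is True or when you want to impute only a subset of the columns that have gaps.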
@AnotherSamWilson
Oooh.. I didn't set train_nonmissing to False :)
I missed the variable_schema parameter!
Problem solved! Thank you for the quick response!
Hi :) I'm a big fan of this library!
I'm trying to change the source code to make it run the way I want, but it is not that easy :(
I'm currently working on a Kaggle competition with a huge dataset that includes missing values.
Although this nice library and LGBM are pretty fast, I want to fill only the columns that have missing values, not all columns.
My suggestion is to add a list to the mice method to specify columns.
instead of
put new variable
Also, I'm wondering about just adding one parameter to the mice method, like,
Thank you for your awesome library ;)