feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

multivariate imputation #404

Open solegalli opened 2 years ago

solegalli commented 2 years ago

In multivariate imputation, we estimate the values of missing data using regression or classification models based on the other variables in the data.

The IterativeImputer only allows us to use either regression or classification. But we often have binary, discrete and continuous variables in our datasets, so we would like to use a suitable model for each variable to carry out the imputation.

Can we design a transformer that does exactly so?

It would either recognise binary, multiclass and continuous variables automatically or ask the user to enter them, and then train a suitable model to predict the values of the missing data for each variable type.
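
For illustration, one way the transformer could recognise the variable types is to inspect cardinality and dtype. This is only a rough sketch; the helper name `split_variable_types` and the cardinality threshold are assumptions, not an agreed design:

```python
import pandas as pd

def split_variable_types(X: pd.DataFrame):
    """Classify columns as binary, multiclass or continuous (illustrative heuristic)."""
    binary, multiclass, continuous = [], [], []
    for col in X.columns:
        n_unique = X[col].nunique(dropna=True)
        if n_unique == 2:
            binary.append(col)
        elif X[col].dtype == "object" or n_unique <= 10:
            # low-cardinality or string columns treated as multiclass;
            # the threshold of 10 is an arbitrary assumption
            multiclass.append(col)
        else:
            continuous.append(col)
    return binary, multiclass, continuous
```

Alternatively, the transformer could simply accept user-provided lists of categorical and numerical variables, as suggested above.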

Morgan-Sell commented 2 years ago

Looks fun, @solegalli! I'm happy to tackle this issue. Which issue do you prefer we address first? This issue or #107?

Morgan-Sell commented 1 year ago

hola @solegalli,

I see that sklearn has an experimental version of the IterativeImputer. Do we still want to implement this transformer in feature-engine?

When training the transformer's estimator, will the transformer use the rows with non-missing values of the dependent variable as the training set, and the rows with np.nan as the "test set", i.e., the values to be predicted?

Also, given that there are most likely np.nan values scattered throughout the dataset, I'm assuming we should limit the estimators to models that handle np.nan, e.g., random forest.
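
For reference, the train/predict split described above could look something like this. It is a minimal sketch assuming numeric features and an estimator that tolerates NaN in the features (e.g., HistGradientBoostingRegressor); the function name is purely illustrative:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

def impute_one_variable(X: pd.DataFrame, col: str, estimator=None):
    """Fit on rows where `col` is observed, predict where it is missing."""
    # HistGradientBoosting* handles np.nan in the features natively,
    # which sidesteps NaN scattered across the other columns.
    # Assumes all feature columns are numeric (or already encoded).
    estimator = estimator or HistGradientBoostingRegressor()
    observed = X[col].notna()
    features = X.drop(columns=[col])

    # train on the rows with a known value for `col`
    estimator.fit(features[observed], X.loc[observed, col])

    # predict the missing entries and fill them in
    X = X.copy()
    X.loc[~observed, col] = estimator.predict(features[~observed])
    return X
```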

solegalli commented 1 year ago

Hi @Morgan-Sell

The IterativeImputer returns a continuous value to impute NA. But some variables are categorical, so for those, classification would be more suitable than regression.

NaN values are handled during the subsequent rounds of imputation, like the IterativeImputer does.

So I guess the only difference would be that our imputer is able to distinguish when to do regression and when to do classification. Or maybe it could even give the user the option to pass a list of categorical and numerical variables.
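
To make that concrete, here is a rough MICE-style sketch of a round-robin loop that picks the estimator by variable type. The function name, the gradient boosting defaults and the assumption that categorical variables are already integer-encoded are all illustrative, not a proposed implementation:

```python
import pandas as pd
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              HistGradientBoostingRegressor)

def chained_imputation(X: pd.DataFrame, categorical, numerical, n_rounds=5):
    """Round-robin imputation with a model suited to each variable type."""
    # Assumes categorical columns are already integer-encoded.
    X = X.copy()
    missing = X.isna()

    # start from a simple fill so every column can be used as a feature in round 1
    for col in numerical:
        X[col] = X[col].fillna(X[col].median())
    for col in categorical:
        X[col] = X[col].fillna(X[col].mode()[0])

    for _ in range(n_rounds):
        for col in categorical + numerical:
            if not missing[col].any():
                continue
            # classification for categorical targets, regression otherwise
            model = (HistGradientBoostingClassifier() if col in categorical
                     else HistGradientBoostingRegressor())
            features = X.drop(columns=[col])
            model.fit(features[~missing[col]], X.loc[~missing[col], col])
            X.loc[missing[col], col] = model.predict(features[missing[col]])
    return X
```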

Also, I read the papers a while ago, but before drafting this class, it would be good to read the papers on MICE (Multivariate Imputation by Chained Equations) and MissForest.

Morgan-Sell commented 1 year ago

Hi @solegalli,

Yeah, I read a paper on MICE. I saw that R has a MICE package.

I'm going to table this one for the moment to focus on the other transformers. Maybe one of our wonderful collaborators will pick this one up ;)