awslabs / datawig

Imputation of missing values in tables.
Apache License 2.0
475 stars, 69 forks

Can a trained model be used to predict multiple columns with missing data? #175

Open pnoyens opened 2 months ago

pnoyens commented 2 months ago

Hi there,

I'm interested in trying out this library for a specific problem I'm dealing with. However, it is currently unclear to me whether a model can be trained to predict missing values in more than one column of a tabular dataset.

Looking at the documentation, the SimpleImputer has an output_column parameter, which suggests that only one column can be defined as the target. The Imputer interface, however, has a label_encoder_cols parameter, which suggests that multiple columns can be defined for prediction.

Is this a typo, or does it mean that the library can indeed be used to predict multiple columns at a time?

felixbiessmann commented 2 months ago

Hi,

Thanks a lot for your interest in this package. It's not maintained anymore, so for your use case I'd recommend using an actively maintained AutoML package for tabular data such as AutoGluon. Most of the functionality in datawig is available in AutoGluon, and the implementation is actually a lot better.

For the tabular prediction problem with all columns, I'd suggest following this tutorial and wrapping it in a for loop that goes round robin over all columns, treating each column in turn as the prediction target: https://auto.gluon.ai/stable/tutorials/tabular/tabular-quick-start.html
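To make the round-robin idea concrete, here is a minimal sketch of the loop structure only, written in plain Python so it runs without any dependencies. The names impute_round_robin and nearest_neighbour are hypothetical, and the nearest-neighbour function is just a stand-in assumption for whatever model is trained per column (for example, a predictor from the AutoGluon tutorial above); it is not how AutoGluon or datawig work internally.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]
# A predictor receives (training rows, target column, row to fill) and returns a value.
Predictor = Callable[[List[Row], str, Row], float]


def nearest_neighbour(train: List[Row], target: str, row: Row) -> float:
    """Stand-in model (assumption): copy the target value of the closest training row."""

    def distance(other: Row) -> float:
        # Compare only on columns that are observed in both rows.
        shared = [c for c in row
                  if c != target and row[c] is not None and other[c] is not None]
        if not shared:
            return float("inf")
        return sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)

    return min(train, key=distance)[target]


def impute_round_robin(rows: List[Row],
                       predict: Predictor = nearest_neighbour) -> List[Row]:
    """Treat each column in turn as the target and fill its missing cells."""
    original = [dict(r) for r in rows]  # models only ever see observed values
    filled = [dict(r) for r in rows]    # the completed copy that is returned
    for target in rows[0].keys():
        # Rows where the target is observed serve as training data for this column.
        train = [r for r in original if r[target] is not None]
        if not train:
            continue  # nothing to learn from; leave the column untouched
        for i, r in enumerate(original):
            if r[target] is None:
                filled[i][target] = predict(train, target, r)
    return filled
```

In this sketch every per-column model is trained on the originally observed values only, so imputed values are never fed back in as features. Swapping in a real model would mean replacing the predict callable with one that fits on train and predicts for the given row.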

Alternatively, you could try the scikit-learn solutions for imputation, which also support random forest or KNN/hot-deck based imputation: https://scikit-learn.org/stable/modules/impute.html
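As a rough illustration of the scikit-learn route, the sketch below fills missing values in all columns at once, first with KNNImputer and then with IterativeImputer wrapping a random forest. The synthetic data, the 10% missingness rate, and the hyperparameters are assumptions chosen only to keep the example small. Note that IterativeImputer is still marked experimental in scikit-learn and needs the explicit enable import.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come first)
from sklearn.impute import IterativeImputer, KNNImputer

# Synthetic numeric table: the third column depends on the other two.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 2 * X[:, 1]

# Hide roughly 10% of all cells, across every column.
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

# KNN-based imputation: each missing cell is filled from the nearest rows.
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# Random-forest-based imputation: each column is modelled from the others in turn.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
X_rf = rf_imputer.fit_transform(X_missing)
```

Both imputers work on numeric arrays, so categorical columns would need to be encoded first.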

Best wishes, Felix

felixbiessmann commented 2 months ago

... I forgot to mention: if you'd like to use datawig after all (and manage to get it installed), then I guess SimpleImputer.complete would do the job: https://github.com/awslabs/datawig/tree/master?tab=readme-ov-file#quickstart-example