agnesdeng / misle

Multiple imputation through statistical learning

Impute test data using model fit only on trained data #1

Open rexdouglass opened 3 years ago

rexdouglass commented 3 years ago

Thank you for a wonderful tool.

One feature I couldn't quickly figure out from the documentation is whether you can apply a previously trained model to a new dataset.

This is important for training/test splits, where we need to fill missing throughout but we can only learn how from the training data.

agnesdeng commented 3 years ago


Hi Rex, thanks for your comments. We currently use bootstrapping to account for model uncertainty, and XGBoost has several regularisation options that help with overfitting, so we haven't added training/test splits yet.

In the future, we will add this to our package and we are planning to run simulations to compare the performance of the current framework with the one with training/test splits. We reckon training/test splits could be quite useful for multiple imputation using autoencoders.

rexdouglass commented 3 years ago

Just to make sure we're on the same page: I'm talking about training on a subset of the data, then imputing new unseen data using the old trained model lying around. I don't mean anything to do with uncertainty during fitting, or something that would vary between autoencoders and XGBoost. I'm sure there are ways to use that internally to validate how imputation is going, but I mean it in terms of a larger pipeline that I need to validate.

Here's the language miceforest uses to describe that: https://github.com/AnotherSamWilson/miceforest

agnesdeng commented 3 years ago

Hi Rex,

Is my interpretation of what miceRanger does correct?

If we have an incomplete dataset and we obtain 5 imputed datasets using miceRanger, five different imputation models are saved.

When we get a new incomplete dataset, we can use the previously saved 5 models to impute this new dataset and obtain 5 imputed datasets.
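In code, that interpretation corresponds roughly to the following sketch. This is hedged: the `impute()` call and its argument order follow the miceRanger README's "imputing new data" example and are not part of misle itself.

```r
library(miceRanger)
set.seed(1)

# Toy incomplete data: iris with some values knocked out.
df <- iris
df[sample(nrow(df), 15), "Sepal.Width"] <- NA

# Fit: m = 5 imputed datasets, so five sets of random-forest
# models are kept inside the returned miceDefs object.
miceObj <- miceRanger(df, m = 5, verbose = FALSE)

# Apply the saved models to new incomplete rows, without refitting;
# this yields 5 imputed versions of the new data.
new_df <- iris[1:10, ]
new_df[c(2, 5), "Sepal.Width"] <- NA
newImputed <- impute(new_df, miceObj, verbose = FALSE)
```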

rexdouglass commented 3 years ago

Take a dataset with missingness. Partition it into a training split and a test split. Never touch the test split until prediction time.

Use miceRanger/misle to impute missing on train split. Save the model it learned and used. Use its imputed dataset(s) to train and fit whatever other model you're building.

At test time, finally pull out the test set stashed away earlier. Use the saved miceRanger/misle model to impute missing values on this test set. Do not change or retrain that model in any way; predictions only. Use the newly imputed test set to make final predictions with your main ML model.

The number of iterations is incidental and up to the user. What's important here is that when we learn to do imputation, we don't use any test data that might leak knowledge back into our training set that we're going to use to train some other different ML model.

[edit] There is only ever a single model; the 5 produced datasets are just samples from its predictions, so you can propagate uncertainty downstream. It could be 1 draw or 1000, but there is just one trained model.


agnesdeng commented 3 years ago

Hi Rex,

Thanks for your clarifications.

We did some simulations for multiple imputation based on XGBoost. If we want to obtain 20 imputed datasets and use just a single trained model with sampling from predictions, the within-imputation variance may still be all right, but the between-imputation variance would be underestimated (i.e., the 20 imputed datasets are different, but they are too similar). The same thing happens with autoencoders. We haven't investigated whether random forests would have the same issue.

Our main goal is to get the variance correct so that Rubin's rule can be applied for inference. We haven't implemented any feature to impute new data yet. We may investigate this in the future.
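For context on why the between-imputation variance matters: Rubin's rules pool m per-dataset estimates by combining within- and between-imputation variance. A minimal base-R sketch with illustrative numbers (not from any real simulation):

```r
# Point estimates and squared standard errors from m = 5 imputed
# datasets (toy numbers, for illustration only).
est <- c(2.1, 2.3, 1.9, 2.2, 2.0)       # estimate from each imputed dataset
se2 <- c(0.04, 0.05, 0.04, 0.06, 0.05)  # squared SE = within-imputation variance
m <- length(est)

qbar <- mean(est)          # pooled point estimate
W    <- mean(se2)          # average within-imputation variance
B    <- var(est)           # between-imputation variance
Tvar <- W + (1 + 1/m) * B  # total variance under Rubin's rules

# If the m imputations come from one overly similar model, B shrinks,
# Tvar is too small, and confidence intervals become too narrow.
```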

asheetal commented 3 years ago

I have the same issue as Rex. The bulk of imputation methods are "in-band", i.e., they work on a single dataset. This is prohibited in the social sciences, where the test dataset must not affect the training dataset, and the test/train split happens in the first line of the machine learning code. So imputation can be done only on the training dataset, and I can use the resulting information to impute/predict the test dataset, but not vice versa. caret::preProcess works well in this regard: I can fit the preprocessing imputation model on train.df and feed newdata = test.df to predict/impute the test dataset. (But knnImpute and bagImpute are so slow.)
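The caret pattern being described looks like this (medianImpute shown here for speed; knnImpute and bagImpute follow the same fit-on-train, predict-on-test shape; the injected NAs are synthetic, since iris has none):

```r
library(caret)
set.seed(1)

# Introduce some missing values so there is something to impute.
df <- iris
df[sample(nrow(df), 10), "Sepal.Length"] <- NA

idx <- createDataPartition(df$Species, p = 0.9, list = FALSE)
train <- df[idx, ]
test  <- df[-idx, ]

# Learn imputation parameters from the training split only ...
pp <- preProcess(train[, 1:4], method = "medianImpute")

# ... then apply them, unchanged, to both splits.
train_imp <- predict(pp, train[, 1:4])
test_imp  <- predict(pp, test[, 1:4])
```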

I believe Rex and I are talking about a trained model where we can feed any new data and impute it.

Below is a pseudocode suggestion for misle. Most applied machine learning users will pick your R package up if it can work as per this pseudocode.

library(caret)
data(iris)
df <- iris # assuming this dataset has some missing values

idx <- createDataPartition(df$Species, p = .9,
                           list = FALSE,
                           times = 1)

df.train.unimputed <- df[idx, ]
df.test.unimputed  <- df[-idx, ]
rm(df) # very important step to make sure training dataset never sees the test dataset

miModel <- ##### something goes here that is built using mivae/midae/mixgb and using df.train.unimputed only.
df.train.imputed <- miModel(newdata = df.train.unimputed) # or something equivalent
df.test.imputed  <- miModel(newdata = df.test.unimputed)
agnesdeng commented 3 years ago

Hi, Asheetal & Rex. I am currently working on adding the training/test split feature to the R package mixgb. I'll let you know once it's done.

agnesdeng commented 3 years ago

Hi Rex and Asheetal,

Imputing new unseen data using a previously saved imputer with training data is now available for our package mixgb. Please check https://github.com/agnesdeng/mixgb for more details. Thanks again for your feedback.
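Based on the mixgb README at the time of writing, the new feature follows the fit/apply shape sketched below. This is hedged: the `save.models` argument and `impute_new()` name are taken from that README and may differ across mixgb versions; check `?mixgb` and the repository for the current interface.

```r
library(mixgb)
set.seed(1)

# Toy split with synthetic missing values (iris itself has none).
df <- iris
df[sample(nrow(df), 15), "Sepal.Length"] <- NA
train.df <- df[1:120, ]
test.df  <- df[121:150, ]

# Fit the imputer on the training split only, saving the XGBoost models.
mixgb.obj <- mixgb(data = train.df, m = 5, save.models = TRUE)

# Impute the held-out test split with the saved models; no refitting,
# so no information leaks from test back into the imputer.
test.imputed <- impute_new(object = mixgb.obj, newdata = test.df)
```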

asheetal commented 3 years ago

Awesome. I will post feedback in a few days.

asheetal commented 3 years ago

I see newdata implemented for midae and mivae in the comments. Do you have any recommendation on when to choose one implementation over the other? I have a dataset with 2 million observations and over 1000 columns that I wish to impute.

agnesdeng commented 3 years ago

Hi Asheetal,

I am still waiting to see some simulation results regarding the imputation performance of midae vs mivae. Sorry that I can't make any recommendations for now, but I have noticed that performance also generally depends on the hyperparameters.

I've made some functions to generate diagnostic plots so that users can assess whether these hyperparameters give an acceptable performance before using them for future imputations. More features will be added to the package, and I will keep it updated.