SteffenMoritz / imputeR

CRAN R package: Impute missing values based on automated variable selection
GNU General Public License v3.0

Limited range imputation #4

Closed dchiu911 closed 2 years ago

dchiu911 commented 2 years ago

How do we impute a variable which has a limited range (e.g., non-negative values only)?

SteffenMoritz commented 2 years ago

Hello @dchiu911

It depends on what you are trying to do with the imputation.

The easiest solution would be to perform the imputation and then do some manual post-processing, e.g., set all values above/below your bounds to the maximum/minimum allowed value.
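A minimal sketch of this post-processing in base R (the values and bounds below are made up for illustration; here the variable is non-negative, so only a lower bound applies):

```r
# Hypothetical imputed values for a variable that is only defined
# for non-negative numbers; some imputations fell below the bound.
x <- c(-0.7, 2.1, 5.3, -0.1, 4.8)

lower <- 0      # assumed lower bound
upper <- Inf    # no upper bound in this example

# Clamp everything back into the allowed range
x_clamped <- pmin(pmax(x, lower), upper)
x_clamped
```

After clamping, the two out-of-bounds values both sit exactly at the lower bound, which is the "peak at the edge" effect described below.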

This has the disadvantage that, in case you have a considerable number of out-of-bounds occurrences, there will be a peak in the distribution at the max/min values, because suddenly multiple values are set to exactly the max/min.

To mitigate this problem, instead of setting these values exactly to the min/max, you could distribute the out-of-bounds values according to some distribution function at the edges.

You could also try the predictive mean matching (pmm) algorithm, e.g., from the mice R package. This algorithm only imputes values that are already present in the existing dataset. You can imagine it a little bit like a nearest-neighbor algorithm: one of the n closest neighbors to the missing value is selected as the "donor", i.e., as the replacement for the missing value. (It should be obvious that the strength of the algorithm - that only existing values are considered for imputation - is at the same time its downside.)
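A small sketch of the mice/pmm route (the data frame and its columns are toy data; `mice()` and `complete()` are the package's actual entry points). Because pmm only draws from observed donor values, the imputed values cannot fall outside the observed range:

```r
library(mice)

# Toy data with missing values in both columns
set.seed(1)
df <- data.frame(
  age = c(23, 31, NA, 45, 28, NA, 52, 37),
  bmi = c(21.4, NA, 24.8, 30.1, NA, 26.3, 28.9, 23.5)
)

# Predictive mean matching: every imputed value is copied from an
# observed donor, so no out-of-bounds values can appear.
imp <- mice(df, method = "pmm", m = 1, printFlag = FALSE)
completed <- complete(imp)

# Imputed ages stay within the observed range
range(completed$age)
```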

Here is an interesting read about the problem in general: Comparison of methods for imputing limited-range variables: a simulation study ( https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-57 )

I'll also link you an interesting StackExchange discussion about the topic: https://stats.stackexchange.com/questions/78632/multiple-imputation-for-missing-values

There are also some other ways to deal with this (not mentioned here or in the paper). In general, it would make perfect sense to somehow model the bounds as a kind of prior knowledge within the imputation algorithm itself. But so far there is no implementation of this that has 100% convinced me. (Happy to hear about implementations others like.)

I'd say there is no perfect solution to the problem. As mentioned before, which methods make the most sense also depends on your use case.

If you just use the imputations as a preprocessing step for ML predictions, you could try the easy solution of setting everything to min/max. Maybe afterwards try mice/pmm, and maybe also run some other solutions. You then evaluate how the different solutions alter the results of your ML prediction and choose the solution with the best results in your ML model's cross-validation. In this case it's quite easy, because it is logical to pick the preprocessing that gives the best results in the end. I mean, if the predictions are better, even keeping the preprocessed data with the out-of-bounds values could make sense.

The issue is much more complicated/relevant and needs a much closer look if you use the imputation as a preprocessing step (to avoid bias) in a statistical analysis.

dchiu911 commented 2 years ago

Hi @SteffenMoritz thank you for the detailed response. We are using the result of the imputation in a dynamic treatment regime model with survival outcomes (DTRreg::DWSurv()). The function removes observations with missing values in any of the input variables, so we wanted to impute first in order to retain as many cases as possible.

I am using "lassoR" for lmFun and "randomForest" for cFun, and one of the continuous variables had negative imputed values even though it is only defined for non-negative values. There are only 2 of ~400 cases that are negative, so I think if we manually post-processed those cases (e.g., round to the observed minimum) the amount of bias introduced would be low? PMM could also work as well.
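Rounding the negative imputations up to the observed minimum could look roughly like this. (The matrix below stands in for the completed matrix that `imputeR::impute()` returns in its `$imp` element; the column name "biomarker" and all values are hypothetical.)

```r
# Hypothetical completed matrix, as returned in impute()$imp, where
# "biomarker" is only defined for non-negative values
imputed <- cbind(
  biomarker = c(0.2, 1.5, -0.03, 0.8, -0.11, 1.1),
  covar     = c(5.1, 4.7, 5.6, 4.9, 5.2, 5.0)
)
# Which rows were NA before imputation (hypothetical bookkeeping)
was_missing <- c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

# Observed (pre-imputation) minimum of the bounded variable
obs_min <- min(imputed[!was_missing, "biomarker"])

# Round negative imputed values up to the observed minimum
fix <- was_missing & imputed[, "biomarker"] < 0
imputed[fix, "biomarker"] <- obs_min
```

With only 2 of ~400 cases affected, this touches a very small fraction of the data.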

SteffenMoritz commented 2 years ago

If you have enough cases anyway and the missing data is MCAR (missing completely at random), using only the complete cases (complete case analysis) could also be an idea.

But beware: if the missing values are not MCAR, e.g., there are more missing values for patients with certain characteristics, you introduce bias by only looking at complete cases.
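In base R, complete case analysis is just listwise deletion (toy data frame for illustration):

```r
df <- data.frame(
  age     = c(23, NA, 45, 28, NA, 52),
  outcome = c(1.2, 0.8, NA, 2.1, 1.7, 0.9)
)

# Keep only rows with no missing values in any column
df_cc <- df[complete.cases(df), ]
nrow(df_cc)   # 3 of the 6 rows survive
```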

So imputation itself seems reasonable for your problem. About the out-of-bounds data - yes, I guess if it's only 2/400 cases it is justifiable to just post-process them manually. Maybe also take a look at these two cases to see if there is something special about them in comparison to the others.

In general I can recommend single-imputation methods, as provided by packages like imputeR, missForest, and others, more or less without reservation as preprocessing for ML algorithms or when you have very few missing values.

For summary statistics, survey analysis, etc., multiple-imputation methods might also be worth a look. The idea behind this is that the imputation comes with some uncertainty, since the imputed values are of course only estimates. For example, if you impute age with imputeR you fill in, e.g., 28 as a best guess, but judging from the data you could perhaps rather say "the age is probably around 23-33". You have some kind of probability distribution for your imputed values - with 28 the most likely, but it could also be a little bit different.

To give an estimate of this uncertainty you can use multiple-imputation methods. (e.g., with mice-pmm you can do multiple imputation)

The package will basically give you back multiple different imputed datasets, which are drawn from this probability distribution. So after using these methods you have, e.g., 20 different imputed datasets.

Now you could perform your analysis with DTRreg::DWSurv() twenty times, once for each of these 20 datasets. After having 20 results, you can compare how much these results actually differ (this is basically the uncertainty introduced by the imputation).
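As a sketch of this workflow with mice (the data frame is toy data, and a simple summary statistic stands in for the real `DTRreg::DWSurv()` call, which would need a full survival setup):

```r
library(mice)
set.seed(7)

df <- data.frame(
  x = c(1.2, NA, 2.5, 3.1, NA, 2.0, 1.8, 2.9),
  y = c(0.5, 1.1, NA, 2.0, 1.4, NA, 1.0, 1.9)
)

# 20 completed datasets via predictive mean matching
imp <- mice(df, method = "pmm", m = 20, printFlag = FALSE)

# Run the analysis once per completed dataset; a mean serves as a
# stand-in here for the actual DTRreg::DWSurv() analysis
results <- sapply(1:20, function(i) mean(complete(imp, i)$x))

# The spread of the 20 results reflects the imputation uncertainty
c(pooled = mean(results), sd = sd(results))
```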

Of course this also complicates things considerably ... and plenty of people completely misunderstand why you are doing the multiple imputation. Additionally, you also have to think about how you pool these 20 results back together and present them as end results (this differs depending on the analysis you are doing). Sometimes it makes sense to report the average of the results and give the standard deviation. Sometimes there is no real difference anyway and it makes sense to just report one result and state that all other datasets were similar. For other analyses, other aggregates make sense ...