NIEHS / PrestoGP

Penalized Regression on Spatiotemporal Outcomes using Gaussian Processes a.k.a. PrestoGP
https://niehs.github.io/PrestoGP/

LOD imputation magnifies bias when the proportion of censored data is high #71

Closed: ericbair-sciome closed this issue 1 week ago

ericbair-sciome commented 1 month ago

The LOD imputation method fits a model to the observed data and then uses that model to estimate the values of the unobserved data. In our experiments, this seems to produce approximately unbiased results when an unbiased regression method is used to impute the missing values. The issue is that PrestoGP uses a (biased) glmnet regression model, so the imputed estimates for the missing data are also biased, and when we fit a model to the fully imputed data set, it is even more biased.
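To make the mechanism concrete, here is a minimal toy sketch of that imputation scheme (simulated data and plain glmnet calls, not the PrestoGP internals): the penalized fit to the observed rows shrinks the coefficients, so the predictions used to fill in the censored rows are shrunken too, and the refit on the completed data inherits that bias.

```r
library(glmnet)

set.seed(1)
n <- 500; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(rep(2, 5), rep(0, p - 5))
y <- drop(X %*% beta_true + rnorm(n))

lod <- quantile(y, 0.6)        # censor the lowest 60% of responses
observed <- y > lod

# fit a penalized (and therefore biased) model to the observed rows only
cvfit <- cv.glmnet(X[observed, ], y[observed])

# impute the censored responses from the shrunken fit
y_imputed <- y
y_imputed[!observed] <- drop(predict(cvfit, newx = X[!observed, ],
                                     s = "lambda.min"))

# refitting on the completed data inherits (and can magnify) the shrinkage bias
refit <- cv.glmnet(X, y_imputed)
```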

Honestly, I am not sure if this is something we even need to worry about. The whole reason we are using a biased regression model is that we believe the reduction in variance we get from penalized regression will offset the biased regression estimates. On real-world data sets, the biased imputation estimates are probably the best we can do. And we can always adjust the lambda parameter if we want to control the tradeoff between bias and variance.

Still, I thought it was worth noting this issue as a possible future research area. We might be able to reduce or eliminate the bias by choosing the lambda parameter more carefully and/or using some version of the relaxed lasso (see #24).
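As a rough illustration of the relaxed-lasso idea (continuing the toy example above; glmnet >= 3.0 also ships a built-in `relax = TRUE` option), one could let the lasso choose the variables and then refit the selected set without a penalty, so the imputed values come from less-shrunken coefficients:

```r
# indices of the variables the lasso kept (drop the intercept)
sel <- which(as.vector(coef(cvfit, s = "lambda.min"))[-1] != 0)

# unpenalized refit on the selected variables, observed rows only
relaxed_fit <- lm(y[observed] ~ X[observed, sel, drop = FALSE])

# imputations from the less-shrunken coefficients
y_imp_relaxed <- drop(cbind(1, X[!observed, sel, drop = FALSE]) %*%
                        coef(relaxed_fit))
```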

ericbair-sciome commented 3 weeks ago

@kyle-messier do we have a sense of what proportion of the data is likely to be censored in the pesticide data? As I recall, you mentioned at one point that it might be up to 80-90% for some outcomes. In my experiments, the bias is fairly small with up to 50-60% of the data censored, but it starts to get ugly once you are at 80-90%. I have some ideas about how to possibly improve that, but if 50% or less of the data are censored, then what we have is probably fine.
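A sketch of the kind of check behind those numbers (continuing the toy example from the first comment; not the actual benchmark code) is to impose increasing censoring proportions and track how far the refit coefficients drift from the truth:

```r
for (prop in c(0.5, 0.6, 0.8, 0.9)) {
  obs_p <- y > quantile(y, prop)                 # censor the lowest prop
  fit_p <- cv.glmnet(X[obs_p, ], y[obs_p])
  y_p <- y
  y_p[!obs_p] <- drop(predict(fit_p, newx = X[!obs_p, ], s = "lambda.min"))
  refit_p <- cv.glmnet(X, y_p)
  # true signal coefficients are 2; look at the average drift
  bias_p <- mean(as.vector(coef(refit_p, s = "lambda.min"))[2:6] - 2)
  cat(sprintf("%.0f%% censored: mean bias on the true signals = %.3f\n",
              100 * prop, bias_p))
}
```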

kyle-messier commented 3 weeks ago

@ericbair-sciome The current pesticides are 68-92% censored, which puts them all in the high to very high range. I can investigate using other pesticides. I have a pipeline that should make getting the dataset up to speed relatively tractable.

ericbair-sciome commented 1 week ago

I think this issue is mostly fixed in the latest version. In my experiments on simulated data, there was practically no bias for up to 50-60% censored data, and usually the bias was 10% or less even with 90% censored data. And the running time isn't too bad, either. We may have finally solved this censoring puzzle. Woo-hoo!

The one caveat is that the results seem to depend heavily on the initial beta estimates. If you choose bad initial estimates, the imputation procedure can converge extremely slowly (or not at all). I have some code to choose initial betas if they are not specified by the user, and it generally works pretty well in my simulations. But who knows whether those simulations are an accurate approximation of reality. On the pesticide data, we should probably check how much the results change for different initial beta estimates. But that should be easy enough to do.
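For illustration, one plausible default initializer along these lines (a guess at the general idea, not the actual PrestoGP code; names are hypothetical and it continues the toy example from the first comment) is to start the imputation loop from a penalized fit on the observed rows, and compare that against a naive all-zero start as a sensitivity check:

```r
# hypothetical initializer: use coefficients from a fit to the observed
# (above-LOD) rows as the starting betas for the imputation loop
init_beta_from_observed <- function(X, y, observed) {
  fit0 <- cv.glmnet(X[observed, , drop = FALSE], y[observed])
  as.vector(coef(fit0, s = "lambda.min"))[-1]   # drop the intercept
}

beta_start_fit  <- init_beta_from_observed(X, y, observed)
beta_start_zero <- rep(0, ncol(X))
# re-run the imputation from both starting points and compare the results
```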

kyle-messier commented 1 week ago

@ericbair-sciome This is great work! One other question on current functionality (via issue #29): given the importance of the beta start values, are we still fitting one M-Gaussian model for all of the outcomes, or are we allowing separate fits of the betas for each outcome? Separate fits might improve the model fit a lot, especially if the outcomes are quite different in terms of their betas. What do you think?

ericbair-sciome commented 1 week ago

Sorry for the slowish reply. Currently, we are fitting a single univariate glmnet model to the "super matrix" of the X's (basically, it's a block diagonal matrix where each block corresponds to the X's for a given outcome). We had originally planned to do a multivariate glmnet model, but it isn't obvious how to do the iid transformation in that case.
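For readers of the thread, a rough sketch of that super-matrix construction (toy placeholders, not the PrestoGP implementation): the per-outcome predictor matrices sit on the block diagonal, the responses are stacked, and one univariate glmnet model is fit to the whole thing, so each outcome gets its own block of coefficients within a single joint fit.

```r
library(glmnet)
library(Matrix)

# toy predictors/responses for two outcomes
X1 <- matrix(rnorm(100 * 10), 100, 10)
X2 <- matrix(rnorm(120 * 10), 120, 10)
y1 <- rnorm(100)
y2 <- rnorm(120)

X_super <- bdiag(X1, X2)   # block-diagonal "super matrix" (sparse)
y_super <- c(y1, y2)       # stacked responses

# one univariate penalized fit across all outcomes at once
fit_super <- cv.glmnet(X_super, y_super)

# the coefficient vector splits into per-outcome blocks
beta_blocks <- split(as.vector(coef(fit_super, s = "lambda.min"))[-1],
                     rep(1:2, times = c(ncol(X1), ncol(X2))))
```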

I experimented with fitting a separate glmnet model to each outcome, but I found that it did more harm than good (see #33). From what I could tell, any improvement from fitting the models separately was more than offset by the increase in variance that comes from fitting each model on only a fraction of the data. As I mentioned in that issue, maybe we can revisit this at some point to see if there are any situations where it is worth doing, but my inclination is to leave it as is for the time being.

kyle-messier commented 1 week ago

Sounds good @ericbair-sciome - thank you for the explanation, and apologies if you had to repeat yourself.