NIEHS / PrestoGP

Penalized Regression on Spatiotemporal Outcomes using Gaussian Processes a.k.a. PrestoGP
https://niehs.github.io/PrestoGP/

Imputation functionality #21

Open ericbair-sciome opened 7 months ago

ericbair-sciome commented 7 months ago

Functions to impute data (due to limit of detection or otherwise) need to be added to the package.

kyle-messier commented 2 months ago

@brian-bk22 @ericbair-sciome The overall functionality was added with PR #56. For the pesticide work, we need to allow for variable LOD. Currently it appears that each outcome can only have 1 LOD:

Error in check_input(model, Y, X, locs, center.y, impute.y, lod) : Each lod must have length 1

Also, we don't have to implement this now if it is not trivial, but I think a more standard or straightforward approach to implementing an LOD would be to have a single value for Y and then an indicator for whether that value is observed or is a limit of detection.
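A minimal sketch of the value-plus-indicator layout described above, in base R. The helper `encode_lod` and the column names `y` and `is_lod` are hypothetical and are not part of PrestoGP; the sketch also assumes that every `NA` marks a non-detect rather than a value missing for some other reason.

```r
# Hypothetical helper (not a PrestoGP function): turn measurements that use NA
# for non-detects, plus a per-observation LOD, into a value/indicator pair.
encode_lod <- function(y, lod) {
  stopifnot(length(y) == length(lod))
  below <- is.na(y)               # assumption: NA means "below the LOD"
  data.frame(
    y = ifelse(below, lod, y),    # observed value, or the LOD for a non-detect
    is_lod = below                # TRUE where y holds a detection limit
  )
}

y_raw <- c(0.8, NA, 1.5, NA)
lod <- c(0.1, 0.1, 0.2, 0.2)      # the LOD may differ between observations
encoded <- encode_lod(y_raw, lod)
```

With this layout a variable LOD needs no separate argument: each non-detect row simply carries its own limit in `y`.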

ericbair-sciome commented 2 months ago

I had been meaning to ask you about that. I thought you had said something about that in one of our meetings, but I wasn't sure. I'll go ahead and change this. It should be an easy fix.

By the way, I have finished my more detailed testing of the imputation algorithm. It seems to work very well for MAR missingness, but there is some bias in the LOD case. (And it gets steadily worse as the proportion of missing data increases.) I'm hoping to send a new version to Shail tonight. (Everything is done other than documentation at this point plus the aforementioned change, which should be easy.) My plan was to fix the show/accessor methods after that (since that is important for model interpretation) and then double back to see if we can figure out a way to improve the LOD imputation.

kyle-messier commented 2 months ago

@ericbair-sciome Thanks for the quick reply and fix. Good to hear that it will be an easy fix. As for the effectiveness of the imputation, it is also good to hear that it works in the MAR case. It is expected that the LOD case will get worse as the proportion of missingness increases. One thing I mentioned in an email was the idea of multiple imputation. I'm developing the pesticide analysis through the targets package, which makes mapping over parameters easy and reproducible. We'll see how the analysis turns out, but we could discuss the idea of fitting k different PrestoGP models, where each imputation is allowed to vary, and averaging over those.
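A rough sketch of the multiple-imputation idea, in base R on simulated data. Everything here is a stand-in: `lm` takes the place of a PrestoGP fit, the uniform draw in the hypothetical `impute_once` is only a placeholder for a real imputation step, and pooling is done by plain averaging of the k coefficient vectors.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- exp(0.5 * x + rnorm(n, sd = 0.3))  # simulated positive outcome
lod <- 0.6
is_lod <- y < lod                        # values we pretend were not detected

# Placeholder imputation (an assumption, not PrestoGP's algorithm):
# draw each non-detect uniformly between 0 and the LOD.
impute_once <- function(y, is_lod, lod) {
  y[is_lod] <- runif(sum(is_lod), min = 0, max = lod)
  y
}

k <- 20
# One column of coefficients per imputed data set.
coefs <- sapply(seq_len(k), function(i) {
  y_imp <- impute_once(y, is_lod, lod)
  coef(lm(y_imp ~ x))                    # lm stands in for the real model fit
})

pooled <- rowMeans(coefs)                # average the k point estimates
```

The spread of the columns of `coefs` gives a sense of how much the estimates depend on the imputed values, which is the variation a single imputation hides.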

As an example, here is the current working version of the visualization of the targets pipeline. It covers pre-processing, exploratory analysis, testing on a vanilla glmnet model, and a sub-sampled dataset for testing PrestoGP, which is where it is currently failing:

[Image: visualization of the targets pipeline]

kyle-messier commented 2 months ago

@ericbair-sciome Also, a question about the scaling input in prestogp_fit: does it need to be a list with one entry per outcome, or would something like c(1,1,2) be sufficient for the multivariate case? I think the latter is fine. I can't imagine a scenario where you would let the outcomes differ in whether they use a spatial or a spatiotemporal model.

ericbair-sciome commented 2 months ago

No, there is one scaling input for all outcomes for basically the reason you just said. :)

ericbair-sciome commented 2 months ago

As an FYI, in the latest version, each outcome can have a separate LOD. Let me know if the syntax is unclear and I will try to improve the documentation.

I'm going to keep this issue open for now. While imputation is implemented in the current release, it seems to be biased when the percentage of missing data due to LOD is very high. We are working on some alternative approaches that seem to work better. If all goes well, we should have an improved imputation algorithm implemented in the next few weeks.