Nanostring-Biostats / DSPPlugins

Repository for DSP Plugins

Batch Correction Plugin #16

Open JasonWReeves opened 3 years ago

JasonWReeves commented 3 years ago

Inputs

Standard controls

Requirements

JasonWReeves commented 3 years ago

Anticipated steps for data processing:

  1. Create a model matrix for the annotations and test for collinearity
  2. Calculate LMM based on provided annotations and save residuals as secondary dataset
  3. Calculate PCA on original and new dataset
  4. Run ANOVA on PCA vs batch annotations
  5. Plot PCs and note ANOVA p-values
  6. Save the new dataset, PCA plots, and QC metrics
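
Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the plugin's code: it swaps the LMM for an ordinary least-squares fit with batch as a fixed effect, and the helper names (`one_hot`, `residualize`, `pca_scores`) and the annotation names `slide` and `status` are hypothetical.

```python
import numpy as np

def one_hot(labels):
    """Dummy-code a categorical annotation, dropping the first level as reference."""
    levels = sorted(set(labels))[1:]
    return np.array([[1.0 if lab == lev else 0.0 for lev in levels] for lab in labels])

def residualize(expr, design):
    """Regress every feature (column of expr) on the design; return the residuals."""
    beta, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ beta

def pca_scores(expr, n_components=3):
    """PC scores of centered, unit-variance-scaled data, computed via SVD."""
    z = (expr - expr.mean(axis=0)) / expr.std(axis=0, ddof=1)
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Synthetic data (assumption): 12 segments x 20 features on a log2-like scale.
rng = np.random.default_rng(0)
slide = ["slide1"] * 6 + ["slide2"] * 6      # batch annotation
status = ["tumor", "normal"] * 6             # biological annotation
expr = rng.normal(8.0, 1.0, size=(12, 20))
expr[6:] += 2.0                              # additive batch effect on slide2

# Step 1: model matrix for all annotations; a rank deficit signals exact collinearity.
intercept = np.ones(len(slide))
design = np.column_stack([intercept, one_hot(slide), one_hot(status)])
full_rank = np.linalg.matrix_rank(design) == design.shape[1]

# Step 2: residuals after removing the batch term (OLS stand-in for the LMM).
corrected = residualize(expr, np.column_stack([intercept, one_hot(slide)]))

# Step 3: PCA on the original and on the corrected data.
pcs_before = pca_scores(expr)
pcs_after = pca_scores(corrected)
```

In this toy setup the two slides separate along PC1 before correction and no longer do afterwards, which is what steps 4 and 5 would then quantify.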

eveilyeverafter commented 3 years ago

I think we need to gather the requirements for this issue.

JasonWReeves commented 3 years ago

@dnadave can you review and harden any of the above requirements, and provide any code for the collinearity testing that you think would be useful?

eveilyeverafter commented 3 years ago

For scope, should this plugin work with protein, RNA, and protein NGS or a subset of these?

dnadave commented 3 years ago

I believe we should start with just RNA for now.

eveilyeverafter commented 3 years ago

I have a question regarding the collinearity testing for batch correction. There are a few different ways we could go about this.

dnadave commented 3 years ago

Biostats was just discussing this with the software team this afternoon. Having a collinearity check function would be very handy in general, with the batch correction function as a special case. The output table would need to group the levels of each factor together, since you will get a VIF for each level included in the model (remember, you have to drop one level of each factor when calculating VIFs).

I think this approach would be really handy to have in general and very important to check before you correct for a batch factor.
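
As a rough sketch of what such a check could look like: each factor is dummy-coded with its first level dropped, and the VIF of column j is 1 / (1 - R²_j), where R²_j comes from regressing that column on all the others. The function names `design_columns` and `vifs` and the annotation names are hypothetical, not taken from the plugin.

```python
import numpy as np

def design_columns(factors):
    """Dummy-code each factor (dict of name -> labels), dropping its first level."""
    names, cols = [], []
    for factor, labels in factors.items():
        for level in sorted(set(labels))[1:]:
            names.append(f"{factor}:{level}")
            cols.append([1.0 if lab == level else 0.0 for lab in labels])
    return names, np.array(cols).T

def vifs(columns):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    n, p = columns.shape
    out = []
    for j in range(p):
        y = columns[:, j]
        X = np.column_stack([np.ones(n), np.delete(columns, j, axis=1)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        # A perfect fit means exact collinearity; report it as an infinite VIF.
        out.append(np.inf if r2 >= 1.0 - 1e-12 else 1.0 / (1.0 - r2))
    return np.array(out)

# Balanced toy design: status is spread evenly across the three slides.
names, X_balanced = design_columns({
    "slide": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "status": ["tumor", "normal"] * 6,
})
vif_balanced = vifs(X_balanced)

# Confounded toy design: every tumor sample sits on slide A.
_, X_confounded = design_columns({
    "slide": ["A"] * 4 + ["B"] * 4,
    "status": ["tumor"] * 4 + ["normal"] * 4,
})
vif_confounded = vifs(X_confounded)
```

In the balanced case the two slide columns get a VIF above 1 only because levels of the same factor are correlated with each other, which is one reason to report the levels of a factor together. In the confounded case both VIFs are infinite.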

On Tue, Mar 9, 2021 at 2:50 PM Tyler Hether @.***> wrote:

I have a question regarding the collinearity testing for batch correction. There are a few different ways we could go about this.

  • We can compare the batching factor to one or more factors of biological interest. For example, we can compare values straight from the annotations data (e.g., slide ~ tumor_status where slide is the batching factor). If the user provides multiple biological factors of interest we can compute VIFs.
  • Another approach would be to run PCA on the expression matrix (centered, scaled) and compare the first few PC values (independent variables) to one or more biologically interesting factors (dependent variables). We could compute VIFs that way to see if any biologically interesting factors are collinear with batch. We would do this twice: once with the log2 Q3 expression matrix and once with the batch-corrected expression matrix. Then we can compare those two models to check 1) whether we have collinearity and 2) whether it improves or gets worse after applying batch correction.
  • Other possibilities I haven't thought of?

eveilyeverafter commented 3 years ago

VIFs depend only on the independent variables, so they are the same for the models before and after batch correction. In that spirit, however, I have added a correlation cutoff that flags the user if batch is correlated with one or more of their biological factors of interest. If a given biologically interesting factor is completely correlated with batch, the user gets an additional warning.
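
The thread does not say which correlation measure or cutoff the plugin uses. As one possible illustration, the sketch below uses Cramér's V between two categorical annotations; the 0.7 cutoff and the function names are assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def cramers_v(a, b):
    """Cramér's V between two categorical variables (0 = independent, 1 = fully associated)."""
    n = len(a)
    rows, cols = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    chi2 = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else float("nan")

def flag_confounding(batch, factor, cutoff=0.7):
    """Return 'confounded', 'warning', or 'ok' (cutoff is a hypothetical default)."""
    v = cramers_v(batch, factor)
    if v >= 1.0 - 1e-9:
        return "confounded"
    if v >= cutoff:
        return "warning"
    return "ok"

batch = ["slide1"] * 4 + ["slide2"] * 4
balanced = ["tumor", "normal"] * 4              # evenly spread across slides
partial = ["tumor"] * 3 + ["normal"] * 4 + ["tumor"]
nested = ["tumor"] * 4 + ["normal"] * 4         # one status per slide

flags = [flag_confounding(batch, f) for f in (balanced, partial, nested)]
```

With these toy annotations only the fully nested factor triggers the stronger flag; the partially associated one stays below the assumed cutoff.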

The plugin computes a series of linear models comparing PC scores (the first 3) with batch and generates P-values and adjusted R-squared values before and after batch correction.
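
For a single PC and a categorical batch factor, a linear model of PC score on batch is equivalent to a one-way ANOVA, so the before/after comparison can be sketched as below. The function name `batch_fit` and the toy scores are made up for illustration. The P-value is left out to keep the sketch dependency-free; it could be obtained from the F statistic with an F-distribution survival function such as `scipy.stats.f.sf`.

```python
def batch_fit(scores, batch):
    """Fit score ~ batch (one-way ANOVA); return R^2, adjusted R^2 and the F statistic."""
    n = len(scores)
    groups = {}
    for s, b in zip(scores, batch):
        groups.setdefault(b, []).append(s)
    k = len(groups)                                   # number of batch levels
    grand_mean = sum(scores) / n
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    ss_within = sum(
        sum((s - sum(g) / len(g)) ** 2 for s in g) for g in groups.values()
    )
    ss_between = ss_total - ss_within
    r2 = ss_between / ss_total
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    if ss_within > 0:
        f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    else:
        f_stat = float("inf")
    return {"r2": r2, "adj_r2": adj_r2, "f": f_stat, "df": (k - 1, n - k)}

batch = ["a", "a", "a", "b", "b", "b"]
pc1_before = [2.1, 1.9, 2.0, -2.0, -1.9, -2.1]   # PC1 tracks batch
pc1_after = [0.1, -0.1, 0.0, 0.0, 0.1, -0.1]     # batch signal removed

fit_before = batch_fit(pc1_before, batch)
fit_after = batch_fit(pc1_after, batch)
```

A large drop in adjusted R-squared from `fit_before` to `fit_after` is the pattern one would hope to see after a successful correction. Note that adjusted R-squared can be negative when batch explains essentially nothing.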

I would like to test it internally with other datasets before submitting a PR.

eveilyeverafter commented 2 years ago

Returning to the batch correction plug-in. Taking the above considerations into account, I'm proposing the following requirements and specifications. Main changes are indicated in boldface.

The shifted residuals change above is the biggest change in the list and was reviewed by Lei before she left. Instead of returning the residuals from each feature-based model, it multiplies the residuals by the intercept. This way, the magnitude of the values is more similar to the expression value for that feature and the values are no longer centered at 0. This doesn't quantitatively affect PCA metrics used in the plug-in itself but will affect any downstream analyses. For example, cell deconvolution cannot handle data centered at 0.
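
To make the idea concrete, here is a small sketch for one feature with a single batch factor. It rests on two assumptions that the thread does not confirm: the values are on the log2 scale, where adding the intercept to the residuals corresponds to multiplying by it on the linear scale, and the intercept is taken to be the feature's overall mean. The function name is hypothetical.

```python
def shifted_residuals(log2_values, batch):
    """Remove per-batch means, then shift the residuals back by the overall mean."""
    groups = {}
    for v, b in zip(log2_values, batch):
        groups.setdefault(b, []).append(v)
    batch_means = {b: sum(g) / len(g) for b, g in groups.items()}
    intercept = sum(log2_values) / len(log2_values)   # assumed intercept: overall mean
    residuals = [v - batch_means[b] for v, b in zip(log2_values, batch)]
    shifted = [r + intercept for r in residuals]
    return residuals, shifted, intercept

log2_values = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
batch = ["a", "a", "a", "b", "b", "b"]
residuals, shifted, intercept = shifted_residuals(log2_values, batch)

# On the linear scale the same shift is a multiplication by 2**intercept.
linear_shifted = [2.0 ** r * 2.0 ** intercept for r in residuals]
```

The residuals are centered at 0, while the shifted values sit around the feature's original expression level with the batch difference removed.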

Adding @maddygriz.

JasonWReeves commented 2 years ago

@tylerhether these look good to me