Batch Correction Plugin

JasonWReeves commented 3 years ago

Inputs

[x] Annotation Column containing batch information
[x] Secondary test annotation column to confirm batch correction won't confound test variables
[x] threshold for colinearity test

Standard controls

[x] font controls
[x] color controls

Requirements

[x] The plugin shall perform batch correction based on an annotation column provided by the user
- [x] spec: the plugin shall use log2 normalized count data
- [x] spec: the plugin shall use a mixed effect model to estimate batch specific effects
- [x] spec: the batch corrected data shall consist of the residuals from the model based on the annotation provided
[x] The plugin shall output the batch corrected residual data as a new dataset into the DSP DA environment
[x] The plugin shall output an excel document with multiple tabs representing QC metrics for the analysis
[x] The plugin shall QC whether the test variables are confounded by the batch variable
- [x] spec: the model matrix for the test shall be tested for colinearity
- [x] spec: the test shall output a tab within the excel showing the colinearity and flagging test variables that are confounded
[x] The plugin shall QC the dataset before and after batch correction to confirm removal of batch effects
- [x] spec: the plugin shall use PCA for QC
- [x] spec: The plugin shall calculate either the first 3 PCs or PCs that capture more than 5% of variance each
- [x] spec: The plugin shall use an ANOVA of the above PCs to determine if any are significantly associated with batch
- [x] spec: the QC tab shall list the p-values of the ANOVA before and after batch correction, as well as the % variance explained
[x] The plugin shall output a PCA plot showing the first 2 PCs before and after correction
- [x] spec: the plugin shall output a PCA colored by batch
- [x] spec: the plot shall contain on the left a graph before batch correction, and on the right a graph after batch correction
- [x] spec: the plots shall have titles indicating they are before or after batch correction
- [x] spec: The axes of the graphs should note the PC # and % Variance
[x] The plugin shall have font (family and size) and color definitions coded so that they may be changed by the user

JasonWReeves commented 3 years ago

Anticipated steps for data processing:

Create a model matrix for the annotations and test for colinearity
Calculate LMM based on provided annotations and save residuals as secondary dataset
Calculate PCA on original and new dataset
Run ANOVA on PCA vs batch annotations
Plot PCs and note ANOVA p-values
Save the new dataset, PCA plots, and QC metrics

eveilyeverafter commented 3 years ago

I think we need to gather the requirements for this issue.

JasonWReeves commented 3 years ago

@dnadave can you review and harden the above requirements, and provide any code for the colinearity testing that you think would be useful?

eveilyeverafter commented 3 years ago

For scope, should this plugin work with protein, RNA, and protein NGS or a subset of these?

dnadave commented 3 years ago

I believe we should start with just RNA for now.

eveilyeverafter commented 3 years ago

I have a question regarding the collinear testing for batch correction. There are a few different ways we could go about this.

We can compare the batching factor to one or more factors of biological interest. For example, we can compare values straight from the annotations data (e.g., slide ~ tumor_status where slide is the batching factor). If the user provides multiple biological factors of interest we can compute VIFs.
Another approach would be to run PCA on the expression matrix (centered, scaled) and compare the first few PC values (independent variable) to one or more biologically interesting factors (dependent variables). We could compute VIFs that way to see if any biologically interesting factors are collinear with batch. We would do this twice: with the log2 Q3 expression matrix and with the batch-corrected expression matrix. Then we can compare those two models to check if 1) we have collinearity and 2) if it improves or gets worse after applying batch correction.
Other possibilities I haven't thought of?

dnadave commented 3 years ago

Biostats was just discussing this with the software team this afternoon. Having a colinearity check function would be very handy in general, with the batch correction function as a special case. The output table would need to group the levels of each factor together, because you will get a VIF for each level included in the model (remember, you'll have to drop one level for each factor when calculating VIFs).

I think this approach would be really handy to have in general and very important to check before you correct for a batch factor.
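
The point about dropping one level per factor can be illustrated with a small sketch. This is a hypothetical, standard-library Python illustration and not the plugin's code; the helper name `dummy_code` and the example slide labels are assumptions made for this example. It shows that when every level is kept, the indicator columns sum to the intercept column, which is exactly the perfect collinearity that makes VIFs undefined.

```python
def dummy_code(values, drop_first=True):
    """Treatment-code a categorical factor into 0/1 indicator columns.

    Hypothetical helper for illustration only. With drop_first=True the
    first level (in sorted order) becomes the reference and gets no column.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in kept}


# Made-up batch annotation: three slides, two segments each.
slide = ["s1", "s1", "s2", "s2", "s3", "s3"]

# Keeping every level: each row has exactly one indicator set, so the
# columns add up to a column of ones, i.e. the intercept column.
full = dummy_code(slide, drop_first=False)
row_sums = [sum(col[i] for col in full.values()) for i in range(len(slide))]

# Dropping the reference level removes that exact linear dependence.
reduced = dummy_code(slide)
reduced_sums = [sum(col[i] for col in reduced.values()) for i in range(len(slide))]

print(row_sums)       # all ones: redundant with the intercept
print(reduced_sums)   # zeros for the reference level "s1"
```

With the reduced coding, a VIF can then be computed for each remaining indicator column and reported grouped by the factor it belongs to.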

eveilyeverafter commented 3 years ago

VIFs depend only on the independent variables (the design matrix), not on the expression values, so they are identical for the before- and after-correction models. In that spirit, however, I have added a correlation cutoff that flags when batch is correlated with one or more of the user's biological factors of interest. If a biologically interesting factor is completely correlated with batch, the user gets an additional warning.
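
A cutoff-based flag of this kind could look roughly like the sketch below. The thread does not say which correlation measure the plugin uses, so this example assumes Cramér's V as one common measure of association between two categorical annotations. The function names, the status labels, and the default `threshold=0.7` are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V between two categorical variables (0 = none, 1 = complete)."""
    n = len(x)
    x_counts, y_counts = Counter(x), Counter(y)
    joint = Counter(zip(x, y))
    chi2 = 0.0
    for xv, nx in x_counts.items():
        for yv, ny in y_counts.items():
            expected = nx * ny / n
            observed = joint.get((xv, yv), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def flag_confounding(batch, factor, threshold=0.7):
    """Return 'confounded', 'flagged', or 'ok' (labels are assumptions)."""
    v = cramers_v(batch, factor)
    if v >= 1.0 - 1e-9:
        return "confounded"   # factor is completely determined by batch
    if v >= threshold:
        return "flagged"      # association exceeds the user's cutoff
    return "ok"


# Made-up annotations for illustration.
batch = ["s1", "s1", "s2", "s2"]
print(flag_confounding(batch, ["T", "T", "N", "N"]))  # one status per slide
print(flag_confounding(batch, ["T", "N", "T", "N"]))  # balanced design
```

In the first call every slide carries a single tumor status, so batch correction would also remove the biological signal; the second design is balanced and passes the check.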

The plugin computes a series of linear models comparing PC scores (the first 3) with batch and generates P-values and adjusted R-squared values before and after batch correction.

I would like to test it internally with other datasets before submitting a PR.
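
For reference, the per-PC linear model on batch described above amounts to a one-way ANOVA. The sketch below is an illustration under assumptions and not the plugin's implementation: it takes the scores of one PC plus the batch labels and returns the F statistic and adjusted R-squared. The function name and the example scores are made up. The p-value is left out because the standard library has no F distribution; it would come from the upper tail of an F distribution with (k - 1, n - k) degrees of freedom, for example via `scipy.stats.f.sf`.

```python
from statistics import fmean


def anova_pc_vs_batch(scores, batch):
    """One-way ANOVA of PC scores on batch: returns (F, adjusted R^2).

    Assumes at least two batches and some within-batch variation.
    """
    n = len(scores)
    groups = {}
    for s, b in zip(scores, batch):
        groups.setdefault(b, []).append(s)
    k = len(groups)

    grand = fmean(scores)
    ss_total = sum((s - grand) ** 2 for s in scores)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups.values())
    ss_within = ss_total - ss_between

    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    r2 = ss_between / ss_total
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    return f_stat, adj_r2


# Made-up PC1 scores before and after correction, two batches.
batch = ["a", "a", "b", "b"]
f_before, adj_before = anova_pc_vs_batch([1.0, 2.0, 5.0, 6.0], batch)
f_after, adj_after = anova_pc_vs_batch([-0.5, 0.5, -0.5, 0.5], batch)
print(f_before, adj_before)
print(f_after, adj_after)
```

A large F and high adjusted R-squared before correction, followed by values near zero (or a negative adjusted R-squared) afterwards, is the pattern the QC tab is meant to surface.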

eveilyeverafter commented 2 years ago

Returning to the batch correction plugin. Taking the above considerations into account, I'm proposing the following requirements and specifications. Main changes indicated in boldface.

Req. The plugin shall perform batch correction based on an annotation column provided by the user
- spec: the plugin shall use log2 normalized count data
- spec: the plugin shall use a mixed effect model to estimate batch-specific effects based on that annotation provided.
- spec: the batch corrected data shall consist of the residuals shifted from the model center for each feature
Req. The plugin shall output the batch corrected data as a new dataset into the DSP DA environment
Req. The plugin shall output an excel document with multiple tabs representing QC metrics for the analysis
Req. The plugin shall QC whether the test variables are confounded by the batch variable
- spec: the model matrix for the test shall be tested for colinearity
- spec: the test shall output a tab within the excel showing the colinearity and flagging test variables that are confounded given the user-specified correlation threshold.
Req. The plugin shall QC the dataset before and after batch correction to confirm removal of batch effects
- spec: the plugin shall use PCA for QC
- spec: The plugin shall calculate the first 5 PCs.
- spec: The plugin shall use an ANOVA of the above PCs to determine if any are significantly associated with batch
- spec: the QC tab shall list the p-values of the ANOVA before and after batch correction, as well as the adjusted R^2 value.
Req. The plugin shall output a PCA plot showing the first 2 PCs before and after correction
- spec: the plugin shall output a PCA colored by batch
- spec: the plot shall contain on the left a graph before batch correction, and on the right a graph after batch correction
- spec: the plots shall have titles indicating they are before or after batch correction
- spec: The axes of the graphs should note the PC # and % Variance
Req. The plugin shall have font (family and size) and color definitions coded so that they may be changed by the user

The shifted residuals change above is the biggest change in the list and was reviewed by Lei before she left. Instead of returning the raw residuals from each feature-based model, it adds the model intercept back to the residuals. This way, the magnitude of the values is more similar to the expression value for that feature and the values are no longer centered at 0. This doesn't quantitatively affect the PCA metrics used in the plugin itself, but it will affect any downstream analyses. For example, cell deconvolution cannot handle data centered at 0.
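
The idea of shifted residuals can be sketched for a single feature as follows. This is a deliberately simplified illustration, not the plugin's model: batch effects are estimated here as plain per-batch means rather than the shrunken estimates a mixed-effect model would give, and the grand mean stands in for the model intercept. The function name and the example values are hypothetical.

```python
from statistics import fmean


def shifted_residuals(values, batch):
    """Remove per-batch offsets from one feature, then restore its center.

    Simplification (assumption): batch effects are per-batch means and the
    grand mean plays the role of the model intercept.
    """
    grand = fmean(values)
    by_batch = {}
    for v, b in zip(values, batch):
        by_batch.setdefault(b, []).append(v)
    batch_mean = {b: fmean(vs) for b, vs in by_batch.items()}

    # Raw residuals are centered at 0 within every batch.
    residuals = [v - batch_mean[b] for v, b in zip(values, batch)]
    # Shifting by the center puts them back on the expression scale.
    return [r + grand for r in residuals]


# Made-up log2 expression for one gene across two batches.
values = [4.0, 6.0, 8.0, 10.0]
batch = ["a", "a", "b", "b"]
corrected = shifted_residuals(values, batch)
print(corrected)          # batch offset removed, values stay near 7
print(fmean(corrected))   # same overall mean as the input
```

Because the shift is a constant per feature, a PCA that centers each feature gives the same result on raw and shifted residuals, while tools that expect values on the expression scale receive usable input.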

Adding @maddygriz.

JasonWReeves commented 2 years ago

@tylerhether these look good to me

Nanostring-Biostats / DSPPlugins

Batch Correction Plugin #16