Open JasonWReeves opened 3 years ago
Anticipated steps for data processing:
I think we need to gather the requirements for this issue.
@dnadave, can you review and harden any of the above requirements, and provide any code for the collinearity testing that you think would be useful?
For scope, should this plugin work with protein, RNA, and protein NGS or a subset of these?
I believe we should start with just RNA for now.
I have a question regarding the collinearity testing for batch correction. There are a few different ways we could go about this.
- We can compare the batching factor to one or more factors of biological interest. For example, we can compare values straight from the annotations data (e.g., slide ~ tumor_status where slide is the batching factor). If the user provides multiple biological factors of interest we can compute VIFs.
- Another approach would be to run PCA on the expression matrix (centered, scaled) and compare the first few PC values (independent variable) to one or more biologically interesting factors (dependent variables). We could compute VIFs that way to see if any biologically interesting factors are collinear with batch. We would do this twice: with the log2 Q3 expression matrix and with the batch-corrected expression matrix. Then we can compare those two models to check 1) whether we have collinearity and 2) whether it improves or gets worse after applying batch correction.
- Other possibilities I haven't thought of?
Biostats was just discussing this with the software team this afternoon. Having a collinearity check function would be very handy in general, with the batch correction function as a special case. You would need to have your table group factors together, as you will get a VIF for each level included in the model (remember, you'll have to drop one level for each factor when calculating VIFs).
I think this approach would be really handy to have in general and very important to check before you correct for a batch factor.
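The point about dropping one level per factor before computing VIFs can be sketched roughly as below. This is only an illustration in plain Python, not the plugin's code: the helper names (`dummy_columns`, `vifs`) and the annotation values are made up, and it relies on the standard identity that the VIF of a predictor is the matching diagonal entry of the inverse of the predictors' correlation matrix.

```python
from statistics import mean

def dummy_columns(name, values):
    """Indicator columns for one factor, dropping the first (reference) level."""
    levels = sorted(set(values))
    return {f"{name}={lvl}": [1.0 if v == lvl else 0.0 for v in values]
            for lvl in levels[1:]}

def correlation_matrix(cols):
    """Pearson correlation matrix of the columns, in dict order."""
    centered = []
    for col in cols.values():
        m = mean(col)
        centered.append([x - m for x in col])
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return [[dot(a, b) / (dot(a, a) * dot(b, b)) ** 0.5 for b in centered]
            for a in centered]

def invert(matrix, tol=1e-10):
    """Gauss-Jordan inverse; returns None when the matrix is singular."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[pivot][c]) < tol:
            return None
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [x / p for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def vifs(cols):
    """One VIF per dummy column (so one per non-reference factor level)."""
    inv = invert(correlation_matrix(cols))
    if inv is None:
        # Perfectly collinear design: report every VIF as infinite.
        # This is a simplification chosen for the sketch.
        return {name: float("inf") for name in cols}
    return {name: inv[i][i] for i, name in enumerate(cols)}

# Made-up annotations: slide is the batching factor.
slide = ["s1", "s1", "s2", "s2", "s3", "s3"]
tumor_status = ["T", "N", "T", "N", "T", "N"]
design = {**dummy_columns("slide", slide),
          **dummy_columns("tumor_status", tumor_status)}
result = vifs(design)
```

Because VIFs come back per level, a real check would still need to group them by factor before reporting, as noted above.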
VIFs are associated with independent variables, so they are invariant when you compare the models before and after batch correction. In that spirit, however, I have added a correlation cutoff that flags the user if batch is correlated with one or more of their biological factors of interest. If a given biologically interesting factor is completely correlated with batch, the user gets an additional warning.
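A minimal sketch of what such a cutoff check could look like. The thread does not say which correlation measure or threshold the plugin uses, so both are assumptions here: Cramér's V as the association measure between two categorical annotations, and 0.7 as the cutoff. The function names are hypothetical.

```python
from collections import Counter

def cramers_v(x, y):
    """Cramér's V between two categorical variables (0 = none, 1 = complete)."""
    n = len(x)
    joint = Counter(zip(x, y))
    rows, cols = Counter(x), Counter(y)
    chi2 = 0.0
    for a in rows:
        for b in cols:
            expected = rows[a] * cols[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    return (chi2 / (n * k)) ** 0.5 if k > 0 else 0.0

def check_batch(batch, factor, cutoff=0.7):
    """Return 'confounded', 'correlated', or 'ok' (cutoff is an assumed value)."""
    v = cramers_v(batch, factor)
    if v >= 1.0 - 1e-9:
        return "confounded"   # factor completely determined by batch
    if v >= cutoff:
        return "correlated"   # above the warning threshold
    return "ok"

# Made-up annotations for illustration.
slide = ["s1", "s1", "s2", "s2", "s3", "s3"]
status = check_batch(slide, ["T", "T", "N", "N", "N", "N"])
```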
The plugin computes a series of linear models comparing PC scores (the first 3) with batch and generates P-values and adjusted R-squared values before and after batch correction.
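As a rough illustration of the adjusted R-squared part (not the plugin's code), the sketch below fits the one-way model `PC score ~ batch` for a single PC by using batch means as fitted values. The function name and the example scores are made up, and the P-value is left out because it needs an F distribution, which the Python standard library does not provide.

```python
from statistics import mean

def adjusted_r2_by_batch(scores, batch):
    """Adjusted R-squared of the one-way linear model scores ~ batch.

    Assumes the scores are not all identical and that there are more
    samples than batch levels.
    """
    n = len(scores)
    groups = {}
    for s, b in zip(scores, batch):
        groups.setdefault(b, []).append(s)
    k = len(groups)                       # number of batch levels
    grand = mean(scores)
    ss_total = sum((s - grand) ** 2 for s in scores)
    # Fitted values of a one-way model are the per-batch means.
    ss_resid = sum((s - mean(g)) ** 2 for g in groups.values() for s in g)
    r2 = 1.0 - ss_resid / ss_total
    # k - 1 predictors plus an intercept leave n - k residual df.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)

# Made-up PC1 scores before and after correction, two slides.
batch = ["s1", "s1", "s2", "s2"]
before = adjusted_r2_by_batch([1.0, 3.0, 5.0, 7.0], batch)
after = adjusted_r2_by_batch([-1.0, 1.0, -1.0, 1.0], batch)
```

A drop in adjusted R-squared from `before` to `after` is the kind of change the before/after comparison is meant to surface.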
I would like to test it internally with other datasets before submitting a PR.
Returning to the batch correction plug-in. Taking the above considerations into account, I'm proposing the following requirements and specifications. Main changes are indicated in boldface.
The shifted residuals change above is the biggest change in the list and was reviewed by Lei before she left. Instead of returning the residuals from each feature-based model, it multiplies the residuals by the intercept. This way, the magnitude of the values is more similar to the expression value for that feature and the values are no longer centered at 0. This doesn't quantitatively affect PCA metrics used in the plug-in itself but will affect any downstream analyses. For example, cell deconvolution cannot handle data centered at 0.
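A sketch of one possible reading of the shifted residuals, with the assumptions stated up front. The comment says the residuals are multiplied by the intercept; since scaling values centered at 0 leaves them centered at 0, this sketch assumes the shift is additive on the log2 scale (which corresponds to a multiplication on the linear scale). It also assumes the intercept is the feature's overall mean; a model with a reference batch level would give a different intercept. The function name is hypothetical.

```python
from statistics import mean

def shifted_residuals(values, batch):
    """Per-feature batch residuals, shifted back by the intercept.

    values: log2 expression of one feature across samples.
    batch:  batch label for each sample.
    Assumption: intercept = overall mean of the feature.
    """
    intercept = mean(values)
    by_batch = {}
    for v, b in zip(values, batch):
        by_batch.setdefault(b, []).append(v)
    batch_mean = {b: mean(vs) for b, vs in by_batch.items()}
    # Residual from the batch-only model, then add the intercept back so
    # the result sits near the feature's expression level instead of 0.
    return [v - batch_mean[b] + intercept for v, b in zip(values, batch)]

# Made-up log2 values for one feature on two slides.
corrected = shifted_residuals([4.0, 6.0, 8.0, 10.0], ["a", "a", "b", "b"])
```

Under these assumptions the corrected values keep the feature's overall mean, which is what lets downstream tools that cannot handle zero-centered data work with them.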
Adding @maddygriz.
@tylerhether these look good to me
Inputs
Standard controls
Requirements