Closed cvanderaa closed 4 months ago
After exploring the statistical inference implementation, I realised that I wrongly interpreted the coefficients when factors are encoded as sum contrasts. Previously, when one of the groups is encoded as the "reference" (in fact, there is no longer a reference group with sum contrast, but it is encoded as -1 in all corresponding variables of the model matrix), I simply computed the logFC as 2x the coefficient associated to the second group to compare to. This approach only works when there are two groups to compare.
Instead, and based on Lieven's suggestion (cf his course material), I now use a contrast matrix $L$, such that:
$logFC_{Mock1-Mock2} = L^T \hat{\beta}$
and
$var (logFC) = L^T \Sigma L$
where $\Sigma$ is the variance-covariance matrix. The contrast matrix is build based on the available levels for each peptide so that the coefficient are correctly matched to the desired group contrast.
Let's explore with the new implementation using the same script as in the initial comment. The volcano plot becomes
And the estimated logFC becomes
APNVVVTR
-0.002480005
which is very close to the empirical value mentioned in the comment above.
This approach is implemented since https://github.com/UCLouvain-CBIO/scp/commit/6b340848afe6ad3214b2c96b495e0aa234c082db.
I consider this issue as solved unless @lgatto you have additional comments/suggestions?
Hello,
I believe I am using the version (SHA1: 610cdffa) in which this issue has been corrected. However, when I am using the scpDifferentialAnalysis() function, there seemed to be some mistakes in the result. When I manually compute the mean as mentioned above and compared that with the estimates generated from the scpDifferentialAnalysis() function, there is a discrepancy. In some extreme cases, I can see when a protein that should've had higher abundance in one group is shown to have higher abundance in the opposite group. I have four different groups in my data. I tried scpDifferentialAnalysis(sce, contrast = list(c("SampleType", "group1", "group1")) as a step to troubleshoot the issue, I am getting a volcano plot that shows significant differential expression, which is not what I expect to see.
Hello @shimin-chen,
Thanks again for reaching out. Could you please provide a minimal reproducible example so that I can use your example as a test case for debugging? For instance, you can provide a QFeatures object with one of the peptides for which you see problematic estimation as well as the code that runs the model and computes the mean.
Meanwhile, here are a few comments:
SampleType
with other variables, make sure to compare means on data with only SampleType
effect, that is after using scpKeepEffect()
.Hi Christophe,
Once again, I appreciate your response and continued support and development on this tool.
I tried to come up with a QFeatures object with just the problematic peptides using sce1 <- sce[c(peptide1, peptide2), , ]
. Unfortunately, I got an error subscript contains invalid names
when trying to run scpDifferentialAnalysis(sce1, contrast = list(c("SampleType", "group1", "group2")))
. Please let me know if you have suggestions on how to generate the QFeatures object with just the problematic peptides.
To respond to your comments-
yes, I checked again using devtools::package_info("scp")
and indeed it return
scp * 1.13.2 2024-02-26 [1] Github (UCLouvain-CBIO/scp@610cdff)
I used the exact same code as shown in your reproducible example above, with the necessary modifications on the peptide sequence and the column name for group information. I did use scpKeepEffect()
before calculating mean.
I could be wrong about this - I felt that the issue may have something to do with the contrast matrix resulting from having more than 2 groups in SampleType (I have 4 groups in total).
Please let me know if I can provide more information for debugging.
Again, thanks for pointing out a new issue! I opened an issue here #58 to solve this. Since I think it will take us some time to solve this, would it be ok for you to share your full data set? Depending on its size, you may need to share it through an external server like GDrive, Dropbox, OneDrive, etc. Then, could you also provide the minimal code you used to highlight your issue? Even if it is almost copy-pasted from my comment, it will avoid me guess work :wink:
If your data is confidential and you are not allowed to share it, could you try to first subset sce1
, then run ScpModelWorkflow()
and then see if you can repeat the problem? If yes, you can send me that subset data set.
I managed to reproduce the issue with a subsetted sce object to be used before running the ScpModelWorkflow()
. Below is the code for reproducing the issue. I changed the SampleType names to show only the first letter.
library(scp)
library(ggplot2)
load("sce_debugging3.RData")
# modeling
sce <- scpModelWorkflow(
sce1,
formula = ~ 1 +
MedianIntensity +
File.ID +
SampleType
)
# Differential analysis
group1 <- "P"
group2 <- "D"
daRes <- scpDifferentialAnalysis(
sce, contrast = list(c("SampleType", group1, group2))
)
scpVolcanoPlot(daRes)
# Calculate mean
i <- "[K].aTGPPVSELITk.[A]"
y <- assay(scpKeepEffect(sce, "SampleType"))[i, ]
type <- sce$SampleType
(means <- sapply(split(y, type), mean, na.rm = TRUE))
# show different results
means[[group1]] - means[[group2]]
daRes[[1]]$Estimate[daRes[[1]]$feature == i]
I get this result
> means[[group1]] - means[[group2]]
[1] -0.3273939
> daRes[[1]]$Estimate[daRes[[1]]$feature == i]
[K].aTGPPVSELITk.[A]
0.8456961
Interestingly, when I reorder the factor level in SampleType, the calculation is correct.
library(scp)
library(ggplot2)
load("sce_debugging3.RData")
# change the order of factor level
library(stringr)
sce1$SampleType <- factor(sce1$SampleType, levels = c("D", "H", "L", "P"))
# modeling
sce <- scpModelWorkflow(
sce1,
formula = ~ 1 +
MedianIntensity +
File.ID +
SampleType
)
# Differential analysis
group1 <- "P"
group2 <- "D"
daRes <- scpDifferentialAnalysis(
sce, contrast = list(c("SampleType", group1, group2))
)
scpVolcanoPlot(daRes)
# Calculate mean
i <- "[K].aTGPPVSELITk.[A]"
y <- assay(scpKeepEffect(sce, "SampleType"))[i, ]
type <- sce$SampleType
(means <- sapply(split(y, type), mean, na.rm = TRUE))
# show results
means[[group1]] - means[[group2]]
daRes[[1]]$Estimate[daRes[[1]]$feature == i]
I seem to be getting the correct result this time
> means[[group1]] - means[[group2]]
[1] -0.3273944
> daRes[[1]]$Estimate[daRes[[1]]$feature == i]
[K].aTGPPVSELITk.[A]
-0.3274025
If I compare the same group:
# Differential analysis
group1 <- "P"
group2 <- "P"
I get:
> means[[group1]] - means[[group2]]
[1] 0
> daRes[[1]]$Estimate[daRes[[1]]$feature == i]
[K].aTGPPVSELITk.[A]
0.6724825
This is very helpful! I was able to reproduce your error and found a nasty little bug in the code that, as you have noticed, incorrectly assigned the levels when building contrasts.
This is fixed since https://github.com/UCLouvain-CBIO/scp/commit/43490720345757957289ddb14220f3937f879490
So, many thanks again for pointing this out and for providing an example that facilitated debugging. My apologies for the inconvenience :pray:
Fantastic! Thank you so much for the prompt response and actions on this issue 😀 Hope you have a great weekend!
While testing the scplainer's differential analysis approach on data with mock biological labels (i.e. I artificially created 2 groups within the same cell type), I noticed there is something wrong when the biological variable has more than 2 groups (e.g. melanoma, mock1, mock2).
Reproducible example
Load data
Assign mock labels to one of the cell types (monocytes). The mock label is randomly assigned within each MS batch.
Model and analyse with mock labels
The problem
While we expect no differential peptides, the volcano plot reports a few strongly differentially expressed peptides.
Let's explore the most differentially expressed peptide. After keeping only the biological effect (with mock), I manually compute the mean of each group:
The difference between the Mock1 and Mock2 groups is ~ -0.0025, but the computed logFC is
This is far from the expected value... (to be continued)