Incorrect inference with scplainer

UCLouvain-CBIO / scp

Single cell proteomics data processing

https://uclouvain-cbio.github.io/scp/index.html

19 stars 2 forks source link

Incorrect inference with scplainer #56

Closed cvanderaa closed 4 months ago

cvanderaa commented 5 months ago

While testing the scplainer's differential analysis approach on data with mock biological labels (i.e. I artificially created 2 groups within the same cell type), I noticed there is something wrong when the biological variable has more than 2 groups (e.g. melanoma, mock1, mock2).

Reproducible example

Load data

library("scp")
library("ggplot2")
## Get the data by running "./scripts/2_modelling/leduc2022_pSCoPE.R"
sce <- readRDS("./data/leduc2022_pSCoPE_modelled.rds")

Assign mock labels to one of the cell types (monocytes). The mock label is randomly assigned within each MS batch.

mock <- sce$SampleType
mock <- split(mock, sce$Set)
set.seed(1234)
mock <- lapply(mock, function(labels) {
    i <- which(labels == "Monocyte")
    i1 <- sample(i, length(i) / 2)
    i2 <- i[!i %in% i1]
    labels[i1] <- "Mock1"
    labels[i2] <- "Mock2"
    labels
})
sce$Mock <- unname(do.call(c, mock))
## Subsample for faster run
set.seed(1234)
sce <- sce[sample(seq_len(nrow(sce)), 500), ]

Model and analyse with mock labels

sce <- scpModelWorkflow(
    sce,
    formula = ~ 1 + ## intercept
        ## normalization
        MedianIntensity +
        ## batch effects
        Channel + Set + 
        ## biological variability
        Mock
)
res <- scpDifferentialAnalysis(
    sce,
    contrasts = list(c("Mock", "Mock1", "Mock2"))
)
scpVolcanoPlot(res)

The problem

While we expect no differential peptides, the volcano plot reports a few strongly differentially expressed peptides.

Let's explore the most differentially expressed peptide. After keeping only the biological effect (with mock), I manually compute the mean of each group:

i <- "APNVVVTR"
y <- assay(scpKeepEffect(sce, "Mock"))[i, ]
type <- sce$Mock
(means <- sapply(split(y, type), mean, na.rm = TRUE)

  Melanoma      Mock1      Mock2 
-0.8228368  0.4101952  0.4126721

The difference between the Mock1 and Mock2 groups is ~ -0.0025, but the computed logFC is

res[[1]]$Estimate[res[[1]]$feature == i]

 APNVVVTR 
0.8203617

This is far from the expected value... (to be continued)

cvanderaa commented 5 months ago

After exploring the statistical inference implementation, I realised that I wrongly interpreted the coefficients when factors are encoded as sum contrasts. Previously, when one of the groups is encoded as the "reference" (in fact, there is no longer a reference group with sum contrast, but it is encoded as -1 in all corresponding variables of the model matrix), I simply computed the logFC as 2x the coefficient associated to the second group to compare to. This approach only works when there are two groups to compare.

Instead, and based on Lieven's suggestion (cf his course material), I now use a contrast matrix $L$, such that:

$logFC_{Mock1-Mock2} = L^T \hat{\beta}$

and

$var (logFC) = L^T \Sigma L$

where $\Sigma$ is the variance-covariance matrix. The contrast matrix is build based on the available levels for each peptide so that the coefficient are correctly matched to the desired group contrast.

Updated figures

Let's explore with the new implementation using the same script as in the initial comment. The volcano plot becomes

And the estimated logFC becomes

    APNVVVTR 
-0.002480005

which is very close to the empirical value mentioned in the comment above.

Important notes

I did not test this in case of interactions and expect the method to crash with interactions.
On top of wrong logFCs, I also found a bug in the variance estimation that was overestimated. This means that computed variances (for 2 group comparisons) are now lower, hence meaning that more peptides are found significant.

This approach is implemented since https://github.com/UCLouvain-CBIO/scp/commit/6b340848afe6ad3214b2c96b495e0aa234c082db.

cvanderaa commented 5 months ago

I consider this issue as solved unless @lgatto you have additional comments/suggestions?

shimin-chen commented 4 months ago

Hello,

I believe I am using the version (SHA1: 610cdffa) in which this issue has been corrected. However, when I am using the scpDifferentialAnalysis() function, there seemed to be some mistakes in the result. When I manually compute the mean as mentioned above and compared that with the estimates generated from the scpDifferentialAnalysis() function, there is a discrepancy. In some extreme cases, I can see when a protein that should've had higher abundance in one group is shown to have higher abundance in the opposite group. I have four different groups in my data. I tried scpDifferentialAnalysis(sce, contrast = list(c("SampleType", "group1", "group1")) as a step to troubleshoot the issue, I am getting a volcano plot that shows significant differential expression, which is not what I expect to see.

cvanderaa commented 4 months ago

Hello @shimin-chen,

Thanks again for reaching out. Could you please provide a minimal reproducible example so that I can use your example as a test case for debugging? For instance, you can provide a QFeatures object with one of the peptides for which you see problematic estimation as well as the code that runs the model and computes the mean.

Meanwhile, here are a few comments:

You should indeed use the code since commit https://github.com/UCLouvain-CBIO/scp/commit/610cdffa3a8d7fb9c0d62d5076187143fae5082d. Make sure that you have this code installed (I know from personal experience that you can quickly overwrite package installation when using cutom refs/commits)
If you model SampleType with other variables, make sure to compare means on data with only SampleType effect, that is after using scpKeepEffect().
A contrast that involves the same group, eg group1 vs group1, should not be used as a troubleshoot. Thanks for pointing this out, I will make it return an error (I'm surprised it did not return an error naively).

shimin-chen commented 4 months ago

Hi Christophe,

Once again, I appreciate your response and continued support and development on this tool.

I tried to come up with a QFeatures object with just the problematic peptides using sce1 <- sce[c(peptide1, peptide2), , ]. Unfortunately, I got an error subscript contains invalid names when trying to run scpDifferentialAnalysis(sce1, contrast = list(c("SampleType", "group1", "group2"))) . Please let me know if you have suggestions on how to generate the QFeatures object with just the problematic peptides.

To respond to your comments-

yes, I checked again using devtools::package_info("scp")and indeed it return

scp * 1.13.2 2024-02-26 [1] Github (UCLouvain-CBIO/scp@610cdff)
I used the exact same code as shown in your reproducible example above, with the necessary modifications on the peptide sequence and the column name for group information. I did use scpKeepEffect() before calculating mean.

I could be wrong about this - I felt that the issue may have something to do with the contrast matrix resulting from having more than 2 groups in SampleType (I have 4 groups in total).

Please let me know if I can provide more information for debugging.

cvanderaa commented 4 months ago

Again, thanks for pointing out a new issue! I opened an issue here #58 to solve this. Since I think it will take us some time to solve this, would it be ok for you to share your full data set? Depending on its size, you may need to share it through an external server like GDrive, Dropbox, OneDrive, etc. Then, could you also provide the minimal code you used to highlight your issue? Even if it is almost copy-pasted from my comment, it will avoid me guess work :wink:

If your data is confidential and you are not allowed to share it, could you try to first subset sce1, then run ScpModelWorkflow() and then see if you can repeat the problem? If yes, you can send me that subset data set.

shimin-chen commented 4 months ago

I managed to reproduce the issue with a subsetted sce object to be used before running the ScpModelWorkflow(). Below is the code for reproducing the issue. I changed the SampleType names to show only the first letter.

library(scp)
library(ggplot2)
load("sce_debugging3.RData")

# modeling
sce <- scpModelWorkflow(
  sce1,
  formula = ~ 1 + 
    MedianIntensity +
    File.ID + 
    SampleType
)

# Differential analysis 
group1 <- "P"
group2 <- "D"

daRes <- scpDifferentialAnalysis(
  sce, contrast = list(c("SampleType", group1, group2))
)

scpVolcanoPlot(daRes)

# Calculate mean 
i <- "[K].aTGPPVSELITk.[A]"
y <- assay(scpKeepEffect(sce, "SampleType"))[i, ]
type <- sce$SampleType
(means <- sapply(split(y, type), mean, na.rm = TRUE))

# show different results
means[[group1]] - means[[group2]] 
daRes[[1]]$Estimate[daRes[[1]]$feature == i]

I get this result

> means[[group1]] - means[[group2]] 
[1] -0.3273939
> daRes[[1]]$Estimate[daRes[[1]]$feature == i]
[K].aTGPPVSELITk.[A] 
           0.8456961

Interestingly, when I reorder the factor level in SampleType, the calculation is correct.

library(scp)
library(ggplot2)
load("sce_debugging3.RData")

# change the order of factor level
library(stringr)
sce1$SampleType <- factor(sce1$SampleType, levels = c("D", "H", "L", "P")) 

# modeling
sce <- scpModelWorkflow(
  sce1,
  formula = ~ 1 + 
    MedianIntensity +
    File.ID + 
    SampleType
)

# Differential analysis 
group1 <- "P"
group2 <- "D"

daRes <- scpDifferentialAnalysis(
  sce, contrast = list(c("SampleType", group1, group2))
)

scpVolcanoPlot(daRes)

# Calculate mean 
i <- "[K].aTGPPVSELITk.[A]"
y <- assay(scpKeepEffect(sce, "SampleType"))[i, ]
type <- sce$SampleType
(means <- sapply(split(y, type), mean, na.rm = TRUE))

# show results
means[[group1]] - means[[group2]] 
daRes[[1]]$Estimate[daRes[[1]]$feature == i]

I seem to be getting the correct result this time

> means[[group1]] - means[[group2]] 
[1] -0.3273944
> daRes[[1]]$Estimate[daRes[[1]]$feature == i]
[K].aTGPPVSELITk.[A] 
          -0.3274025

If I compare the same group:

# Differential analysis 
group1 <- "P"
group2 <- "P"

I get:

> means[[group1]] - means[[group2]] 
[1] 0
> daRes[[1]]$Estimate[daRes[[1]]$feature == i]
[K].aTGPPVSELITk.[A] 
           0.6724825

cvanderaa commented 4 months ago

This is very helpful! I was able to reproduce your error and found a nasty little bug in the code that, as you have noticed, incorrectly assigned the levels when building contrasts.

This is fixed since https://github.com/UCLouvain-CBIO/scp/commit/43490720345757957289ddb14220f3937f879490

So, many thanks again for pointing this out and for providing an example that facilitated debugging. My apologies for the inconvenience :pray:

shimin-chen commented 4 months ago

Fantastic! Thank you so much for the prompt response and actions on this issue 😀 Hope you have a great weekend!