P-value vs. log fold change discrepancies when fitting a model with a covariate

argschwind commented 4 years ago

Hi,

I'm analyzing some single-cell RNA-seq data from a pooled CRISPR-screen. Using the known target genes of well understood CRISPR interference perturbations we compared different DE methods, and came to the conclusion that MAST with a nGenes (ngeneson in your tutorial) covariate performs best based on a AUPRC analysis.

However, when extracting the effect sizes for the individual coefficients, mainly the CRISPR perturbation effect, I discovered that the most significant genes tend to have very low log fold changes in expression. This is only the case when fitting a model with the nGenes covariate. When fitting a model with only the perturbation effect, significant genes also show a larger logFC as I would expect. Furthermore, I noticed that in many cases nGenes is highly significant, although only showing relatively small logFC changes.

I attached an example, which can be run by downloading an sca object from Google drive.

Fitting the model with nGenes covariate and create volcano plot with p-value vs. logFC for perturbation:

library(MAST)
library(googledrive)
library(tidyverse)

# download example data from google drive
sca_url <- "https://drive.google.com/file/d/126SsL2mIAEQeKRsQwMpGKYqmnHMNdHt_/view?usp=sharing"
tmp <- tempfile(fileext = ".rds")
drive_download(sca_url, path = tmp)
sca <- readRDS(tmp)

# fit model
zlm_fit <- zlm(~pert+nGenes, sca)

# perform likelihood ratio test for the perturbation coefficient
summary_zlm_fit <- summary(zlm_fit, doLRT = "pert1")
summary_dt <- summary_zlm_fit$datatable

# extract p-values and logFC for each gene
pvals <- summary_dt[contrast == "pert1" & component == "H", .(primerid, `Pr(>Chisq)`)]
lfc <- summary_dt[contrast == "pert1" & component == "logFC", .(primerid, coef, ci.hi, ci.lo)]

# assemble output
output <- as.data.frame(merge(lfc, pvals, by = "primerid"))
colnames(output) <- c("gene", "logFC", "ci_high", "ci_low", "pvalue")
output <- output[order(output$pvalue), ]

# volcano plot for perturbation effect
qplot(x = logFC, y = -log10(pvalue), data = output, main = "perturbation") +
  theme_bw()

perturbation

Creating the same plot for the nGenes covariate:

# perform likelihood ratio test for nGenes coefficient
summary_zlm_fit <- summary(zlm_fit, doLRT = "nGenes")
summary_dt <- summary_zlm_fit$datatable
pvals <- summary_dt[contrast == "nGenes" & component == "H", .(primerid, `Pr(>Chisq)`)]
lfc <- summary_dt[contrast == "nGenes" & component == "logFC", .(primerid, coef, ci.hi, ci.lo)]
output <- as.data.frame(merge(lfc, pvals, by = "primerid"))
colnames(output) <- c("gene", "logFC", "ci_high", "ci_low", "pvalue")
output <- output[order(output$pvalue), ]

# volcano plot for nGenes effect
qplot(x = logFC, y = -log10(pvalue), data = output, main = "nGenes") +
  theme_bw()

nGenes

When fitting a model with perturbation as the only variable, the volcano plot looks fine:

# refit model with only the perturbation coefficient
zlm_fit <- zlm(~pert, sca)

# perform likelihood ratio test for perturbation coefficient
summary_zlm_fit <- summary(zlm_fit, doLRT = "pert1")
summary_dt <- summary_zlm_fit$datatable
pvals <- summary_dt[contrast == "pert1" & component == "H", .(primerid, `Pr(>Chisq)`)]
lfc <- summary_dt[contrast == "pert1" & component == "logFC", .(primerid, coef, ci.hi, ci.lo)]
output <- as.data.frame(merge(lfc, pvals, by = "primerid"))
colnames(output) <- c("gene", "logFC", "ci_high", "ci_low", "pvalue")
output <- output[order(output$pvalue), ]

# volcano plot for model with only perturbation
qplot(x = logFC, y = -log10(pvalue), data = output, main = "model with only perturbation")

pertOnly

Note: In this example none of the genes would pass an FDR threshold of 0.05, but the same effects are also seen for highly significant cases.

My R session:

R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] forcats_0.5.0               stringr_1.4.0              
 [3] dplyr_1.0.0                 purrr_0.3.4                
 [5] readr_1.3.1                 tidyr_1.1.0                
 [7] tibble_3.0.1                ggplot2_3.3.2              
 [9] tidyverse_1.3.0             googledrive_1.0.1          
[11] MAST_1.14.0                 SingleCellExperiment_1.10.1
[13] SummarizedExperiment_1.18.1 DelayedArray_0.14.0        
[15] matrixStats_0.56.0          Biobase_2.48.0             
[17] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
[19] IRanges_2.22.2              S4Vectors_0.26.1           
[21] BiocGenerics_0.34.0        

loaded via a namespace (and not attached):
 [1] httr_1.4.1             jsonlite_1.7.0         modelr_0.1.8          
 [4] assertthat_0.2.1       askpass_1.1            blob_1.2.1            
 [7] GenomeInfoDbData_1.2.3 cellranger_1.1.0       yaml_2.2.1            
[10] progress_1.2.2         pillar_1.4.4           backports_1.1.8       
[13] lattice_0.20-41        glue_1.4.1             digest_0.6.25         
[16] XVector_0.28.0         rvest_0.3.5            colorspace_1.4-1      
[19] Matrix_1.2-18          plyr_1.8.6             pkgconfig_2.0.3       
[22] broom_0.5.6            haven_2.3.1            zlibbioc_1.34.0       
[25] scales_1.1.1           openssl_1.4.2          generics_0.0.2        
[28] farver_2.0.3           ellipsis_0.3.1         withr_2.2.0           
[31] cli_2.0.2              magrittr_1.5           crayon_1.3.4          
[34] readxl_1.3.1           fs_1.4.2               fansi_0.4.1           
[37] nlme_3.1-148           xml2_1.3.2             tools_4.0.0           
[40] data.table_1.12.8      prettyunits_1.1.1      hms_0.5.3             
[43] gargle_0.5.0           lifecycle_0.2.0        munsell_0.5.0         
[46] reprex_0.3.0           compiler_4.0.0         rlang_0.4.6           
[49] grid_4.0.0             RCurl_1.98-1.2         rstudioapi_0.11       
[52] bitops_1.0-6           labeling_0.3           gtable_0.3.0          
[55] abind_1.4-5            DBI_1.1.0              curl_4.3              
[58] reshape2_1.4.4         R6_2.4.1               lubridate_1.7.9       
[61] stringi_1.4.6          Rcpp_1.0.5             vctrs_0.3.1           
[64] dbplyr_1.4.4           tidyselect_1.1.0

gfinak commented 4 years ago

What do you get if you center and scale your nGenes covariate?

argschwind commented 4 years ago

Thanks for the quick response. Scaling the nGenes covariate does the trick, and the log fold changes now look as expected.

# center and scale nGenes
cd <- colData(sca)
cd$nGenes <- as.numeric(scale(cd$nGenes))
colData(sca) <- cd

# histogram of centered and scaled gene counts
hist(cd$nGenes, main = "nGenes centered and scaled", xlab = "nGenes")

nGenes_distr

# fit model
zlm_fit <- zlm(~pert+nGenes, sca)

# perform likelihood ratio test for the perturbation coefficient
summary_zlm_fit <- summary(zlm_fit, doLRT = "pert1")
summary_dt <- summary_zlm_fit$datatable

# extract p-values and logFC for each gene
pvals <- summary_dt[contrast == "pert1" & component == "H", .(primerid, `Pr(>Chisq)`)]
lfc <- summary_dt[contrast == "pert1" & component == "logFC", .(primerid, coef, ci.hi, ci.lo)]

# assemble output
output <- as.data.frame(merge(lfc, pvals, by = "primerid"))
colnames(output) <- c("gene", "logFC", "ci_high", "ci_low", "pvalue")
output <- output[order(output$pvalue), ]

# volcano plot for perturbation effect
qplot(x = logFC, y = -log10(pvalue), data = output, main = "perturbation") +
  theme_bw()

volcano_fixed

Best, Andreas

RGLab / MAST

P-value vs. log fold change discrepancies when fitting a model with a covariate #137