YosefLab / SymSim

23 stars 8 forks source link

getDEgenes reports strange logfoldchanges #24

Closed fhausmann closed 3 years ago

fhausmann commented 3 years ago

The theoretical logFoldchanges from SymSim compared to the ones reported by other methods, e.g. scanpy, on the true_counts are rather different. I expected that there is at least some correlation, because for sure the values between different methods differ, but it is calculated on the same data without any further noise. Unfortunately it is not the case and both methods return quite different values. Because we found that scanpy's logfoldchanges are correlated with those from other DE testing frameworks (tested on other datasets), I assume that the underlying cause is in getDEgenes. Maybe there is just some usage mistake, where you can help and clarify the difference. Thanks in advance for your help.

getDEgenes

Code to reproduce:

library("SymSim")
params =  list(
    min_popsize = 1000,
    i_minpop = 2,
    nevf = 10,
    evf_type = 'discrete',
    n_de_evf = 9,
    vary = 's',
    batch_effect_size = 1,
    gene_effects_sd = 2,
    Sigma = 0.2,
    scale_s = 0.4,
    alpha_mean = 0.05,
    alpha_sd = 0.025,
    depth_mean = 3e+05,
    depth_sd = 150000,
    nPCR1 = 14,
    prop_hge = 0.01,
    mean_hge = 3,
    protocol = 'UMI',
    gene_effect_prob = 0.3,
    ncells_total = 20000,
    ngenes = 2000,
    randseed = 0,
    nbatch = 2
)
phyla1 <- Phyla5()
data(gene_len_pool)
gene_len <- sample(gene_len_pool, params$ngenes, replace = FALSE)
true_counts_res <- SimulateTrueCounts(ncells_total = params$ncells_total, min_popsize = params$min_popsize, 
    i_minpop = params$i_minpop, ngenes = params$ngenes, nevf = params$nevf, evf_type = params$evf_type, 
    n_de_evf = params$n_de_evf, vary = params$vary, Sigma = params$Sigma, phyla = phyla1, 
    randseed = params$randseed, gene_effects_sd = params$gene_effects_sd, gene_effect_prob = params$gene_effect_prob, 
    scale_s = params$scale_s, prop_hge = params$prop_hge, mean_hge = params$mean_hge)
true_counts_res_dis <- true_counts_res
Matrix::writeMM(Matrix::Matrix(true_counts_res$counts, sparse = TRUE), "counts.mtx")
write.csv(true_counts_res$cell_meta, "meta.csv")

DEinfo <- getDEgenes(true_counts_res = true_counts_res_dis, popA = 1, popB = 2)
de_data = as.data.frame(DEinfo)
write.csv(de_data, "1_vs_2.csv")

And scanpy for comparison:

import scanpy as sc
import scipy.io
import pandas as pd
import anndata
import seaborn as sns

counts = scipy.io.mmread("counts.mtx")
meta = pd.read_csv("meta.csv", index_col="cellid", usecols=["cellid","pop"])
adata = anndata.AnnData(X=counts.todense().T, obs=meta)
adata.obs["pop"] = adata.obs["pop"].astype("category")
sc.pp.log1p(adata)
sc.tl.rank_genes_groups(adata, method="t-test", groupby="pop", reference=1) 

measured = pd.DataFrame({"genes": adata.uns['rank_genes_groups']["names"]["2"],
                         "logfoldchanges":adata.uns['rank_genes_groups']["logfoldchanges"]["2"]})
measured.set_index("genes", inplace=True)

theoretical = pd.read_csv("1_vs_2.csv", index_col=0)
theoretical.index = theoretical.index.astype(str)

merged = pd.merge(measured,theoretical, left_index=True, right_index=True)
sns.relplot(x="logFC_theoretical", y="logfoldchanges", data=merged)

Versions:

R version 4.0.3 (2020-10-10)
Platform: x86_64-redhat-linux-gnu (64-bit)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] stats4    parallel  grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] SymSim_0.0.0.9000           SummarizedExperiment_1.20.0
 [3] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
 [5] MatrixGenerics_1.2.1        matrixStats_0.58.0         
 [7] IRanges_2.24.1              S4Vectors_0.28.1           
 [9] repr_1.1.3                  phytools_0.7-70            
[11] maps_3.3.0                  roxygen2_7.1.1             
[13] stringi_1.5.3               MASS_7.3-53                
[15] ape_5.4-1                   Biobase_2.50.0             
[17] BiocGenerics_0.36.0         RColorBrewer_1.1-2         
[19] reshape_0.8.8               Rtsne_0.15                 
[21] ggplot2_3.3.3               plyr_1.8.6                 

loaded via a namespace (and not attached):
 [1] jsonlite_1.7.2          tmvnsim_1.0-2           gtools_3.8.2           
 [4] expm_0.999-6            GenomeInfoDbData_1.2.4  numDeriv_2016.8-1.1    
 [7] pillar_1.4.7            lattice_0.20-41         glue_1.4.2             
[10] quadprog_1.5-8          phangorn_2.5.5          uuid_0.1-4             
[13] digest_0.6.27           XVector_0.30.0          colorspace_2.0-0       
[16] htmltools_0.5.1.1       Matrix_1.2-18           pkgconfig_2.0.3        
[19] zlibbioc_1.36.0         purrr_0.3.4             scales_1.1.1           
[22] tibble_3.0.6            combinat_0.0-8          ellipsis_0.3.1         
[25] withr_2.4.1             mnormt_2.0.2            magrittr_1.5           
[28] crayon_1.4.1            evaluate_0.14           nlme_3.1-149           
[31] xml2_1.3.2              tools_4.0.3             lifecycle_1.0.0        
[34] stringr_1.4.0           munsell_0.5.0           DelayedArray_0.16.1    
[37] plotrix_3.8-1           compiler_4.0.3          clusterGeneration_1.3.7
[40] rlang_0.4.10            RCurl_1.98-1.2          pbdZMQ_0.3-5           
[43] IRkernel_1.1.1          igraph_1.2.6            bitops_1.0-6           
[46] base64enc_0.1-3         gtable_0.3.0            R6_2.5.0               
[49] knitr_1.30              fastmatch_1.1-0         IRdisplay_1.0          
[52] Rcpp_1.0.6              vctrs_0.3.6             scatterplot3d_0.3-41   
[55] xfun_0.20               coda_0.19-4            
fhausmann commented 3 years ago

Sorry for the confusion. Scanpy uses other indices than SymSim. Therefore the genes get mixed.
If you correct it, it looks much better.