drisso / zinbwave

Clone of the Bioconductor repository for the zinbwave package, see https://bioconductor.org/packages/zinbwave
43 stars 10 forks source link

Build model on subset of data #10

Closed jma1991 closed 6 years ago

jma1991 commented 6 years ago

Hello, I am working with a technology similar to scRNA-seq which produces a lot of drop-outs and results in over-estimated dispersion. I would like to use the zinbFit function to model the dispersion correctly (as shown in Figure 1 of your paper). My dataset is rather large (700,000 features by 6 samples), and the function takes hours to finish. I was wondering if there was a way to speed up the model building step? In a separate issue you spoke about using a subset of the data and then extrapolating the results to the full set of data. Is it possible to do that with the model building step? Many thanks, James.

drisso commented 6 years ago

Hi @jma1991

What do you mean exactly by building the model on a subset of data?

What we typically do in scRNA-seq data is to focus on the 1,000 or so most variable genes and infer W from those, so perhaps that might be a good solution for your data too. I.e., reduce the number of features with some reasonable model selection step before fitting the model.

jma1991 commented 6 years ago

Hi @drisso

I've attached an image of the dispersion estimates from edgeR for my data. The red line is the trended dispersion. In my mind it over-estimates the dispersion, particularly at the larger end of the AveLogCPM range. I was hoping to use zinbwave to fit a more appropriate trend. I could just filter the date more aggressively but I find that a lot of true positives at the lower end of the AveLogCPM range are filtered.

drisso commented 6 years ago

What is the goal of your analysis? Is it differential expression?

If so, you may have a look at our latest preprint on how to compute weights with zinbwave to be used with edgeR to get a DE model that accounts for zero inflation.

Preprint: https://www.biorxiv.org/content/early/2018/01/18/250126 Code: https://github.com/statOmics/zinbwaveZinger

If your goal is DE you can simply specify K=0 (the default) which should be fairly fast even with all the features.

jma1991 commented 6 years ago

It is to measure differential expression. Fitting the model to my data took quite a while:

> ptm <- proc.time()
> zinbModel <- zinbFit(assay(rawData), X = designMatrix, K = 0)
> proc.time() - ptm
    user   system  elapsed 
 182.403   43.476 4559.516

But computing the observational weights was instant:

> ptm <- proc.time()
> zinbWeights <- computeObservationalWeights(zinbModel, assay(rawData))
> proc.time() - ptm
   user  system elapsed 
  1.631   0.159   1.795

I've included my session info, incase you believe the timings are for some reason longer than you'd expect:

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] zinbwave_1.0.0             SingleCellExperiment_1.0.0
 [3] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
 [5] matrixStats_0.52.2         Biobase_2.38.0            
 [7] GenomicRanges_1.30.1       GenomeInfoDb_1.14.0       
 [9] IRanges_2.12.0             S4Vectors_0.16.0          
[11] BiocGenerics_0.24.0        edgeR_3.20.7              
[13] limma_3.34.5              

loaded via a namespace (and not attached):
 [1] pcaPP_1.9-73           Rcpp_0.12.14           pillar_1.1.0          
 [4] compiler_3.4.3         XVector_0.18.0         iterators_1.0.9       
 [7] bitops_1.0-6           tools_3.4.3            zlibbioc_1.24.0       
[10] digest_0.6.14          bit_1.1-12             memoise_1.1.0         
[13] tibble_1.4.1           annotate_1.56.1        RSQLite_2.0           
[16] lattice_0.20-35        pspline_1.0-18         rlang_0.1.6           
[19] foreach_1.4.4          Matrix_1.2-12          DBI_0.7               
[22] yaml_2.1.16            mvtnorm_1.0-6          GenomeInfoDbData_1.0.0
[25] copula_0.999-18        genefilter_1.60.0      glmnet_2.0-13         
[28] locfit_1.5-9.1         bit64_0.9-7            grid_3.4.3            
[31] ADGofTest_0.3          AnnotationDbi_1.40.0   survival_2.41-3       
[34] XML_3.98-1.9           BiocParallel_1.12.0    blob_1.1.0            
[37] codetools_0.2-15       splines_3.4.3          stabledist_0.7-1      
[40] softImpute_1.4         xtable_1.8-2           numDeriv_2016.8-1     
[43] gsl_1.9-10.3           RCurl_1.95-4.10

I've uploaded an image of the new dispersion estimate. The trend line doesn't seem to change, but the range of dispersion (Y-axis) seems to have halved.

drisso commented 6 years ago

Fitting the model to my data took quite a while: But computing the observational weights was instant:

Yes, that's the expected behavior. Are you using a single CPU or are you using parallel computations? That might speed up your code. Have a look at Section 7 of the vignette.