Hello, I am working with a technology similar to scRNA-seq which produces a lot of drop-outs and results in over-estimated dispersions. I would like to use the zinbFit function to model the dispersion correctly (as shown in Figure 1 of your paper). My dataset is rather large (700,000 features by 6 samples), and the function takes hours to finish. I was wondering whether there is a way to speed up the model-building step. In a separate issue you spoke about fitting on a subset of the data and then extrapolating the results to the full dataset; is it possible to do that for the model-building step? Many thanks, James.
Hi @jma1991
What do you mean exactly by building the model on a subset of data?
What we typically do with scRNA-seq data is focus on the 1,000 or so most variable genes and infer W from those, so perhaps that would be a good solution for your data too. That is, reduce the number of features with some reasonable feature-selection step before fitting the model.
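If it helps, here is a minimal sketch of that feature-selection step. It assumes the data live in a SummarizedExperiment called `rawData` with a design matrix `designMatrix` (names borrowed from later in this thread); the log1p-variance ranking is just one reasonable choice, not something prescribed here.

```r
# A minimal sketch of the feature-selection step described above; the object
# names and the log-variance criterion are illustrative assumptions.
library(zinbwave)
library(matrixStats)

counts <- assay(rawData)

# Rank genes by the variance of their log-transformed counts, keep the top 1,000.
vars <- rowVars(log1p(counts))
keep <- order(vars, decreasing = TRUE)[seq_len(1000)]

# Fit the ZINB model on the reduced matrix only.
zinbModel <- zinbFit(counts[keep, ], X = designMatrix, K = 0)
```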
Hi @drisso
I've attached an image of the dispersion estimates from edgeR for my data. The red line is the trended dispersion. To my eye it over-estimates the dispersion, particularly at the higher end of the AveLogCPM range. I was hoping to use zinbwave to fit a more appropriate trend. I could just filter the data more aggressively, but I find that a lot of true positives at the lower end of the AveLogCPM range are filtered out.
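For reference, a plot of this kind can be produced with edgeR roughly as below; this is a sketch assuming a DGEList `y` and the design matrix `designMatrix` (both hypothetical names), not the exact code behind the attached figure.

```r
# Sketch of a dispersion-trend plot with edgeR. plotBCV() shows the
# biological coefficient of variation (the square root of the dispersion)
# against average logCPM, with the fitted trend line included.
library(edgeR)

y <- estimateDisp(y, designMatrix)
plotBCV(y)
```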
What is the goal of your analysis? Is it differential expression?
If so, you may want to have a look at our latest preprint on how to compute weights with zinbwave that can be used with edgeR to get a DE model that accounts for zero inflation.
Preprint: https://www.biorxiv.org/content/early/2018/01/18/250126
Code: https://github.com/statOmics/zinbwaveZinger
If your goal is DE, you can simply specify K=0 (the default), which should be fairly fast even with all the features.
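A minimal sketch of that workflow is below, assuming a SummarizedExperiment `rawData` and design matrix `designMatrix` (names taken from later in this thread); see the zinbwave vignette for the authoritative version.

```r
# Sketch of the zinbwave + edgeR weighted DE workflow outlined above.
library(zinbwave)
library(edgeR)

# Fit the ZINB model with K = 0 and derive observational weights.
zinbModel <- zinbFit(assay(rawData), X = designMatrix, K = 0)
weights <- computeObservationalWeights(zinbModel, assay(rawData))

# Standard edgeR pipeline, with the zero-inflation weights attached.
dge <- DGEList(assay(rawData))
dge <- calcNormFactors(dge)
dge$weights <- weights
dge <- estimateDisp(dge, designMatrix)
fit <- glmFit(dge, designMatrix)

# glmWeightedF() is zinbwave's weight-aware analogue of edgeR's glmLRT().
res <- glmWeightedF(fit, coef = 2)
topTags(res)
```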
My goal is differential expression. Fitting the model to my data took quite a while:

```r
> ptm <- proc.time()
> zinbModel <- zinbFit(assay(rawData), X = designMatrix, K = 0)
> proc.time() - ptm
    user   system  elapsed
 182.403   43.476 4559.516
```
But computing the observational weights was instant:

```r
> ptm <- proc.time()
> zinbWeights <- computeObservationalWeights(zinbModel, assay(rawData))
> proc.time() - ptm
   user  system elapsed
  1.631   0.159   1.795
```
I've included my session info, in case you believe the timings are for some reason longer than you'd expect:
```
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] zinbwave_1.0.0 SingleCellExperiment_1.0.0
[3] SummarizedExperiment_1.8.1 DelayedArray_0.4.1
[5] matrixStats_0.52.2 Biobase_2.38.0
[7] GenomicRanges_1.30.1 GenomeInfoDb_1.14.0
[9] IRanges_2.12.0 S4Vectors_0.16.0
[11] BiocGenerics_0.24.0 edgeR_3.20.7
[13] limma_3.34.5
loaded via a namespace (and not attached):
[1] pcaPP_1.9-73 Rcpp_0.12.14 pillar_1.1.0
[4] compiler_3.4.3 XVector_0.18.0 iterators_1.0.9
[7] bitops_1.0-6 tools_3.4.3 zlibbioc_1.24.0
[10] digest_0.6.14 bit_1.1-12 memoise_1.1.0
[13] tibble_1.4.1 annotate_1.56.1 RSQLite_2.0
[16] lattice_0.20-35 pspline_1.0-18 rlang_0.1.6
[19] foreach_1.4.4 Matrix_1.2-12 DBI_0.7
[22] yaml_2.1.16 mvtnorm_1.0-6 GenomeInfoDbData_1.0.0
[25] copula_0.999-18 genefilter_1.60.0 glmnet_2.0-13
[28] locfit_1.5-9.1 bit64_0.9-7 grid_3.4.3
[31] ADGofTest_0.3 AnnotationDbi_1.40.0 survival_2.41-3
[34] XML_3.98-1.9 BiocParallel_1.12.0 blob_1.1.0
[37] codetools_0.2-15 splines_3.4.3 stabledist_0.7-1
[40] softImpute_1.4 xtable_1.8-2 numDeriv_2016.8-1
[43] gsl_1.9-10.3 RCurl_1.95-4.10
```
I've uploaded an image of the new dispersion estimates. The trend line doesn't seem to change, but the range of the dispersions (y-axis) seems to have halved.
> Fitting the model to my data took quite a while: [...] But computing the observational weights was instant:
Yes, that's the expected behavior. Are you using a single CPU, or parallel computation? Parallelization might speed up your code; have a look at Section 7 of the vignette.
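For reference, a minimal sketch of running the fit in parallel via BiocParallel is below; `zinbFit()` takes a `BPPARAM` argument, and the four workers are an assumption, so adjust to your machine and see the vignette for the authoritative setup.

```r
# Sketch of parallelising the fit with BiocParallel, assuming 4 cores.
library(zinbwave)
library(BiocParallel)

zinbModel <- zinbFit(assay(rawData), X = designMatrix, K = 0,
                     BPPARAM = MulticoreParam(workers = 4))
```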