drisso / archive-zinbwave

Archived version of the zinbwave package. All future development will be done on https://github.com/drisso/zinbwave
6 stars 0 forks source link

Use just for normalisation? #34

Closed JohnReid closed 7 years ago

JohnReid commented 7 years ago

Thanks for a nice model with a good implementation. I'm curious if I can use the model just to normalise my single cell counts. I'm not interested in a factor analysis but would like to remove batch effects and library size effects from the counts before applying my own model. What would you recommend to do? I'm thinking just to fit the model and subtract the expected non-zero counts from the counts. This ignores the ZI probability though. I suppose it is difficult to separate the technical effects from the biological effects for the ZI probabilities.

jpvert commented 7 years ago

One possibility would be to run the model with your batch and library size effects captured by covariates, and with "enough" latent factors to infer W and (alpha_mu,alpha_pi). Then Walpha_mu is our best estimate of the log-expression level, and Walpha_pi is our best estimate of the logic-probability of dropout. I would not recommend directly fitting a mode without latent factors and then subtracting the estimated batch and library size effects from the raw counts, since our model really assumes a statistical distribution for observed counts which is not so simple. If the method you want to use afterwards require a matrix of data that "looks" like the original counts but without batch and library size effects, then maybe a possibility is to simulate counts using the ZINB distributions with parameters Walpha_mu and Walpha_pi, although this looks a bit artificial compared to directly work with the matrices Walpha_mu and Walpha_pi. Finally, regarding how many latent factors to use to estimate W and alpha, you can set it to a small number (say k=2) if you believe the real data have two degrees of freedom, or just set it to a larger value (say k=10 or 50) if not; note that we did not really test the later case, the math says that this should not be a problem, but I hope the numerical errors will not accumulate. Let us know!

drisso commented 7 years ago

Hi @JohnReid

please check out the latest version of the package (v0.99.5). We now provide the function zinbwave() which can be used, among other things, to compute normalized values (see example in the vignette).

Obviously, this is still under development, so any feedback is appreciated!

JohnReid commented 7 years ago

Hi @drisso and @jpvert, Thanks for your help with this. I tried jpvert's suggestion and to be honest didn't get spectacular results with my pseudotime estimation method using the W alpha_mu as estimates of the log expression. I've just tried drisso's suggestion of zinbwave(). I'm using the latest version on github and I have

> sce2 <- zinbwave(scezinb, X = "~ treatment + individual")
Error: BiocParallel errors
  element index: 1, 2, 3, 4, 5, 6, ...
  first error: initial value in 'vmmin' is not finite
In addition: There were 31 warnings (use warnings() to see them)
> dim(scezinb)
[1]  300 1403
> warnings()
Warning messages:
1: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
2: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
3: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
4: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
5: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
6: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
7: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
8: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
9: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
...

I might play around with it a bit more but I don't want to spend too much time if the optimisation isn't working.

drisso commented 7 years ago

Hi @JohnReid

Did zinbFit() work on the same example?

can you please rerun zinbwave in serial mode?

E.g.

sce2 <- zinbwave(scezinb, X = "~ treatment + individual", BPPARAM = BiocParallel::SerialParam())

I found that the error messages are more informative that way. It would also be helpful to know the version of the package that your using.

JohnReid commented 7 years ago

Thanks for the response @drisso, I've just got back from holiday and have re-run it in serial mode:

> sce2 <-
+   zinbwave(
+     scezinb,
+     X = "~ treatment + individual",
+     BPPARAM = BiocParallel::SerialParam())
Error in optim(fn = zinb.loglik.regression, gr = zinb.loglik.regression.gradient,  :
  initial value in 'vmmin' is not finite
In addition: There were 31 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
2: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
3: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
4: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
5: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
6: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
7: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
8: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
9: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
10: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
11: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
12: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
13: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
14: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
15: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
16: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
17: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
18: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
19: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
20: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
21: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
22: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
23: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
24: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
25: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
26: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
27: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
28: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
29: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
30: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
31: In optimize(f = zinb.loglik.dispersion, Y = Y, mu = mu,  ... :
  NA/Inf replaced by maximum positive value
>

and my session information is below. It doesn't look like the latest github version of zinbwave to me

> devtools::session_info()
Session info ----------------------------------------------------------------------------------------------------------
 setting  value
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, linux-gnu
 ui       X11
 language en_GB:en
 collate  en_GB.UTF-8
 tz       GB
 date     2017-08-07

Packages --------------------------------------------------------------------------------------------------------------
 package              * version  date       source
 ADGofTest              0.3      2011-12-28 CRAN (R 3.2.5)
 AnnotationDbi          1.38.2   2017-07-27 Bioconductor
 assertthat             0.2.0    2017-04-11 CRAN (R 3.3.3)
 backports              1.1.0    2017-05-22 CRAN (R 3.4.0)
 base                 * 3.4.1    2017-07-08 local
 bindr                  0.1      2016-11-13 CRAN (R 3.4.1)
 bindrcpp               0.2      2017-06-17 CRAN (R 3.4.0)
 Biobase              * 2.36.2   2017-06-20 Bioconductor
 BiocGenerics         * 0.22.0   2017-06-20 Bioconductor
 BiocParallel         * 1.10.1   2017-06-20 Bioconductor
 biomaRt              * 2.32.1   2017-06-20 Bioconductor
 bit                    1.1-12   2014-04-09 CRAN (R 3.4.0)
 bit64                  0.9-7    2017-05-08 CRAN (R 3.4.0)
 bitops                 1.0-6    2013-08-17 CRAN (R 3.2.2)
 blob                   1.1.0    2017-06-17 CRAN (R 3.4.0)
 broom                * 0.4.2    2017-02-13 CRAN (R 3.2.5)
 caTools                1.17.1   2014-09-10 CRAN (R 3.2.2)
 class                  7.3-14   2015-08-30 CRAN (R 3.4.0)
 cluster                2.0.6    2017-03-16 CRAN (R 3.4.0)
 coda                   0.19-1   2016-12-08 CRAN (R 3.4.0)
 codetools              0.2-15   2016-10-05 CRAN (R 3.3.1)
 colorout             * 1.1-1    2015-12-02 local
 colorspace             1.3-2    2016-12-14 CRAN (R 3.2.5)
 commonmark             1.2      2017-03-01 CRAN (R 3.4.0)
 compiler               3.4.1    2017-07-08 local
 copula                 0.999-17 2017-06-18 CRAN (R 3.4.0)
 crayon                 1.3.2    2016-06-28 CRAN (R 3.2.5)
 datasets             * 3.4.1    2017-07-08 local
 DBI                    0.7      2017-06-18 CRAN (R 3.4.0)
 DelayedArray         * 0.2.7    2017-06-20 Bioconductor
 DeLorean             * 1.2.5    <NA>       local
 dendextend             1.5.2    2017-03-28 CRAN (R 3.3.3)
 DEoptimR               1.0-8    2016-11-19 CRAN (R 3.3.3)
 devtools             * 1.13.2   2017-06-02 CRAN (R 3.4.0)
 digest                 0.6.12   2017-01-27 CRAN (R 3.2.5)
 diptest                0.75-7   2015-06-08 CRAN (R 3.2.5)
 dplyr                * 0.7.2    2017-07-20 CRAN (R 3.4.1)
 evaluate               0.10.1   2017-06-24 CRAN (R 3.4.1)
 fastICA                1.2-1    2017-06-12 CRAN (R 3.4.0)
 flexmix                2.3-14   2017-04-28 CRAN (R 3.4.0)
 foreach                1.4.3    2015-10-13 CRAN (R 3.2.3)
 foreign                0.8-69   2017-06-21 CRAN (R 3.4.1)
 fpc                    2.1-10   2015-08-14 CRAN (R 3.2.5)
 functional           * 0.6      2014-07-16 CRAN (R 3.2.2)
 gclus                  1.3.1    2012-06-25 CRAN (R 3.2.3)
 gdata                  2.18.0   2017-06-06 CRAN (R 3.4.0)
 GenomeInfoDb         * 1.12.2   2017-06-20 Bioconductor
 GenomeInfoDbData       0.99.0   2017-06-20 Bioconductor
 GenomicRanges        * 1.28.4   2017-07-27 Bioconductor
 ggjoy                * 0.2.0    2017-07-24 CRAN (R 3.4.1)
 ggplot2              * 2.2.1    2016-12-30 CRAN (R 3.2.5)
 ggthemes             * 3.4.0    2017-02-19 CRAN (R 3.4.0)
 git2r                  0.19.0   2017-07-19 cran (@0.19.0)
 glmnet                 2.0-10   2017-05-06 CRAN (R 3.4.0)
 glue                   1.1.1    2017-06-21 CRAN (R 3.4.1)
 gplots                 3.0.1    2016-03-30 CRAN (R 3.2.4)
 graphics             * 3.4.1    2017-07-08 local
 grDevices            * 3.4.1    2017-07-08 local
 grid                   3.4.1    2017-07-08 local
 gridExtra              2.2.1    2016-02-29 CRAN (R 3.2.5)
 gsl                    1.9-10.3 2017-01-05 CRAN (R 3.2.5)
 gtable                 0.2.0    2016-02-26 CRAN (R 3.2.4)
 gtools                 3.5.0    2015-05-29 CRAN (R 3.2.2)
 highr                  0.6      2016-05-09 CRAN (R 3.2.5)
 hms                    0.3      2016-11-22 CRAN (R 3.4.0)
 htmltools              0.3.6    2017-04-28 CRAN (R 3.4.0)
 inline                 0.3.14   2015-04-13 CRAN (R 3.2.2)
 IRanges              * 2.10.2   2017-06-20 Bioconductor
 iterators              1.0.8    2015-10-13 CRAN (R 3.2.3)
 kernlab                0.9-25   2016-10-03 CRAN (R 3.2.5)
 KernSmooth             2.23-15  2015-06-29 CRAN (R 3.4.0)
 knitr                  1.16     2017-05-18 CRAN (R 3.4.0)
 labeling               0.3      2014-08-23 CRAN (R 3.2.2)
 lattice                0.20-35  2017-03-25 CRAN (R 3.3.3)
 lazyeval               0.2.0    2016-06-12 CRAN (R 3.2.5)
 magrittr             * 1.5      2014-11-22 CRAN (R 3.2.2)
 MASS                   7.3-47   2017-04-21 CRAN (R 3.4.0)
 Matrix                 1.2-10   2017-04-28 CRAN (R 3.4.0)
 matrixStats          * 0.52.2   2017-04-14 CRAN (R 3.3.3)
 mclust                 5.3      2017-05-21 CRAN (R 3.4.0)
 memoise                1.1.0    2017-04-21 CRAN (R 3.4.0)
 methods              * 3.4.1    2017-07-08 local
 mnormt                 1.5-5    2016-10-15 CRAN (R 3.2.5)
 modeltools             0.2-21   2013-09-02 CRAN (R 3.2.5)
 munsell                0.4.3    2016-02-13 CRAN (R 3.2.4)
 mvtnorm                1.0-6    2017-03-02 CRAN (R 3.2.5)
 nlme                   3.1-131  2017-02-06 CRAN (R 3.4.0)
 nnet                   7.3-12   2016-02-02 CRAN (R 3.4.0)
 numDeriv               2016.8-1 2016-08-27 CRAN (R 3.2.5)
 nvimcom              * 0.9-34   2017-07-21 local
 parallel             * 3.4.1    2017-07-08 local
 pcaPP                  1.9-72   2017-06-27 CRAN (R 3.4.1)
 pkgconfig              2.0.1    2017-03-21 CRAN (R 3.4.0)
 plyr                   1.8.4    2016-06-08 CRAN (R 3.2.5)
 prabclus               2.2-6    2015-01-14 CRAN (R 3.2.5)
 pspline                1.0-18   2017-06-12 CRAN (R 3.4.0)
 psych                  1.7.5    2017-05-03 CRAN (R 3.4.0)
 R6                     2.2.2    2017-06-17 CRAN (R 3.4.0)
 RColorBrewer         * 1.1-2    2014-12-07 CRAN (R 3.2.4)
 Rcpp                 * 0.12.12  2017-07-15 CRAN (R 3.4.1)
 RCurl                  1.95-4.8 2016-03-01 CRAN (R 3.2.5)
 readr                * 1.1.1    2017-05-16 CRAN (R 3.4.0)
 registry               0.3      2015-07-08 CRAN (R 3.2.3)
 reshape2             * 1.4.2    2016-10-22 CRAN (R 3.2.5)
 rlang                  0.1.1    2017-05-18 CRAN (R 3.4.0)
 rmarkdown              1.6      2017-06-15 CRAN (R 3.4.0)
 robustbase             0.92-7   2016-12-09 CRAN (R 3.3.3)
 roxygen2               6.0.1    2017-02-06 CRAN (R 3.4.0)
 rprojroot              1.2      2017-01-16 CRAN (R 3.2.5)
 RSQLite                2.0      2017-06-19 CRAN (R 3.4.0)
 rstan                  2.16.2   2017-07-03 CRAN (R 3.4.1)
 rstudioapi             0.6      2016-06-27 CRAN (R 3.2.5)
 S4Vectors            * 0.14.3   2017-06-20 Bioconductor
 scales                 0.4.1    2016-11-09 CRAN (R 3.2.5)
 seriation              1.2-2    2017-05-09 CRAN (R 3.4.0)
 setwidth             * 1.0-4    2015-07-07 CRAN (R 3.2.2)
 SingleCellExperiment * 0.99.1   2017-07-23 Github (drisso/SingleCellExperiment@7ef7fb7)
 softImpute             1.4      2015-04-08 CRAN (R 3.2.5)
 stabledist             0.7-1    2016-09-12 CRAN (R 3.2.5)
 StanHeaders            2.16.0-1 2017-07-03 CRAN (R 3.4.1)
 stats                * 3.4.1    2017-07-08 local
 stats4               * 3.4.1    2017-07-08 local
 stringi                1.1.5    2017-04-07 CRAN (R 3.3.3)
 stringr              * 1.2.0    2017-02-18 CRAN (R 3.2.5)
 SummarizedExperiment * 1.6.3    2017-06-21 Bioconductor
 testthat               1.0.2    2016-04-23 CRAN (R 3.2.5)
 tibble                 1.3.3    2017-05-28 CRAN (R 3.4.0)
 tidyr                  0.6.3    2017-05-15 CRAN (R 3.4.0)
 tools                  3.4.1    2017-07-08 local
 trimcluster            0.1-2    2012-10-29 CRAN (R 3.2.5)
 tsne                 * 0.1-3    2016-07-15 CRAN (R 3.4.0)
 TSP                    1.1-5    2017-02-22 CRAN (R 3.4.0)
 utils                * 3.4.1    2017-07-08 local
 viridis                0.4.0    2017-03-27 CRAN (R 3.3.3)
 viridisLite            0.2.0    2017-03-24 CRAN (R 3.3.3)
 whisker                0.3-2    2013-04-28 CRAN (R 3.2.2)
 withr                  1.0.2    2016-06-20 CRAN (R 3.4.0)
 workflowr              0.7.0    2017-07-21 Github (jdblischak/workflowr@fc72b0d)
 XML                    3.98-1.9 2017-06-19 CRAN (R 3.4.0)
 xml2                   1.1.1    2017-01-24 CRAN (R 3.3.3)
 XVector                0.16.0   2017-06-20 Bioconductor
 yaml                   2.1.14   2016-11-12 CRAN (R 3.2.5)
 zinbwave             * 0.99.6   <NA>       Bioconductor
 zlibbioc               1.22.0   2017-06-20 Bioconductor
>
drisso commented 7 years ago

Thanks for running this.

Does the same error happen with zinbFit() or is it specific of the zinbwave() function? Would it be possible for you to provide a subset of the matrix that you're using as input that reproduces the same error for me to try out?

I'm opening a new issue more specific to this problem.

JohnReid commented 7 years ago

Sorry it wasn't quite clear to me from the vignette what the normalized values are. Are they the W alpha_mu or something else?

drisso commented 7 years ago

They are in an additional assay() slot, you can retrieve them with:

assay(x, "normalizedValues")

where x is the result of zinbwave()

JohnReid commented 7 years ago

@drisso thanks for the reply. I meant more what the normalized values are mathematically rather than how can I access them. Are they the W alpha_mu as @jpvert suggested above? If not are they documented anywhere? I had a quick look in the code but residual deviances are not something I've worked with much before.

drisso commented 7 years ago

Sorry, we should probably add more documentation on this. The deviance residuals are a fairly standard quantity in GLMs, and they can be interpreted as a generalization of the residuals that can be defined in linear models, see for instance this post for an explanation on deviance residuals in logistic regression.

In our case, we "leave out" W alpha when computing the residuals so that the normalized values are essentially the residuals of the model when only the covariates in X and V (i.e., the scaling factors and any batch correction) are removed from the data, but the signal of interest is still there.

JohnReid commented 7 years ago

@drisso Thanks for all your help, I really appreciate it. I'll dig a bit deeper into the deviance residuals using that post. So just to be absolutely clear, you recommend fitting a model using zinbwave with a large enough K for a good fit. That will give me the normalised values (which will ignore W).

drisso commented 7 years ago

Because the residuals are computed ignoring W, if you're interested solely on the normalized values you can even fit zinbwave with K=0. That would be much faster than fitting the model with a large value of W. But of course, the advantage of having a non zero K is that you get the low-rank representation of the data, W, along with the normalized values.