Oshlack / splatter

Simple simulation of single-cell RNA sequencing data
http://oshlacklab.com/splatter/
GNU General Public License v3.0
217 stars 57 forks source link

Question about estimating param and library sizes #38

Closed Vivianstats closed 6 years ago

Vivianstats commented 6 years ago

Hello,

Thanks for the useful tool.

I would like to simulate a count matrix by estimating all possible parameters from a real dataset. But when I try

splatEstimate(round(realcount))

I got the following error message:

<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, gr = gradient, ddistnam = ddistname, hessian = TRUE, method = meth, lower = lower, upper = upper, ...): non-finite finite-difference value [2]> Error in fitdistrplus::fitdist(lib.sizes, "lnorm") : the function mle failed to estimate the parameters, with the error code 100

I was able to continue by just using the first 13,000 rows, but the library sizes differ a lot from the real data

params = newSplatParams()
params = setParam(params, "nGenes", nrow(realcount))
params = setParam(params, "groupCells", ncol(realcount))
splat_params = splatEstimate(round(realcount)[1:13000,], params)
params = setParam(params, "nGenes", nrow(realcount))
splatcount = splatSimulate(params, dropout.present = TRUE)
> summary(colSums(counts(splatcount)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  27926   46947   54304   56629   63769   99804 
> summary(colSums(realcount))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 997978  998964  999234  999143  999396  999862 

Can you provide some suggestions on this application? Thanks!

Here is the real datset I used: count.rds.zip

lazappi commented 6 years ago

HI @Vivianstats

Thanks for giving Splatter a go, hopefully we can get this issue sorted out for you.

Thanks for providing your dataset, that is really helpful. It looks like the library sizes here aren't log-normal distributed, which would explain why the fitting is failing. You can check this by running:

lib.sizes <- colSums(counts)
plot(density(lib.sizes))

image

Can you tell me anything more about the dataset (capture protocol, processing etc.)? This might help us work out if it is something specific to your data or if I need to adjust Splatter to account for this.

Vivianstats commented 6 years ago

Thanks for your quick reply!

The data was based on Smart-seq2. Hope this information helps.

As a follow-up question, what type of data (protocols) are better simulated by the current splat method ?

lazappi commented 6 years ago

Splatter was mainly developed using UMI datasets so we have a bit of a bias towards those.

I think I can probably adjust the Splat model so that it can choose between log-normal and normal distributions for the library size, I think in your case normal would be a better fit. I am on leave the next week so I'm not sure when I will have time to implement it but should be before the next Bioconductor release in May.

In the meantime have you tried any other the other simulations in Splatter? One of the others might work a bit better for your dataset if it has the features you need.

Sorry I can't be more helpful right now but hopefully I'll have time to look into it a bit more soon.

Vivianstats commented 6 years ago

Hello,

Thanks for your reply. I'm wondering among the other simulation methods in Splatter, which can be completed by just supplying a real dataset for estimation of parameters, and no parameters need to be manually set?

I checked scDDEstimate, but it looks like it needs gene numbers for all types of genes such as DE, DP, etc. This information is not easy to obtain for a real dataset.

I would like to do something like the one in your paper: based on a real dataset, I simulate data using multiple methods with all parameters estimated from the real data. How can I do that? I may have missed it, but do you have example codes demonstrating the simulations in your paper?

Thanks!

lazappi commented 6 years ago

All of the simulations in Splatter let you estimate parameters from a real dataset (except for the Lun simulation). Some of them require a bit of extra information about the dataset like groups or batches that are present. The code used in the paper is here https://github.com/Oshlack/splatter-paper but it was done using an older version of Splatter so some of it will be out of date. You could have a look at the help pages for the various estimate functions to see what they require (such as ?lun2Estimate). Try running listSims() to see a bit about all the different simulations, there are a few more since the paper.

lazappi commented 6 years ago

Hi @Vivianstats. I've made a new branch here https://github.com/Oshlack/splatter/tree/libsizes which allows you to use a normal distribution instead of a log-normal for library sizes (and estimates the parameters). I've done a (very quick) test using your data and it seems to work ok.

It will probably make it into the next Bioconductor release but if you want to try it out sooner you can install this version using:

devtools::install_github("Oshlack/splatter@libsizes")
Vivianstats commented 6 years ago

Thank you very much for spending time to add the features!

I am now able to get results from splatEstimate() and they look pretty good. However, I was using scDDEstimate() last week and was able to obtain results without any issues. After I reinstall the package, the following code would lead to errors (realcount is a matrix of counts):

params = splatter::scDDEstimate(realcount, conditions = sample(1:2, ncol(realcount), replace = TRUE))

Error in validObject(.Object) : 
  invalid class “SCDDParams” object: SCDat: Must have class 'SingleCellExperiment', but has class 'SummarizedExperiment'

I don't have any experience with 'SingleCellExperiment', and I thought my codes should invoke scDDEstimate.matrix, not scDDEstimate.SingleCellExperiment. Could you please give me some help on this issue?

Thanks a lot!

lazappi commented 6 years ago

That looks like a problem with the Splatter version and maybe some other packages. Can you run the code that caused the error again and then paste the output of sessionInfo() here?

Do you think the changes I made are useful? If so I will look at merging them into the development version.

Vivianstats commented 6 years ago

Yes, I do think this version is very helpful!

I have attached the output of session_info() here:

> session_info()
Session info ----------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.2 (2017-09-28)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.442)           
 language en_US                       
 collate  en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2018-04-16                  

Packages --------------------------------------------------------------------------------------
 package              * version    date       source                           
 abind                  1.4-5      2016-07-21 CRAN (R 3.4.2)                   
 AnnotationDbi          1.38.2     2017-10-23 Bioconductor                     
 arm                    1.10-1     2018-04-13 CRAN (R 3.4.2)                   
 assertthat             0.2.0      2017-04-11 CRAN (R 3.4.2)                   
 backports              1.1.2      2017-12-13 CRAN (R 3.4.2)                   
 base                 * 3.4.2      2017-09-29 local                            
 beeswarm               0.2.3      2016-04-25 CRAN (R 3.4.2)                   
 bindr                  0.1.1      2018-03-13 CRAN (R 3.4.2)                   
 bindrcpp               0.2.2      2018-03-29 CRAN (R 3.4.2)                   
 Biobase              * 2.36.2     2017-10-10 Bioconductor                     
 BiocGenerics         * 0.22.1     2017-10-10 Bioconductor                     
 BiocParallel           1.10.1     2017-10-10 Bioconductor                     
 biomaRt                2.32.1     2017-10-23 Bioconductor                     
 bit                    1.1-12     2014-04-09 CRAN (R 3.4.2)                   
 bit64                  0.9-7      2017-05-08 CRAN (R 3.4.2)                   
 bitops                 1.0-6      2013-08-17 CRAN (R 3.4.2)                   
 blob                   1.1.1      2018-03-25 CRAN (R 3.4.2)                   
 blockmodeling          0.3.0      2018-04-11 CRAN (R 3.4.2)                   
 caTools                1.17.1     2014-09-10 CRAN (R 3.4.2)                   
 checkmate              1.8.5      2017-10-24 CRAN (R 3.4.2)                   
 coda                   0.19-1     2016-12-08 CRAN (R 3.4.2)                   
 codetools              0.2-15     2016-10-05 CRAN (R 3.3.1)                   
 colorspace             1.3-2      2016-12-14 CRAN (R 3.4.2)                   
 compiler               3.4.2      2017-09-29 local                            
 data.table             1.10.4-3   2017-10-27 CRAN (R 3.4.2)                   
 datasets             * 3.4.2      2017-09-29 local                            
 DBI                    0.8        2018-03-02 CRAN (R 3.4.2)                   
 DelayedArray         * 0.2.7      2017-10-10 Bioconductor                     
 devtools             * 1.13.5     2018-02-18 CRAN (R 3.4.2)                   
 digest                 0.6.15     2018-01-28 CRAN (R 3.4.2)                   
 doParallel             1.0.11     2017-09-28 CRAN (R 3.4.2)                   
 doRNG                  1.6.6      2017-04-10 CRAN (R 3.4.2)                   
 dotCall64              0.9-5.2    2018-01-11 CRAN (R 3.4.2)                   
 dplyr                  0.7.4      2017-09-28 CRAN (R 3.4.2)                   
 DT                     0.4        2018-01-30 CRAN (R 3.4.2)                   
 dynamicTreeCut         1.63-1     2016-03-11 CRAN (R 3.4.2)                   
 EBSeq                  1.16.0     2018-04-03 Bioconductor                     
 edgeR                  3.20.9     2018-04-15 Bioconductor                     
 fields                 9.6        2018-01-29 CRAN (R 3.4.2)                   
 FNN                    1.1        2013-07-31 CRAN (R 3.4.2)                   
 foreach                1.4.4      2017-12-12 CRAN (R 3.4.2)                   
 gdata                  2.18.0     2017-06-06 CRAN (R 3.4.2)                   
 GenomeInfoDb         * 1.12.3     2017-10-10 Bioconductor                     
 GenomeInfoDbData       0.99.0     2017-10-10 Bioconductor                     
 GenomicRanges        * 1.28.6     2017-10-10 Bioconductor                     
 ggbeeswarm             0.6.0      2017-08-07 CRAN (R 3.4.2)                   
 ggplot2              * 2.2.1      2016-12-30 CRAN (R 3.4.2)                   
 ggsci                * 2.8        2017-09-30 CRAN (R 3.4.2)                   
 glue                   1.2.0      2017-10-29 cran (@1.2.0)                    
 gplots                 3.0.1      2016-03-30 CRAN (R 3.4.2)                   
 graphics             * 3.4.2      2017-09-29 local                            
 grDevices            * 3.4.2      2017-09-29 local                            
 grid                   3.4.2      2017-09-29 local                            
 gridExtra              2.3        2017-09-09 CRAN (R 3.4.2)                   
 gtable                 0.2.0      2016-02-26 CRAN (R 3.4.2)                   
 gtools                 3.5.0      2015-05-29 CRAN (R 3.4.2)                   
 htmltools              0.3.6      2017-04-28 CRAN (R 3.4.2)                   
 htmlwidgets            1.0        2018-01-20 CRAN (R 3.4.2)                   
 httpuv                 1.3.6.2    2018-03-02 CRAN (R 3.4.2)                   
 igraph                 1.2.1      2018-03-10 cran (@1.2.1)                    
 IRanges              * 2.10.5     2017-10-10 Bioconductor                     
 iterators              1.0.9      2017-12-12 CRAN (R 3.4.2)                   
 KernSmooth             2.23-15    2015-06-29 CRAN (R 3.4.0)                   
 lattice                0.20-35    2017-03-25 CRAN (R 3.3.3)                   
 lazyeval               0.2.1      2017-10-29 CRAN (R 3.4.2)                   
 limma                  3.34.9     2018-04-15 Bioconductor                     
 lme4                   1.1-17     2018-04-03 CRAN (R 3.4.2)                   
 locfit                 1.5-9.1    2013-04-20 CRAN (R 3.4.2)                   
 magrittr               1.5        2014-11-22 CRAN (R 3.4.2)                   
 maps                   3.3.0      2018-04-03 CRAN (R 3.4.2)                   
 MASS                 * 7.3-49     2018-02-23 CRAN (R 3.4.2)                   
 Matrix                 1.2-14     2018-04-09 CRAN (R 3.4.2)                   
 matrixStats          * 0.53.1     2018-02-11 CRAN (R 3.4.2)                   
 mclust                 5.4        2017-11-22 CRAN (R 3.4.2)                   
 memoise                1.1.0      2017-04-21 CRAN (R 3.4.2)                   
 methods              * 3.4.2      2017-09-29 local                            
 mime                   0.5        2016-07-07 CRAN (R 3.4.2)                   
 minqa                  1.2.4      2014-10-09 CRAN (R 3.4.2)                   
 munsell                0.4.3      2016-02-13 CRAN (R 3.4.2)                   
 nlme                   3.1-131    2017-02-06 CRAN (R 3.4.0)                   
 nloptr                 1.0.4      2017-08-22 CRAN (R 3.4.2)                   
 outliers               0.14       2011-01-24 CRAN (R 3.4.2)                   
 parallel             * 3.4.2      2017-09-29 local                            
 pillar                 1.2.1      2018-02-27 CRAN (R 3.4.2)                   
 pkgconfig              2.0.1      2017-03-21 CRAN (R 3.4.2)                   
 pkgmaker               0.22       2014-05-14 CRAN (R 3.4.2)                   
 plyr                   1.8.4      2016-06-08 CRAN (R 3.4.2)                   
 R6                     2.2.2      2017-06-17 CRAN (R 3.4.2)                   
 Rcpp                   0.12.16    2018-03-13 CRAN (R 3.4.2)                   
 RCurl                  1.95-4.10  2018-01-04 CRAN (R 3.4.2)                   
 registry               0.5        2017-12-03 CRAN (R 3.4.2)                   
 reshape2               1.4.3      2017-12-11 CRAN (R 3.4.2)                   
 rhdf5                  2.20.0     2018-03-24 Bioconductor                     
 rjson                  0.2.15     2014-11-03 CRAN (R 3.4.2)                   
 rlang                  0.2.0      2018-02-20 CRAN (R 3.4.2)                   
 rngtools               1.2.4      2014-03-06 CRAN (R 3.4.2)                   
 RSQLite                2.1.0      2018-03-29 CRAN (R 3.4.2)                   
 rstudioapi             0.7        2017-09-07 CRAN (R 3.4.2)                   
 S4Vectors            * 0.14.7     2017-10-10 Bioconductor                     
 scales                 0.5.0      2017-08-24 CRAN (R 3.4.2)                   
 scater                 1.4.0      2018-03-24 Bioconductor                     
 scDD                 * 1.0.0      2018-04-03 Bioconductor                     
 scran                  1.4.5      2018-04-03 Bioconductor                     
 shiny                  1.0.5      2017-08-23 CRAN (R 3.4.2)                   
 shinydashboard         0.7.0      2018-03-21 CRAN (R 3.4.2)                   
 SingleCellExperiment * 1.0.0      2018-04-13 Bioconductor                     
 spam                   2.1-4      2018-04-12 CRAN (R 3.4.2)                   
 splatter             * 1.3.3.9010 2018-04-13 Github (Oshlack/splatter@5380fa9)
 splines                3.4.2      2017-09-29 local                            
 statmod                1.4.30     2017-06-18 CRAN (R 3.4.2)                   
 stats                * 3.4.2      2017-09-29 local                            
 stats4               * 3.4.2      2017-09-29 local                            
 stringi                1.1.7      2018-03-12 CRAN (R 3.4.2)                   
 stringr                1.3.0      2018-02-19 CRAN (R 3.4.2)                   
 SummarizedExperiment * 1.6.5      2017-10-10 Bioconductor                     
 testthat               2.0.0      2017-12-13 CRAN (R 3.4.2)                   
 tibble                 1.4.2      2018-01-22 cran (@1.4.2)                    
 tools                  3.4.2      2017-09-29 local                            
 tximport               1.4.0      2018-03-24 Bioconductor                     
 utils                * 3.4.2      2017-09-29 local                            
 vipor                  0.4.5      2017-03-22 CRAN (R 3.4.2)                   
 viridis                0.5.1      2018-03-29 CRAN (R 3.4.2)                   
 viridisLite            0.3.0      2018-02-01 CRAN (R 3.4.2)                   
 withr                  2.1.2      2018-03-15 CRAN (R 3.4.2)                   
 XML                    3.98-1.9   2017-06-19 CRAN (R 3.4.0)                   
 xtable                 1.8-2      2016-02-05 CRAN (R 3.4.2)                   
 XVector                0.16.0     2017-10-10 Bioconductor                     
 yaml                   2.1.18     2018-03-08 CRAN (R 3.4.2)                   
 zlibbioc               1.22.0     2017-10-10 Bioconductor                     
 zoo                    1.8-1      2018-01-08 CRAN (R 3.4.2)  
lazappi commented 6 years ago

Cool, I will look at merging the changes I made into devel then.

It looks like you have an old version of scDD, the current release version is 1.2.0 and the devel version is 1.3.5. I would try updating that to fix the other issue.

If you run biocLite() without any arguments it should try and update any packages that are out of date.

lazappi commented 6 years ago

The development version including this change is now available on Bioconductor https://bioconductor.org/packages/devel/bioc/html/splatter.html. I'm going to close this issue now but if you have any other question/issues feel free to open a new one or comment on the Bioconductor support forum https://support.bioconductor.org/.