catavallejos / BASiCS

BASiCS: Bayesian Analysis of Single-Cell Sequencing Data. This is an unstable experimental version. Please see http://bioconductor.org/packages/BASiCS/ for the official release version
83 stars 16 forks source link

Theta doesn't converge #235

Open binyaminZ opened 2 years ago

binyaminZ commented 2 years ago

Hi! I'm running BASiCS with a large dataset, with 20-30K cells per sample. For noise quantification, I down-sampled it to ~5K cells. This data has no spike-ins. If I understand correctly, batch information is somehow used instead of spike-ins. The problem is that in my MCMC chains all parameters converge nicely (with N = 20000, Thin = 20, Burn = 10000) except for Theta, which obviously doesn't and is approaching Zero. I don't care so much about this, it seems like my replicates are very similar and basically there is no batch contribution to mean or noise. May this have some negative impact on the estimates of the other parameters? Or can I ignore this?

Finally, I would like to run BASiCS on the complete data set. Problem is that the down-sampled dataset runs for ~70 hours, and doing the same for the complete dataset is very long. What may help? Decreasing the number of genes quantified? Splitting the two batches into smaller groups (probably not optimal for many genes with sparse expression)?

Thanks!

alanocallaghan commented 2 years ago

I'm not sure what's going on with theta in your example. I'll try to get back to you on that, but in the meantime here's an example of some example code that will run MCMC across different subsets of genes and combine the results - it should be a good deal faster

library("BASiCS")
library("BiocParallel")
## register the number of cores here as normal with biocparallel
## you cant use multicoreparam currently with BASiCS
register(SnowParam(workers = 2))

## make some mock data
sce <- BASiCS_MockSCE(NGenes = 4000, NCells = 100)

## fit a model normally
fit_joint <- BASiCS_MCMC(sce,
    ## N,Thin,Burn here are far too low, as this is just for a demonstration
    N = 2000, Thin = 10, Burn = 1000, Regression = TRUE, WithSpikes = TRUE
)

## fit a model across 4 cores, each with a subset of the genes
fit_split <- BASiCS_MCMC(sce,
    ## N,Thin,Burn here are far too low, as this is just for a demonstration
    N = 2000, Thin = 10, Burn = 1000, Regression = TRUE, WithSpikes = TRUE,
    ## split the data into batches with 1000 genes each
    SubsetBy = "gene", NSubsets = 4
)
## compare
BASiCS_ShowFit(fit_joint)
BASiCS_ShowFit(fit_split)
binyaminZ commented 2 years ago

See here convergence plots: Mu- image Delta- image s- image nu- image Theta- image

catavallejos commented 2 years ago

Hi @binyaminZ. You mention that you have 20-30K cells per sample. Is there a strong sub-structure in the data (i.e. do you have very distinct sub-types within those cells). BASiCS is designed for supervised analysis in which you have sorted distinct populations a priori and therefore strong substructure will break some of the model assumptions. If that's the case, perhaps you can run BASiCS separately for each distinct cell type? That will also speed up things a lot as you can run them in parallel.

binyaminZ commented 2 years ago

Hi @catavallejos, Thanks for your response! No, this is a very uniform population of Jurkat cells, and we are studying the effect of a certain treatment on noise. Dimensionality reductions (both UMAP and tSNE) show very nicely that there is no structure, just one population after some very basic filtering.

binyaminZ commented 2 years ago

Dear @Alanocallaghan, trying to apply the parallelization code you suggested, I have a few questions: a. if you register 2 workers but have 4 subsets, it means each core will run 2 consecutive jobs. does this still provide an advantage over subsetting into 2 subsets in terms of time? b. don't I need to supply Threads = ncors parameter for BASiCS_MCMC? how does it evaluate the registered cores (by using detectCores() or availableCores() or something else directly seeing what I registered)? c. what would be the minimal size you suggest for each subset? can I go down to 500, or 1000 is better? I probably get worse per-cell parameters as I reduce the gene batch size, not sure to what extent BASiCS_MCMC can integrate all the subsets after running in parallel.

Thanks! Binyamin

alanocallaghan commented 2 years ago

a. if you register 2 workers but have 4 subsets, it means each core will run 2 consecutive jobs. does this still provide an advantage over subsetting into 2 subsets in terms of time?

That's right, and no it probably won't provide any benefit over just doing 2 subsets. That's because you'll still just have two cores running sequentially through basically the same amount of work.

b. don't I need to supply Threads = ncors parameter for BASiCS_MCMC? how does it evaluate the registered cores (by using detectCores() or availableCores() or something else directly seeing what I registered)?

I don't know how well Threads will work with the divide and conquer approach above, honestly to be safe I'd just set Threads = 1 and use the workers for parallelisation.

c. what would be the minimal size you suggest for each subset? can I go down to 500, or 1000 is better? I probably get worse per-cell parameters as I reduce the gene batch size, not sure to what extent BASiCS_MCMC can integrate all the subsets after running in parallel.

The main issue I saw when prototyping and testing this stuff was that estimating the mean vs overdispersion trend becomes very unstable with a small amount of genes. The code tries to divide into batches semi-intelligently, but even so if you have only a few hundred genes the trend is very noisy. Based on what I saw in testing, 500 genes per subset is about the minimum, but if you can run with 1000, then that should be a bit more reliable.

binyaminZ commented 2 years ago

Thanks!! I get it better now. (I have no experience with parallel processing in R, sorry for the confusion) another thing that seems like a bug - when I specify the subsets for parallelization, I get the following error message:

basics_results = BASiCS_MCMC(data, N = 200, Thin = 20, Burn = 100, WithSpikes = FALSE, 
                             Regression = F, PrintProgress = FALSE,
                             StoreChains = TRUE, RunName = samp,
                             SubsetBy = "gene", NSubsets = ncors)
Error: BiocParallel errors
  8 remote errors, element index: 1, 2, 3, 4, 5, 6, ...
  0 unevaluated and other errors
  first remote error: object 'samp' not found

samp is a variable with my sample name, and it's present in my environment, but somehow it's not passed to the workers. when I use a quoted expression (like "my_sample"), the process is successful with no errors.

alanocallaghan commented 2 years ago

That's interesting! I don't know for sure what's going on there as different BiocParallel backends can be a bit odd. Can you post the rest of your code (just the stuff setting up BiocParallel and BASiCS)?

binyaminZ commented 2 years ago
library(BASiCS)
library(Seurat)
library(data.table)
library("BiocParallel")
library(parallelly)

ncors = availableCores()
register(SnowParam(workers = ncors))

my_strsplit = function(x, sep, n)
{
  x = unlist(lapply(strsplit(x, sep), "[", n))
  return(x)
}

samp = commandArgs(trailingOnly=TRUE)

data = fread(paste0(samp, "_filtered_counts2.csv"),data.table = F, select = 1:150)
rownames(data) = data$V1
data = data[,-1]
data = BASiCS_Filter(as.matrix(data),MinAvCountsPerCellsWithExpression = 1.5)$Counts

repN = sub("rep", "", my_strsplit(colnames(data),"_",3))
data = SingleCellExperiment(assays = list(counts = data), 
                            colData = data.frame(BatchInfo = repN))
samp = paste0(samp, "_test")
basics_results = BASiCS_MCMC(data, N = 200, Thin = 20, Burn = 100, WithSpikes = FALSE, 
                             Regression = F, PrintProgress = FALSE,
                             StoreChains = TRUE, RunName = samp,
                             SubsetBy = "gene", NSubsets = ncors)
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] parallelly_1.26.1           BiocParallel_1.28.3         data.table_1.14.0          
 [4] SeuratObject_4.0.2          Seurat_4.0.3                BASiCS_2.7.5               
 [7] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0 Biobase_2.54.0             
[10] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0             
[13] S4Vectors_0.32.3            BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
[16] matrixStats_0.61.0         

loaded via a namespace (and not attached):
  [1] snow_0.4-4                plyr_1.8.6                igraph_1.2.11            
  [4] lazyeval_0.2.2            splines_4.1.0             listenv_0.8.0            
  [7] scattermore_0.7           ggplot2_3.3.5             digest_0.6.29            
 [10] htmltools_0.5.2           viridis_0.6.2             fansi_1.0.2              
 [13] magrittr_2.0.1            ScaledMatrix_1.2.0        tensor_1.5               
 [16] cluster_2.1.2             ROCR_1.0-11               limma_3.50.1             
 [19] globals_0.14.0            spatstat.sparse_2.0-0     colorspace_2.0-3         
 [22] ggrepel_0.9.1             dplyr_1.0.7               crayon_1.5.0             
 [25] RCurl_1.98-1.6            jsonlite_1.8.0            hexbin_1.28.2            
 [28] spatstat.data_2.1-0       survival_3.2-11           zoo_1.8-9                
 [31] glue_1.4.2                polyclip_1.10-0           gtable_0.3.0             
 [34] zlibbioc_1.40.0           XVector_0.34.0            leiden_0.3.8             
 [37] DelayedArray_0.20.0       BiocSingular_1.10.0       future.apply_1.7.0       
 [40] abind_1.4-5               scales_1.1.1              DBI_1.1.1                
 [43] edgeR_3.36.0              miniUI_0.1.1.1            Rcpp_1.0.8               
 [46] viridisLite_0.4.0         xtable_1.8-4              reticulate_1.20          
 [49] spatstat.core_2.2-0       dqrng_0.3.0               rsvd_1.0.5               
 [52] metapod_1.2.0             htmlwidgets_1.5.4         httr_1.4.2               
 [55] RColorBrewer_1.1-2        ellipsis_0.3.2            ica_1.0-2                
 [58] pkgconfig_2.0.3           scuttle_1.4.0             uwot_0.1.10              
 [61] deldir_0.2-10             locfit_1.5-9.4            utf8_1.2.2               
 [64] tidyselect_1.1.1          rlang_0.4.11              reshape2_1.4.4           
 [67] later_1.3.0               munsell_0.5.0             tools_4.1.0              
 [70] generics_0.1.0            ggridges_0.5.3            stringr_1.4.0            
 [73] fastmap_1.1.0             goftest_1.2-2             fitdistrplus_1.1-5       
 [76] purrr_0.3.4               RANN_2.6.1                pbapply_1.4-3            
 [79] future_1.21.0             nlme_3.1-152              sparseMatrixStats_1.6.0  
 [82] mime_0.12                 scran_1.22.1              ggExtra_0.9              
 [85] compiler_4.1.0            plotly_4.9.4.1            png_0.1-7                
 [88] spatstat.utils_2.2-0      tibble_3.1.6              statmod_1.4.36           
 [91] stringi_1.7.6             lattice_0.20-44           bluster_1.4.0            
 [94] Matrix_1.3-3              vctrs_0.3.8               pillar_1.7.0             
 [97] lifecycle_1.0.1           spatstat.geom_2.2-0       lmtest_0.9-38            
[100] RcppAnnoy_0.0.18          BiocNeighbors_1.12.0      cowplot_1.1.1            
[103] bitops_1.0-7              irlba_2.3.5               httpuv_1.6.5             
[106] patchwork_1.1.1           R6_2.5.1                  promises_1.2.0.1         
[109] KernSmooth_2.23-20        gridExtra_2.3             codetools_0.2-18         
[112] MASS_7.3-54               assertthat_0.2.1          sctransform_0.3.2        
[115] GenomeInfoDbData_1.2.7    mgcv_1.8-36               parallel_4.1.0           
[118] grid_4.1.0                rpart_4.1-15              beachmat_2.10.0          
[121] tidyr_1.1.3               coda_0.19-4               DelayedMatrixStats_1.16.0
[124] Rtsne_0.15                shiny_1.7.1              
alanocallaghan commented 2 years ago

Sorry for the delay getting back to this - for future reference it's easier to debug stuff like this if you give a reproducible example (ie something I can run locally without modifying, and not something that references files you have on your computer that I don't have access to). I'll patch out the code that causes this error in the next Bioc version

As a quick fix, you can leave out the StoreChains and RunName arguments, and save the object to disk yourself like this:

library("BASiCS")
library("Seurat")
library("data.table")
library("BiocParallel")
library("parallelly")

ncors <- availableCores()
register(SnowParam(workers = ncors))

data <- BASiCS_MockSCE()

samp <- "Run1"

basics_results <- BASiCS_MCMC(
    data, N = 200, Thin = 20, Burn = 100, WithSpikes = FALSE, 
    Regression = FALSE, PrintProgress = FALSE,
    SubsetBy = "gene", NSubsets = ncors
)
saveRDS(basics_results, paste0(samp, ".rds"))