compbiomed / celda

Bayesian Hierarchical Modeling for Clustering Single Cell Genomic Data

Expected time for a run & minor bug #275

Closed chlee-tabin closed 6 years ago

chlee-tabin commented 6 years ago

In the celda-analysis vignette, all of the `seq` calls, for example `seq(5:50, by = 5)`, should be changed to the form `seq(5, 50, by = 5)` so that the code can be executed.
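For reference, the failing call and the fix (`seq()` errors when its `from` argument has length greater than one):

```r
seq(5:50, by = 5)   # Error: 'from' must be of length 1
seq(5, 50, by = 5)  # 5 10 15 20 25 30 35 40 45 50
```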

I am currently following the celda-analysis vignette and am considering running the pbmc_select dataset on a cluster (the HMS O2 cluster). What are the recommended memory and core requirements for such a run? And how does the runtime scale with the number of cells and genes?

Thank you very much!

joshua-d-campbell commented 6 years ago

Hi @chlee-tabin, Thanks for your questions about performance. We are in the process of implementing several enhancements to improve speed; these should be finalized in the next few weeks. In general, you should have around 2-3 times the memory of your data matrix. The matrix can be stored as the 'integer' data type rather than 'numeric', since celda works directly on counts (it converts to this internally anyway). You can set the data type and check the size of your matrix with the commands:

```r
storage.mode(matrix) <- "integer"          # store counts as integers rather than doubles
print(object.size(matrix), units = "GB")   # check how much memory the matrix occupies
```

For speed, the runtime will increase as the number of cells and genes increases, as well as with the number of cell and gene clusters you are trying to estimate (i.e. the K/L parameters). I ran the 68K PBMC dataset, and a single chain took about 2 hours to finish with K=15 and L=25. Increasing the number of chains in the Gibbs sampling allows for more random initializations of the cell cluster and gene cluster labels per K/L combination, and thus a more thorough search of the probability space (increasing the chance of finding a higher-probability model fit).

Having multiple cores will allow multiple chains to be run in parallel. It won't increase the speed of individual chains, however. For example, if you tried 5 values of K and 5 values of L with `nchains = 5`, that would be a total of 5 x 5 x 5 = 125 different models to fit. If you had 25 cores, 25 models would be fitted at a time, decreasing the total run time roughly 25-fold.
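As an illustration, a run over a grid of K/L values with multiple chains per combination might look like the sketch below. This assumes the `celdaGridSearch` interface from later celda releases; the exact function and argument names may differ by version, and the `counts` matrix here is a simulated placeholder for your own data.

```r
library(celda)

# Hypothetical integer count matrix (genes x cells); replace with real data
counts <- matrix(rpois(1000 * 500, lambda = 1), nrow = 1000, ncol = 500)

# 5 values of K x 5 values of L x 5 chains = 125 model fits;
# with cores = 25, up to 25 fits run at the same time
res <- celdaGridSearch(counts,
                       model = "celda_CG",
                       paramsTest = list(K = seq(10, 30, by = 5),
                                         L = seq(15, 35, by = 5)),
                       nchains = 5,
                       cores = 25)
```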

chlee-tabin commented 6 years ago

Thank you very much! This is really helpful. Just one more question: typically, (1) there are batch effects and other uninteresting variables that people regress out in other single-cell assays, or (2) imputation algorithms are applied to the matrix, in both cases leaving a matrix that is no longer integer-valued. Is there any way to supply a non-integer matrix? Or are there other ways to handle (1) and/or (2) in celda?

joshua-d-campbell commented 6 years ago

Great questions. We are working on methods to handle confounding effects such as batch or individual, but they are not available yet. With this type of model, we don't want to regress out certain effects, such as the number of UMIs/counts per cell, as this is inherently handled in the model.

The strength of this type of model is that it can inherently handle sparse integer data by pooling information across cells to estimate cell populations and/or pooling gene counts to estimate transcriptional modules. So imputation is not really recommended.

For downstream analyses such as tSNE, PCA, trajectory analysis, etc., we recommend using the counts from the factorized matrix (i.e. `fm$counts$cell.states`, where `fm` is the object returned by the `factorizeMatrix` function). This is a reduced-dimensional dataset with one row per transcriptional module (modules are less noisy than individual genes) and one column per cell. It can usually be normalized and used in the same way as the full counts matrix.
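For example, extracting and normalizing the module-level matrix might look like the following sketch. Here `counts` and `celdaMod` are hypothetical placeholders for your own count matrix and fitted model, and the use of celda's `normalizeCounts` helper and the `fm$counts$cell.states` slot name assume a particular celda version.

```r
library(celda)

# 'counts' is the integer count matrix and 'celdaMod' a fitted celda_CG model
fm <- factorizeMatrix(counts, celdaMod)

# One row per transcriptional module, one column per cell
moduleCounts <- fm$counts$cell.states

# Normalize and use like a regular counts matrix, e.g. for PCA or tSNE
moduleNorm <- normalizeCounts(moduleCounts)
pca <- prcomp(t(moduleNorm))
```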