Closed: chlee-tabin closed this issue 6 years ago
Hi @chlee-tabin, Thanks for your questions about performance. We are in the process of implementing several enhancements to improve speed, which should be finalized in the next few weeks. In general, for memory you should have around 2-3 times the size of your data matrix. The matrix can be stored as 'integer' rather than 'numeric' data type, as celda works directly on counts (celda converts to this internally). You can set the data type and see the size of your matrix with the commands:
storage.mode(matrix) <- "integer"
print(object.size(matrix), units = "GB")
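As a quick sanity check of the 2-3x rule, the footprint of a dense integer matrix can be estimated directly. The dimensions below (the full 10x gene set by the 68K PBMC cells) are assumptions chosen for illustration:

```r
# Rough memory estimate for a dense integer matrix (R stores integers
# in 4 bytes per entry). Dimensions are illustrative assumptions.
genes <- 32738
cells <- 68579
matrix_gb <- genes * cells * 4 / 1024^3
round(matrix_gb, 1)            # size of the raw counts matrix in GiB
round(c(2, 3) * matrix_gb, 1)  # memory to budget, following the 2-3x rule
```

A matrix of this size is roughly 8.4 GiB, so budgeting about 17-25 GiB of RAM would be in line with the guideline above.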
For speed, run time increases with the number of cells and genes, as well as with the number of cell and gene clusters you are trying to estimate (i.e. the K/L parameters). I ran the 68K PBMC dataset and a single chain took about 2 hours to finish with K=15 and L=25. Increasing the number of chains in the Gibbs sampling allows for more random initializations of the cell cluster and gene cluster labels per K/L combination, and thus a more thorough search of the probability space (increasing the chance of finding a higher-probability model fit).
Having multiple cores will allow multiple chains to be run in parallel; it won't increase the speed of individual chains, however. For example, if you tried 5 values of K and 5 values of L with nchain=5, that would be a total of 5 x 5 x 5 = 125 different models to fit. If you had 25 cores, 25 models would be fitted at the same time, decreasing the total run time roughly 25-fold.
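As a sketch, the 5 x 5 x 5 grid described above could look like the following. Note that celda_CG and its arguments (K, L, nchains, cores) are assumptions here; the exact function names and signatures differ between celda versions, so check the documentation of your installed release:

```r
library(celda)  # assumed: Bioconductor celda package

Ks <- seq(5, 25, by = 5)    # 5 values of K (cell clusters)
Ls <- seq(10, 50, by = 10)  # 5 values of L (gene modules)

# With nchains = 5, this grid is 5 x 5 x 5 = 125 model fits in total;
# with 25 cores, up to 25 chains can run in parallel at any one time.
fits <- list()
for (K in Ks) {
  for (L in Ls) {
    # Hypothetical call; argument names are assumptions.
    fits[[paste0("K", K, "_L", L)]] <-
      celda_CG(counts, K = K, L = L, nchains = 5, cores = 25)
  }
}
```

The best-fitting model per K/L combination can then be compared across the grid to choose K and L.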
Thank you very much! This is really helpful. Just one more question: typically, (1) there are batches and other uninteresting variables that people regress out in other single-cell assays, or (2) there are algorithms that impute the matrix, so that the matrix no longer contains integers. Is there any way to pass the matrix in non-integer units? Or are there other ways to incorporate issues (1) and/or (2) into celda?
Great questions. We are working on methods to handle confounding effects such as batch or individual, but they are not available yet. With this type of model, we don't want to regress out some effects such as number of UMIs/counts per cell as this is inherently handled in the model.
The strength of this type of model is that it can inherently handle sparse integer data by pooling information across cells to estimate cell populations and/or pooling gene counts to estimate transcriptional modules, so imputation is not really recommended.
For downstream analyses such as tSNE, PCA, trajectory analysis, etc., we recommend using the counts from the factorized matrix (i.e. fm$counts$cell.states if 'fm' is the object returned by the 'factorizeMatrix' function). This is a reduced-dimensional dataset with one row per transcriptional module (modules are less noisy than individual genes) and one column per cell. It can usually be normalized and used in the same way as the full counts matrix.
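A minimal sketch of that downstream workflow, assuming 'celda.mod' is a fitted celda model and 'counts' is the integer matrix it was fitted on (the factorizeMatrix argument order and list element names are assumptions based on this thread; check your version's help page):

```r
# Factorize the fitted model; fm$counts$cell.states holds the
# module-by-cell counts (one row per transcriptional module,
# one column per cell), as described above.
fm <- factorizeMatrix(counts, celda.mod)
module_counts <- fm$counts$cell.states

# Normalize like an ordinary counts matrix, e.g. counts per million + log:
cpm <- t(t(module_counts) / colSums(module_counts)) * 1e6
log_cpm <- log2(cpm + 1)

# log_cpm can now feed standard tools, e.g. (hypothetical usage):
# Rtsne::Rtsne(t(log_cpm))   # tSNE, cells as rows
# prcomp(t(log_cpm))         # PCA, cells as rows
```

Because there are far fewer modules than genes, these downstream steps are also much faster than on the full matrix.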
In the celda-analysis vignette, all seq options, for example seq(5:50, by = 5), should be changed to seq(5, 50, by = 5) for the code to execute.

I am currently following the celda-analysis vignette and am considering a cluster (HMS O2 cluster) for running the pbmc_select dataset. What are the recommendations for memory requirements and number of cores? And how does the time scale with the number of cells and observations? Thank you very much!