BayraktarLab / cell2location

Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics (cell2location model)
https://cell2location.readthedocs.io/en/latest/
Apache License 2.0

Size of the single cell dataset #298

Open Jaimomar99 opened 1 year ago

Jaimomar99 commented 1 year ago

Hey, thanks a lot for this wonderful package. I've been experimenting with it, and the results have been very accurate according to the pathologists.

I have a question about the size of the single-cell dataset. I assume that a bigger dataset gives better results. However, I'm unsure how many cells I should aim for when generating a single-cell dataset. I'm trying to find the right balance between performance and cost-effectiveness. Of course there is no single correct answer, but if you could provide some insights I would appreciate it.

For instance, would a dataset with 4k cells be sufficient, or should I aim for a larger number? Are there any research papers or methods that explore this correlation?

Jaime

vitkl commented 1 year ago

You need to aim for sufficient representation of your populations of interest. In some rare cases, very few cells (e.g. 10 cells) can suffice to define informative reference gene expression signatures. However, I would generally recommend 100s of cells per population of interest (at least 40-50 cells for rarer populations). For an atlas of one tissue, you can get a decent reference with 40k cells (e.g. the mouse brain data in our paper).
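
To make the per-population guideline concrete, here is a minimal sketch that counts cells per annotated population and flags those below the numbers mentioned above. It is not part of cell2location: the function name `summarize_reference_coverage`, the status labels, and the exact cut-offs (100 as "recommended", 40 as the floor for rarer populations) are assumptions chosen for illustration, and the cell type names in the usage example are made up.

```python
from collections import Counter

# Cut-offs loosely based on the reply above; the exact values are assumptions.
RECOMMENDED_CELLS = 100  # "100s of cells per population of interest"
MINIMUM_RARE_CELLS = 40  # "at least 40-50" cells for rarer populations


def summarize_reference_coverage(labels, recommended=RECOMMENDED_CELLS,
                                 minimum=MINIMUM_RARE_CELLS):
    """Count cells per population and flag under-represented ones.

    `labels` is any iterable with one population label per cell.
    Returns a dict mapping population -> (n_cells, status), ordered from
    the most to the least abundant population.
    """
    counts = Counter(labels)
    report = {}
    for population, n_cells in counts.most_common():
        if n_cells < minimum:
            status = "too_few"      # below the floor for rare populations
        elif n_cells < recommended:
            status = "borderline"   # usable, but below the general guideline
        else:
            status = "ok"
        report[population] = (n_cells, status)
    return report


if __name__ == "__main__":
    # Hypothetical annotation: one label per cell in the reference dataset.
    example_labels = ["Astro"] * 150 + ["Oligo"] * 60 + ["Microglia"] * 12
    for population, (n_cells, status) in summarize_reference_coverage(example_labels).items():
        print(f"{population}: {n_cells} cells ({status})")
```

If the reference is stored as an AnnData object, the labels would typically come from a column of `.obs` holding the cell type annotation (the column name depends on your dataset). Populations flagged as `too_few` are candidates for merging with a related population or for targeted enrichment in a follow-up experiment.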