Reference set contains very few AT1 and AT2 cells, but SingleR assigned most cells in the test dataset to AT1 and AT2. Is this statistically and computationally reasonable?

SingleR-inc / SingleR

Clone of the Bioconductor repository for the SingleR package.

GNU General Public License v3.0

My reference dataset contains very few AT1 and AT2 cells (around 1% of the total number of cells). However, in the SingleR output, cells annotated as AT1 and AT2 dominate the test set, at almost 30%. Is this statistically and computationally reasonable?

I know SingleR chooses the "best" label for a cell by iteratively filtering out the bad candidate labels until two labels are left. From this point of view, it seems reasonable that SingleR identifies a larger proportion of AT1 and AT2 cells than I expected. Still, I would appreciate your feedback on this case.

Thanks!

In general, the proportional composition of the reference and the composition of the test do not have to be the same, or even particularly similar, as long as the cell types in the reference are a superset of those in the test. If my reference contained PBMCs and my test contained some purified subtype, then a difference in composition is not surprising.
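To illustrate why the two compositions can differ, here is a minimal toy sketch in Python. It is not SingleR itself: it uses a simplified stand-in classifier (Spearman correlation of each test cell against a per-label median profile), and all profiles, cell counts, and noise levels are made-up assumptions. Because each test cell is scored on its own, the label frequencies in the reference do not act as a prior on the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 50
labels = ["AT1", "AT2", "other"]

# Hypothetical per-label expression profiles (toy values, not real signatures).
profiles = {lab: rng.gamma(shape=2.0, scale=2.0, size=n_genes) for lab in labels}

def simulate(counts):
    """Draw noisy cells around each label's profile; return (matrix, true labels)."""
    cells, truth = [], []
    for lab, n in counts.items():
        cells.append(profiles[lab] + rng.normal(0.0, 0.5, size=(n, n_genes)))
        truth += [lab] * n
    return np.vstack(cells), np.array(truth)

def centered_unit_ranks(x):
    """Rank each row, then center and scale it so dot products are Spearman correlations."""
    r = np.argsort(np.argsort(x, axis=1), axis=1).astype(float)
    r -= r.mean(axis=1, keepdims=True)
    r /= np.linalg.norm(r, axis=1, keepdims=True)
    return r

# Reference: AT1 + AT2 are 1% of cells. Test: AT1 + AT2 are 30% of cells.
ref, ref_labels = simulate({"AT1": 5, "AT2": 5, "other": 990})
test, test_truth = simulate({"AT1": 30, "AT2": 30, "other": 140})

# One median profile per reference label, regardless of how many cells it has.
centroids = np.vstack([np.median(ref[ref_labels == lab], axis=0) for lab in labels])

# Score every test cell against every label and keep the best-scoring label.
scores = centered_unit_ranks(test) @ centered_unit_ranks(centroids).T
pred = np.array(labels)[np.argmax(scores, axis=1)]

ref_fraction = float(np.mean(np.isin(ref_labels, ["AT1", "AT2"])))
pred_fraction = float(np.mean(np.isin(pred, ["AT1", "AT2"])))
accuracy = float(np.mean(pred == test_truth))
print(f"AT1+AT2 in reference: {ref_fraction:.2f}, in predictions: {pred_fraction:.2f}")
print(f"accuracy on toy test set: {accuracy:.2f}")
```

In this toy setting the predicted AT1/AT2 fraction follows the test set rather than the reference. This only shows that such a result is possible, not that the labels in a real dataset are correct.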

That said, if you're expecting the reference and test to be similar, then there is some cause for concern. I would run through some of the diagnostics in the book and check that the predicted AT1 cells are expressing some sensible markers for AT1, based on the de.genes and some biological knowledge of what defines an AT1 cell. That should indicate whether the labels can be trusted.
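The marker check described above could be sketched as follows. This is a hedged, language-agnostic illustration in Python rather than the SingleR API: `marker_summary` is a hypothetical helper, the expression matrix is toy data, and the gene names are placeholders for whatever markers your de.genes output and biological knowledge suggest.

```python
import numpy as np

def marker_summary(expr, gene_names, pred_labels, label, markers):
    """Compare cells predicted as `label` against all other cells, per marker.

    expr: cells x genes array of (log-)normalized expression values.
    Returns {marker: (mean_in_label, mean_in_rest, fraction_detected_in_label)}.
    Markers missing from `gene_names` are skipped.
    """
    gene_index = {g: i for i, g in enumerate(gene_names)}
    in_label = np.asarray(pred_labels) == label
    summary = {}
    for marker in markers:
        if marker not in gene_index:
            continue
        col = expr[:, gene_index[marker]]
        summary[marker] = (
            float(col[in_label].mean()),        # average expression in predicted cells
            float(col[~in_label].mean()),       # average expression everywhere else
            float((col[in_label] > 0).mean()),  # share of predicted cells with any signal
        )
    return summary

# Toy data: 4 cells x 3 genes (values are illustrative only).
gene_names = ["AGER", "PDPN", "SFTPC"]
expr = np.array([
    [3.0, 2.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 4.0],
    [0.0, 1.0, 2.0],
])
pred_labels = ["AT1", "AT1", "AT2", "AT2"]

summary = marker_summary(expr, gene_names, pred_labels, "AT1", ["AGER", "PDPN"])
for marker, (mean_in, mean_out, detected) in summary.items():
    print(marker, mean_in, mean_out, detected)
```

If the cells predicted as AT1 show clearly higher expression of the expected markers than the remaining cells, and most of them detect those markers at all, that supports the labels. Weak or absent marker expression would suggest the assignments need a closer look.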

Reference set contains very few AT1 and AT2 cells, but SingleR assigned most cells in the test dataset to AT1 and AT2. Is this statistically and computationally reasonable? #191