SingleR-inc / SingleR

Clone of the Bioconductor repository for the SingleR package.
https://bioconductor.org/packages/devel/bioc/html/SingleR.html
GNU General Public License v3.0

Dealing with underrepresented cell types across references #250

Closed: samgest closed this issue 10 months ago

samgest commented 10 months ago

Hi,

I am trying to train a SingleR classifier (with trainSingleR) using a list of several reference cell-by-gene matrices, with the corresponding cell labels in a separate list. Assuming all of these reference matrices have already been filtered to contain the same features, how does the combineRecomputedResults function handle cell type labels that are represented in only one (or a few) of the references?

I'm sorry if I'm not being clear enough; feel free to ask for any detail needed. Thanks in advance.

LTLA commented 10 months ago

It should be fine.

For each cell in the test dataset, combineRecomputedResults looks at the label assigned by each individual reference. For that label, it pulls out the markers that are upregulated versus every other label in the same reference. It does this for each reference, takes the union of all the extracted markers across references, and uses that union to compute the correlations. The fact that your label is only present in one reference doesn't really matter (in theory, at least), because the union operation doesn't weight the markers by their frequency of occurrence across the references; you would get the same union of markers regardless of whether your label was present in one reference or across all references.
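To make the union step concrete, here is a minimal conceptual sketch in Python. It is not SingleR's code or API: the reference names, gene sets, and the `pooled_markers` helper are all made up for illustration, and it assumes each reference's markers are already available as plain sets per label.

```python
# Hypothetical marker tables: reference -> label -> set of upregulated genes.
# All names and gene sets here are illustrative assumptions.
markers_by_ref = {
    "ref_A": {"T cell": {"CD3E", "CD2"}, "B cell": {"CD79A", "MS4A1"}},
    "ref_B": {"T cell": {"CD3E", "IL7R"}, "rare type": {"GENE_X", "GENE_Y"}},
}


def pooled_markers(assigned, markers_by_ref):
    """Union of the markers for the label that each reference assigned."""
    pool = set()
    for ref, label in assigned.items():
        # A set union is unweighted: a gene is either in the pool or not,
        # no matter how many references contributed it.
        pool |= markers_by_ref[ref][label]
    return pool


# One test cell that ref_A called "B cell" and ref_B called "rare type".
# "rare type" exists only in ref_B, yet its markers still enter the pool.
rare_pool = pooled_markers({"ref_A": "B cell", "ref_B": "rare type"}, markers_by_ref)

# Another test cell that both references called "T cell".
# CD3E is found in both references but appears in the pool only once.
tcell_pool = pooled_markers({"ref_A": "T cell", "ref_B": "T cell"}, markers_by_ref)
```

In this toy setup the rare label contributes its two genes even though only one reference knows about it, which is the "in theory" part of the argument; the pool for the shared label is larger simply because two references each contributed some distinct markers.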

In practice, it will matter at least a little, because the markers will differ from reference to reference even in the best case. If your label were present in more references, you'd probably collect more markers for it, and the correlation calculation would draw on more relevant genes. Hopefully the cell type is distinct enough that there's enough signal even with the small number of markers for that label in the pool.

P.S. The minority-marker problem could be solved by doing fine-tuning in the combineRecomputedResults step, as this would drop irrelevant labels and their associated markers, but I didn't get around to implementing it.

P.P.S. I believe that the combineRecomputedResults step can actually handle differences in the feature spaces between references by simply ignoring the missing features when computing the correlation for each reference. On the whole, though, it's probably safest to make sure that every reference has the same set of features; otherwise the correlations aren't strictly comparable between references, which slightly undermines the whole basis of how SingleR works.
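As a rough illustration of why mismatched feature spaces make scores less comparable, the sketch below computes a Spearman correlation over only those pooled markers that are present in both the test cell and a given reference profile. This is a conceptual Python sketch under stated assumptions, not SingleR's implementation: the helper names (`ranks`, `pearson`, `spearman_on_shared`), the gene names, and the expression values are all hypothetical, and the real scoring involves more than a single profile-to-cell correlation.

```python
def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_on_shared(test_cell, ref_profile, pool):
    """Spearman correlation over pooled markers found in both inputs."""
    genes = sorted(g for g in pool if g in test_cell and g in ref_profile)
    x = [test_cell[g] for g in genes]
    y = [ref_profile[g] for g in genes]
    return pearson(ranks(x), ranks(y)), genes


# Hypothetical data: the reference profile lacks GENE_Y entirely.
pool = {"CD79A", "MS4A1", "GENE_X", "GENE_Y"}
test_cell = {"CD79A": 5.0, "MS4A1": 3.0, "GENE_X": 0.5, "GENE_Y": 0.0}
ref_profile = {"CD79A": 4.0, "MS4A1": 2.0, "GENE_X": 1.0}

rho, used = spearman_on_shared(test_cell, ref_profile, pool)
```

Here the correlation is computed on three genes rather than four. A second reference that did contain GENE_Y would be scored on a different gene set, which is why the resulting correlations are not strictly comparable across references.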