As described in #99, I have written a new function that pushes the combining back to the last step. This allows us to run SingleR() to obtain classification results for each reference independently, which provides more flexibility and, in theory, greater speed and accuracy, as each reference doesn't need to include irrelevant genes that might be markers only in other references.
The new combining step is run at the very end and recomputes the scores in a manner that is (more) comparable across references. For each cell, it identifies the assigned label from each reference and collects all markers for those labels. It then recomputes the score for each assigned label in each reference and chooses the reference that has the highest recomputed score.
While there is still a score calculation across a union of markers, this should be faster and more specific, as the union is only taken across the assigned labels for each cell, not across all labels in all references. In effect, we shift the compute time from SingleR() to combineRecomputedResults().
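For concreteness, here is a rough, self-contained sketch of that per-cell logic in R. It is not the actual implementation: the object names are illustrative, and a plain 0.8 quantile of Spearman correlations stands in for the real score calculation.

```r
# Illustrative sketch only. 'assigned' holds the label chosen for this cell by
# each reference, 'refs' is a list of log-expression matrices (genes x samples,
# labels as column names, shared gene universe), and 'markers' is a
# per-reference list giving the marker genes for each label.
recombine_one_cell <- function(cell.expr, assigned, refs, markers) {
    # Pool the markers of the assigned labels only, not of all labels.
    genes <- unique(unlist(mapply(function(m, lab) m[[lab]], markers, assigned)))

    # Recompute each assigned label's score on this shared gene set.
    scores <- mapply(function(ref, lab) {
        samples <- ref[genes, colnames(ref) == lab, drop = FALSE]
        quantile(cor(cell.expr[genes], samples, method = "spearman"), 0.8)
    }, refs, assigned)

    # The reference with the highest recomputed score supplies the final label.
    names(which.max(scores))
}
```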
In testing with diverse datasets, I find the new and old methods to perform comparably from a quality of scores perspective:
No. | Datasets | refs |
---|---|---|
1 | human sorted naive CD4 and CD8 T cells | HPCA & BlueprintEncode |
2 | human sorted CD34+ hematopoietic stem & progenitor cells | HPCA & BlueprintEncode |
3 | human brain cells dataset 1 | HPCA & BlueprintEncode |
4 | human brain cells dataset 2 | HPCA & BlueprintEncode & manually annotated No. 3 |
No. | # cells in test | Time (old method)*** | Time (new method)*** |
---|---|---|---|
1 | 40,152 | 1648.493 sec | 1131.123 sec |
2 | 5,183 | 283.533 sec | 405.79 sec |
3 | 8,227 | 2067.689 sec | 1461.29 sec |
4 | 70,634 | 21954.559 sec | 8382.182 sec |
*** = available memory may not have been equal across runs, and n = 1 for each.
1 & 4: For quality data where the provided refs match the test cells and cell type borders are non-fuzzy, both methods perform great:
2: For low genes-per-cell data where provided refs match the test cells but borders between cells are fuzzy due to progressive differentiation:
3: For quality data where provided refs did not match the test cells:
Great insights! I also felt that the neuronal reference sets are lacking; i.e., I found their performance to be significantly less trustworthy than for immune cells. But then again, I was doing it on brain organoids, which have all kinds of issues, so I was hesitant to place the blame squarely on the reference. Do you already have an idea where to get better neuronal reference data from?
My No. 3 and 4 above can both be used. They both have published annotations =). I just need to nudge my labmate to help me get them organized properly and either into scRNAseq or directly into SingleR.
3 = Grubman et al., Nature Neuroscience 2019, https://www.nature.com/articles/s41593-019-0539-4
4 = Mathys et al., Nature 2019, https://www.ncbi.nlm.nih.gov/pubmed/31042697
A reproducible example to test performance. Note that this is without the changes Aaron made last night. It uses an everything-and-the-kitchen-sink approach, with all the references that contain basically any immune/haematopoietic cells.
library(TENxPBMCData)
library(scater)
library(scran)
library(SingleR)
library(dittoSeq)
# Load data
sce <- TENxPBMCData(dataset = "pbmc4k")
# Basic filtering
# Mitochondrial genes are matched by symbol, since the row names are still Ensembl IDs at this point.
per.cell <- perCellQCMetrics(sce, subsets=list(Mito=grep("^MT-", rowData(sce)$Symbol_TENx)))
qc.stats <- quickPerCellQC(per.cell, percent_subsets="subsets_Mito_percent")
sce <- sce[,!qc.stats$discard]
keep <- nexprs(sce, byrow=TRUE) > 0
sce <- sce[keep, ]
sce <- logNormCounts(sce)
# Get variable genes
dec <- modelGeneVar(sce)
top.hvgs <- getTopHVGs(dec, prop=0.1)
# Dim reductions
sce <- runPCA(sce, subset_row=top.hvgs)
sce <- runUMAP(sce)
output <- getClusteredPCs(reducedDim(sce))
npcs <- metadata(output)$chosen
reducedDim(sce, "PCAsub") <- reducedDim(sce, "PCA")[,1:npcs,drop=FALSE]
npcs
# Clustering
g <- buildSNNGraph(sce, use.dimred="PCAsub")
cluster <- igraph::cluster_walktrap(g)$membership
sce$cluster <- factor(cluster)
# Switch to gene symbols so that the row names match the references
rownames(sce) <- rowData(sce)$Symbol_TENx
# Load all references containing immune/haematopoietic cells
bp <- BlueprintEncodeData()
mona <- MonacoImmuneData()
dice <- DatabaseImmuneCellExpressionData()
dmap <- NovershternHematopoieticData()
hpca <- HumanPrimaryCellAtlasData()
refs <- list(BP=bp, Monaco=mona, DICE=dice, DMAP=dmap, HPCA=hpca)
fine.labels <- list(bp$label.fine, mona$label.fine, dice$label.fine, dmap$label.fine,
hpca$label.fine)
main.labels <- list(bp$label.main, mona$label.main, dice$label.main,
dmap$label.main, hpca$label.main)
message("Common mode with fine labels and 'single' method:")
system.time(pred.fine.com <- SingleR(test = sce, ref = refs, labels = fine.labels,
method = "single", recompute = FALSE))
message("Recompute mode with fine labels and 'single' method:")
system.time(pred.fine.recomp <- SingleR(test = sce, ref = refs, labels = fine.labels,
method = "single"))
Common mode with fine labels and 'single' method:
   user  system elapsed 
721.188  72.391 793.385 
Recompute mode with fine labels and 'single' method:
   user  system elapsed 
484.297  47.078 531.136 
message("Common mode with fine labels, 'single' method, and no fine-tuning:")
system.time(pred.fine.com.noft <- SingleR(test = sce, ref = refs, labels = fine.labels,
method = "single", fine.tune = FALSE, recompute = FALSE))
message("Recompute mode with fine labels, 'single' method, and no fine-tuning:")
system.time(pred.fine.recomp.noft <- SingleR(test = sce, ref = refs, labels = fine.labels,
method = "single", fine.tune = FALSE))
Common mode with fine labels, 'single' method, and no fine-tuning:
   user  system elapsed 
211.610  66.437 277.895 
Recompute mode with fine labels, 'single' method, and no fine-tuning:
   user  system elapsed 
204.344  42.844 247.159 
message("Common mode with main labels and 'single' method:")
system.time(pred.main.com <- SingleR(test = sce, ref = refs, labels = main.labels,
method = "single", recompute = FALSE))
message("Recompute mode with main labels and 'single' method:")
system.time(pred.main.recomp <- SingleR(test = sce, ref = refs, labels = main.labels,
method = "single"))
Common mode with main labels and 'single' method:
   user  system elapsed 
349.578  33.422 382.921 
Recompute mode with main labels and 'single' method:
   user  system elapsed 
174.375  21.515 195.874 
message("Common mode with main labels, 'single' method, and no fine-tuning:")
system.time(pred.main.com.noft <- SingleR(test = sce, ref = refs, labels = main.labels,
method = "single", fine.tune = FALSE, recompute = FALSE))
message("Recompute mode with main labels, 'single' method, and no fine-tuning:")
system.time(pred.main.recomp.noft <- SingleR(test = sce, ref = refs, labels = main.labels,
method = "single", fine.tune = FALSE))
Common mode with main labels, 'single' method, and no fine-tuning:
   user  system elapsed 
119.812  29.094 148.849 
Recompute mode with main labels, 'single' method, and no fine-tuning:
   user  system elapsed 
 90.110  18.609 108.645 
Overall, the recompute method does speed things up, at least for this dataset. Another observation is that I think it generally provides better granularity compared to the common method.
table(pred.fine.com$pruned.labels)
CD4+ T-cells CD4+ Tcm
546 111
CD4+ Tem CD8+ T-cells
150 384
CD8+ Tcm CD8+ Tem
655 321
Class-switched memory B-cells DC
50 1
GMP Intermediate monocytes
2 10
Megakaryocytes Memory B-cells
4 144
MEP Monocyte:CD16+
5 15
Monocytes Monocytes, CD16+
1049 1
MPP Myeloid dendritic cells
2 81
Naive B cells naive B-cells
1 405
Naive CD4 T cells NK cells
2 190
Non classical monocytes Non-switched memory B cells
16 1
Plasmablasts Plasmacytoid dendritic cells
1 35
Tregs
109
The recompute results:
table(pred.fine.recomp$pruned.labels)
B cells, naive B_cell
1 15
B_cell:immature B_cell:Naive
9 4
CD4+ T-cells CD4+ Tcm
357 95
CD4+ Tem CD8+ T-cells
140 173
CD8+ Tcm CD8+ Tem
489 250
Central memory CD8 T cells Class-switched memory B-cells
17 36
DC Exhausted B cells
1 17
Follicular helper T cells GMP
3 2
Intermediate monocytes MAIT cells
6 10
Megakaryocytes Memory B-cells
3 108
MEP Monocyte:CD14+
4 22
Monocyte:CD16- Monocyte:CD16+
515 136
Monocytes Monocytes, CD14+
337 35
Monocytes, CD16+ MPP
1 1
Myeloid dendritic cells Naive B cells
112 208
naive B-cells Naive CD4 T cells
169 241
Naive CD8 T cells Natural killer cells
64 15
NK cells NK_cell:CD56hiCD62L+
155 10
Non-switched memory B cells Non-Vd2 gd T cells
22 5
Plasmablasts Plasmacytoid dendritic cells
1 35
Pre-B_cell_CD34- Progenitor cells
7 1
Switched memory B cells T cells, CD4+, memory TREG
15 2
T cells, CD4+, naive T cells, CD4+, naive TREG
5 1
T cells, CD4+, TFH T cells, CD8+, naive
9 32
T regulatory cells T_cell:CD4+
17 12
T_cell:CD4+_central_memory T_cell:CD4+_effector_memory
177 24
T_cell:CD4+_Naive T_cell:CD8+
34 14
T_cell:gamma-delta Terminal effector CD8 T cells
1 20
Th1 cells Th1/Th17 cells
10 1
Th17 cells Th2 cells
1 2
Tregs Vd2 gd T cells
62 25
This is somewhat muddied here due to the overlapping/similar labels between the references, but in general, fewer cells are classified with low-granularity labels (e.g. CD4+ T cell) using this method. I've also observed this using the cluster mode (not shown).
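For a rough sense of the difference, one could tally how many cells end up with the broadest labels in each run. A quick back-of-the-envelope sketch; the "coarse" set below is hand-picked from the tables above purely for illustration:

```r
# Hand-picked examples of low-granularity labels seen in the tables above.
coarse <- c("CD4+ T-cells", "CD8+ T-cells", "Monocytes", "B_cell", "NK cells")

# Fraction of cells assigned one of these broad labels by each method.
mean(pred.fine.com$pruned.labels %in% coarse)
mean(pred.fine.recomp$pruned.labels %in% coarse)
```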
Will post images and more info once I get a chance to finish harmonizing the labels between references.
Excellent insights from @dtm2451 and @j-andrews7. It seems that we get a modest speed boost in many cases and accuracy that is, at the very least, no worse than before.
I'll add one more theoretical comment for future reference. As I may have previously mentioned, the inter-reference score calculations are done using the markers for the top scoring label in each reference. That is, we take the markers that distinguish the top-scoring label from other labels in the same reference, and then we take the union of those markers across all references. This approach improves speed by avoiding the use of a union of all markers across all references.
A potential pitfall of this approach is that, in theory, we may not include "negative markers", i.e., genes that are not expressed in the true label for a given cell. Without these negatives, we are computing correlations across only the set of positive markers, equivalent to looking for within-cell-type structure rather than across-cell-type structure. If you want to visualize this, have a look at common depictions of Simpson's paradox where removing a group of points can change the correlation.
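To make this concrete, here is a toy illustration with invented numbers, using plain Spearman correlations as a stand-in for the actual score:

```r
# Genes 1-3 are "positive" markers of the assigned label; genes 4-6 are
# "negative" markers that the true cell type does not express.
test  <- c(5, 4, 6, 0, 0, 1)   # the test cell
right <- c(5, 6, 4, 0, 0, 0)   # profile of the correct label
wrong <- c(6, 3, 7, 8, 9, 7)   # profile of an incorrect label

# Across all six genes, the correct label wins comfortably:
cor(test, right, method = "spearman")   # ~0.71
cor(test, wrong, method = "spearman")   # ~-0.68

# Restricted to the three positive markers, the ranking flips, because we are
# now correlating within-cell-type noise rather than across-cell-type structure:
cor(test[1:3], right[1:3], method = "spearman")   # -1
cor(test[1:3], wrong[1:3], method = "spearman")   #  1
```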
In practice, I don't think this is a problem, mostly because we would only fail to include negative markers if all of the top-scoring labels are correct! Well, perhaps it's not so clear-cut, but the most egregious failures of the inter-reference score calculation should only occur when the top-scoring labels are all close enough to the correct cell type that it doesn't really matter all that much.
Anyway, I think I'm pretty happy with this and will merge #99. I have found this division of labor among ourselves to be quite effective and I hope we will be able to continue doing this moving forward; I no longer have the knowledge (and certainly not the time) to wade through datasets to test things out to this depth. See also ab61e15a93b05e94eef5b9471fec7ae3fb0c089f if you want to impress people in a bar.
Listing thoughts here so that I don't forget them. Maybe someone would be willing to help implement this; it shouldn't take (much) C++ code.
Context:
Our current approach to dealing with multiple references is to do the assignment within each reference; compare the top non-fine-tuned scores across references; and pick the (possibly fine-tuned) label from the reference with the highest non-fine-tuned score. Most of the work is done within each reference, which is logistically convenient and side-steps problems with batch effects between references. The fact that we compare non-fine-tuned scores is rather unavoidable, because the scores are not comparable after fine-tuning, but that's probably okay for the most part; the goal is just to get us close enough to avoid obviously wrong matches to completely different cell types.
However, a practical difficulty with this approach is that we must always take the union of genes from all references in the initial search for each reference, otherwise the scores are not directly comparable. This does not scale well as we combine references that contain more diverse cell types; it is possible that we will end up including a good chunk of the transcriptome in our initial score calculation. This slows everything down, which is bad enough, but more problematic is the increased risk of irrelevant noise overwhelming the signal. The latter can potentially cause the correct cell type to be discarded before we can even perform fine-tuning that might salvage the situation.
Proposed solution:
The lightest touch is to adjust how we combine results from multiple references. Namely, we perform assignment within each reference using SingleR() as before, without expanding to the union of all markers. We take the final label (possibly after fine-tuning!) for each cell from each reference; I will refer to this as the set of "top labels". We then do a second round of fine-tuning across those top labels, using the same algorithm as before but on the relevant set of markers.

The question here is "what is the relevant set of markers?" If we were comparing, e.g., top labels A1 (A is the reference, 1 is the label) and B2, I would say that this is the union of all genes involved in A1 vs A2, A3, ... Am and B2 vs B1, B3, ... Bn. This allows us to get a reasonably informative subset of genes without being inundated by batch effects. (There are some failure points where, say, each reference contains only a single cell type so you never get markers that define the differences between cell types... but that seems pretty pathological.)
Even though we are still taking a union, the number of genes involved is much lower than the union of all markers across all labels in all references, which should mitigate the problems with noise and speed. This approach is also appealing as it builds off exactly the same results from the assignments to individual references, reducing the scope for discrepancies in the combined results. For starters, we would now use the top label after within-reference fine-tuning, getting around the uncomfortable situation that we find ourselves in with the current algorithm.
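To spell out the "relevant set of markers" idea, here is a small sketch, assuming each reference's markers are stored as the usual nested pairwise list (markers[[labelX]][[labelY]] giving the genes that distinguish X from Y); the function and object names are illustrative, not the eventual API:

```r
# top.labels: the label chosen for one cell by each reference, e.g.
#             c(A = "A1", B = "B2").
# markers:    a list with one element per reference; markers[[r]][[x]][[y]]
#             holds the genes distinguishing label x from label y within
#             reference r.
relevant_markers <- function(top.labels, markers) {
    per.ref <- mapply(function(m, lab) {
        # All genes involved in comparing this reference's top label
        # against every other label *within the same reference*.
        unique(unlist(m[[lab]]))
    }, markers, top.labels, SIMPLIFY = FALSE)

    # Pool across references; this is still far smaller than the union of
    # all markers for all labels in all references.
    unique(unlist(per.ref))
}
```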