As described in #99, I have written a new function that pushes the combining back to the last step. This allows us to run SingleR() to obtain classification results for each reference independently, which provides more flexibility and, in theory, greater speed and accuracy, as each reference doesn't need to include irrelevant genes that might be markers only in other references.
The new combining step is run at the very end and recomputes the scores in a manner that is (more) comparable across references. For each cell, it identifies the assigned label from each reference and collects all markers for those labels. It then recomputes the score for each assigned label in each reference and chooses the reference that has the highest recomputed score.
While there is still a score calculation across a union of markers, this should be faster and more specific, as the union is only taken across the assigned labels for each cell, not across all labels in all references. In effect, we shift the compute time from SingleR() to combineRecomputedResults().
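For concreteness, here is a rough, self-contained sketch of that per-cell logic in R. It is not the actual implementation: the object names are illustrative, and a plain 0.8 quantile of Spearman correlations stands in for the real score calculation.

```r
# Illustrative sketch only. 'assigned' holds the label chosen for this cell by
# each reference, 'refs' is a list of log-expression matrices (genes x samples,
# labels as column names, shared gene universe), and 'markers' is a
# per-reference list giving the marker genes for each label.
recombine_one_cell <- function(cell.expr, assigned, refs, markers) {
    # Pool the markers of the assigned labels only, not of all labels.
    genes <- unique(unlist(mapply(function(m, lab) m[[lab]], markers, assigned)))

    # Recompute each assigned label's score on this shared gene set.
    scores <- mapply(function(ref, lab) {
        samples <- ref[genes, colnames(ref) == lab, drop = FALSE]
        quantile(cor(cell.expr[genes], samples, method = "spearman"), 0.8)
    }, refs, assigned)

    # The reference with the highest recomputed score supplies the final label.
    names(which.max(scores))
}
```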
In testing with diverse datasets, I find the new and old methods to perform comparably from a quality of scores perspective:
No. | Datasets | refs |
---|---|---|
1 | human sorted naive CD4 and CD8 T cells | HPCA & BlueprintEncode |
2 | human sorted CD34+ hematopoietic stem & progenitor cells | HPCA & BlueprintEncode |
3 | human brain cells dataset 1 | HPCA & BlueprintEncode |
4 | human brain cells dataset 2 | HPCA & BlueprintEncode & manually annotated No. 3 |
No. | # cells in test | Time (old method)*** | Time (new method)*** |
---|---|---|---|
1 | 40,152 | 1648.493 sec | 1131.123 sec |
2 | 5,183 | 283.533 sec | 405.79 sec |
3 | 8,227 | 2067.689 sec | 1461.29 sec |
4 | 70,634 | 21954.559 sec | 8382.182 sec |
*** = available memory may not have been equal across runs, and n = 1 for each.
1 & 4: For quality data where the provided refs match the test cells and cell type borders are non-fuzzy, both methods perform great:
2: For low genes-per-cell data where provided refs match the test cells but borders between cells are fuzzy due to progressive differentiation:
3: For quality data where provided refs did not match the test cells:
Great insights! I also felt that the neuronal reference sets are lacking; i.e., I found their performance to be significantly less trustworthy than for immune cells. But then again, I was doing it on brain organoids, which have all kinds of issues, so I was hesitant to place the blame squarely on the reference. Do you already have an idea where to get better neuronal reference data from?
My No. 3 and 4 above can both be used. They both have published annotations =). I just need to nudge my labmate to help me get them organized properly and either into scRNAseq or directly into SingleR.
3 = Grubman et al., Nature Neuroscience 2019, https://www.nature.com/articles/s41593-019-0539-4
4 = Mathys et al., Nature 2019, https://www.ncbi.nlm.nih.gov/pubmed/31042697
A reproducible example to test performance. Note that this is without the changes Aaron made last night. It uses an everything-and-the-kitchen-sink approach, with all the references that contain basically any immune/haematopoietic cells.
library(TENxPBMCData)
library(scater)
library(scran)
library(SingleR)
library(dittoSeq)
# Load data
sce <- TENxPBMCData(dataset = "pbmc4k")
# Basic filtering
# Mitochondrial genes are matched by symbol, since the row names are still Ensembl IDs at this point.
per.cell <- perCellQCMetrics(sce, subsets=list(Mito=grep("^MT-", rowData(sce)$Symbol_TENx)))
qc.stats <- quickPerCellQC(per.cell, percent_subsets="subsets_Mito_percent")
sce <- sce[,!qc.stats$discard]
keep <- nexprs(sce, byrow=TRUE) > 0
sce <- sce[keep, ]
sce <- logNormCounts(sce)
# Get variable genes
dec <- modelGeneVar(sce)
top.hvgs <- getTopHVGs(dec, prop=0.1)
# Dim reductions
sce <- runPCA(sce, subset_row=top.hvgs)
sce <- runUMAP(sce)
output <- getClusteredPCs(reducedDim(sce))
npcs <- metadata(output)$chosen
reducedDim(sce, "PCAsub") <- reducedDim(sce, "PCA")[,1:npcs,drop=FALSE]
npcs
# Clustering
g <- buildSNNGraph(sce, use.dimred="PCAsub")
cluster <- igraph::cluster_walktrap(g)$membership
sce$cluster <- factor(cluster)
# Switch to gene symbols so that the row names match the references
rownames(sce) <- rowData(sce)$Symbol_TENx
# Load all references containing immune/haematopoietic cells
bp <- BlueprintEncodeData()
mona <- MonacoImmuneData()
dice <- DatabaseImmuneCellExpressionData()
dmap <- NovershternHematopoieticData()
hpca <- HumanPrimaryCellAtlasData()
refs <- list(BP=bp, Monaco=mona, DICE=dice, DMAP=dmap, HPCA=hpca)
fine.labels <- list(bp$label.fine, mona$label.fine, dice$label.fine, dmap$label.fine,
hpca$label.fine)
main.labels <- list(bp$label.main, mona$label.main, dice$label.main,
dmap$label.main, hpca$label.main)
message("Common mode with fine labels and 'single' method:")
system.time(pred.fine.com <- SingleR(test = sce, ref = refs, labels = fine.labels,
method = "single", recompute = FALSE))
message("Recompute mode with fine labels and 'single' method:")
system.time(pred.fine.recomp <- SingleR(test = sce, ref = refs, labels = fine.labels,
method = "single"))
Common mode with fine labels and 'single' method:
   user  system elapsed 
721.188  72.391 793.385 
Recompute mode with fine labels and 'single' method:
   user  system elapsed 
484.297  47.078 531.136 
message("Common mode with fine labels, 'single' method, and no fine-tuning:")
system.time(pred.fine.com.noft <- SingleR(test = sce, ref = refs, labels = fine.labels,
method = "single", fine.tune = FALSE, recompute = FALSE))
message("Recompute mode with fine labels, 'single' method, and no fine-tuning:")
system.time(pred.fine.recomp.noft <- SingleR(test = sce, ref = refs, labels = fine.labels,
method = "single", fine.tune = FALSE))
Common mode with fine labels, 'single' method, and no fine-tuning:
   user  system elapsed 
211.610  66.437 277.895 
Recompute mode with fine labels, 'single' method, and no fine-tuning:
   user  system elapsed 
204.344  42.844 247.159 
message("Common mode with main labels and 'single' method:")
system.time(pred.main.com <- SingleR(test = sce, ref = refs, labels = main.labels,
method = "single", recompute = FALSE))
message("Recompute mode with main labels and 'single' method:")
system.time(pred.main.recomp <- SingleR(test = sce, ref = refs, labels = main.labels,
method = "single"))
Common mode with main labels and 'single' method:
   user  system elapsed 
349.578  33.422 382.921 
Recompute mode with main labels and 'single' method:
   user  system elapsed 
174.375  21.515 195.874 
message("Common mode with main labels, 'single' method, and no fine-tuning:")
system.time(pred.main.com.noft <- SingleR(test = sce, ref = refs, labels = main.labels,
method = "single", fine.tune = FALSE, recompute = FALSE))
message("Recompute mode with main labels, 'single' method, and no fine-tuning:")
system.time(pred.main.recomp.noft <- SingleR(test = sce, ref = refs, labels = main.labels,
method = "single", fine.tune = FALSE))
Common mode with main labels, 'single' method, and no fine-tuning:
   user  system elapsed 
119.812  29.094 148.849 
Recompute mode with main labels, 'single' method, and no fine-tuning:
   user  system elapsed 
 90.110  18.609 108.645 
Overall, the recompute method does speed things up, at least for this dataset. Another observation is that I think it generally provides better granularity compared to the common method.
table(pred.fine.com$pruned.labels)
CD4+ T-cells CD4+ Tcm
546 111
CD4+ Tem CD8+ T-cells
150 384
CD8+ Tcm CD8+ Tem
655 321
Class-switched memory B-cells DC
50 1
GMP Intermediate monocytes
2 10
Megakaryocytes Memory B-cells
4 144
MEP Monocyte:CD16+
5 15
Monocytes Monocytes, CD16+
1049 1
MPP Myeloid dendritic cells
2 81
Naive B cells naive B-cells
1 405
Naive CD4 T cells NK cells
2 190
Non classical monocytes Non-switched memory B cells
16 1
Plasmablasts Plasmacytoid dendritic cells
1 35
Tregs
109
The recompute results:
table(pred.fine.recomp$pruned.labels)
B cells, naive B_cell
1 15
B_cell:immature B_cell:Naive
9 4
CD4+ T-cells CD4+ Tcm
357 95
CD4+ Tem CD8+ T-cells
140 173
CD8+ Tcm CD8+ Tem
489 250
Central memory CD8 T cells Class-switched memory B-cells
17 36
DC Exhausted B cells
1 17
Follicular helper T cells GMP
3 2
Intermediate monocytes MAIT cells
6 10
Megakaryocytes Memory B-cells
3 108
MEP Monocyte:CD14+
4 22
Monocyte:CD16- Monocyte:CD16+
515 136
Monocytes Monocytes, CD14+
337 35
Monocytes, CD16+ MPP
1 1
Myeloid dendritic cells Naive B cells
112 208
naive B-cells Naive CD4 T cells
169 241
Naive CD8 T cells Natural killer cells
64 15
NK cells NK_cell:CD56hiCD62L+
155 10
Non-switched memory B cells Non-Vd2 gd T cells
22 5
Plasmablasts Plasmacytoid dendritic cells
1 35
Pre-B_cell_CD34- Progenitor cells
7 1
Switched memory B cells T cells, CD4+, memory TREG
15 2
T cells, CD4+, naive T cells, CD4+, naive TREG
5 1
T cells, CD4+, TFH T cells, CD8+, naive
9 32
T regulatory cells T_cell:CD4+
17 12
T_cell:CD4+_central_memory T_cell:CD4+_effector_memory
177 24
T_cell:CD4+_Naive T_cell:CD8+
34 14
T_cell:gamma-delta Terminal effector CD8 T cells
1 20
Th1 cells Th1/Th17 cells
10 1
Th17 cells Th2 cells
1 2
Tregs Vd2 gd T cells
62 25
This is somewhat muddied here due to the overlapping/similar labels between the references, but in general, fewer cells are classified with low-granularity labels (e.g. CD4+ T cell) using this method. I've also observed this using the cluster mode (not shown).
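For a rough sense of the difference, one could tally how many cells end up with the broadest labels in each run. A quick back-of-the-envelope sketch; the "coarse" set below is hand-picked from the tables above purely for illustration:

```r
# Hand-picked examples of low-granularity labels seen in the tables above.
coarse <- c("CD4+ T-cells", "CD8+ T-cells", "Monocytes", "B_cell", "NK cells")

# Fraction of cells assigned one of these broad labels by each method.
mean(pred.fine.com$pruned.labels %in% coarse)
mean(pred.fine.recomp$pruned.labels %in% coarse)
```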
Will post images and more info once I get a chance to finish harmonizing the labels between references.
Excellent insights from @dtm2451 and @j-andrews7. It seems that we get a modest speed boost in many cases and accuracy that is, at the very least, no worse than before.
I'll add one more theoretical comment for future reference. As I may have previously mentioned, the inter-reference score calculations are done using the markers for the top scoring label in each reference. That is, we take the markers that distinguish the top-scoring label from other labels in the same reference, and then we take the union of those markers across all references. This approach improves speed by avoiding the use of a union of all markers across all references.
A potential pitfall of this approach is that, in theory, we may not include "negative markers", i.e., genes that are not expressed in the true label for a given cell. Without these negatives, we are computing correlations across only the set of positive markers, equivalent to looking for within-cell-type structure rather than across-cell-type structure. If you want to visualize this, have a look at common depictions of Simpson's paradox where removing a group of points can change the correlation.
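To make this concrete, here is a toy illustration with invented numbers, using plain Spearman correlations as a stand-in for the actual score:

```r
# Genes 1-3 are "positive" markers of the assigned label; genes 4-6 are
# "negative" markers that the true cell type does not express.
test  <- c(5, 4, 6, 0, 0, 1)   # the test cell
right <- c(5, 6, 4, 0, 0, 0)   # profile of the correct label
wrong <- c(6, 3, 7, 8, 9, 7)   # profile of an incorrect label

# Across all six genes, the correct label wins comfortably:
cor(test, right, method = "spearman")   # ~0.71
cor(test, wrong, method = "spearman")   # ~-0.68

# Restricted to the three positive markers, the ranking flips, because we are
# now correlating within-cell-type noise rather than across-cell-type structure:
cor(test[1:3], right[1:3], method = "spearman")   # -1
cor(test[1:3], wrong[1:3], method = "spearman")   #  1
```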
In practice, I don't think this is a problem, mostly because we would only fail to include negative markers if all of the top-scoring labels are correct! Well, perhaps it's not so clear-cut, but the most egregious failures of the inter-reference score calculation should only occur when the top-scoring labels are all close enough to the correct cell type that it doesn't really matter all that much.
Anyway, I think I'm pretty happy with this and will merge #99. I have found this division of labor among ourselves to be quite effective and I hope we will be able to continue doing this moving forward; I no longer have the knowledge (and certainly not the time) to wade through datasets to test things out to this depth. See also ab61e15a93b05e94eef5b9471fec7ae3fb0c089f if you want to impress people in a bar.
Listing thoughts here so that I don't forget them. Maybe someone would be willing to help implement this; it shouldn't take (much) C++ code.
Context:
Our current approach to dealing with multiple references is to do the assignment within each reference; compare the top non-fine-tuned scores across references; and pick the (possibly fine-tuned) label from the reference with the highest non-fine-tuned score. Most of the work is done within each reference, which is logistically convenient and side-steps problems with batch effects between references. The fact that we compare non-fine-tuned scores is rather unavoidable, because the scores are not comparable after fine-tuning, but that's probably okay for the most part; the goal is just to get us close enough to avoid obviously wrong matches to completely different cell types.
However, a practical difficulty with this approach is that we must always take the union of genes from all references in the initial search for each reference, otherwise the scores are not directly comparable. This does not scale well as we combine references that contain more diverse cell types; it is possible that we will end up including a good chunk of the transcriptome in our initial score calculation. This slows everything down, which is bad enough, but more problematic is the increased risk of irrelevant noise overwhelming the signal. The latter can potentially cause the correct cell type to be discarded before we can even perform fine-tuning that might salvage the situation.
Proposed solution:
The lightest touch is to adjust how we combine results from multiple references. Namely, we perform assignment within each reference using SingleR() as before, without expanding to the union of all markers. We take the final label (possibly after fine-tuning!) for each cell from each reference; I will refer to this as the set of "top labels". We then do a second round of fine-tuning across those top labels, using the same algorithm as before but on the relevant set of markers.

The question here is "what is the relevant set of markers?" If we were comparing, e.g., top labels A1 (A is the reference, 1 is the label) and B2, I would say that this is the union of all genes involved in A1 vs A2, A3, ... Am and B2 vs B1, B3, ... Bn. This allows us to get a reasonably informative subset of genes without being inundated by batch effects. (There are some failure points where, say, each reference contains only a single cell type so you never get markers that define the differences between cell types... but that seems pretty pathological.)
Even though we are still taking a union, the number of genes involved is much lower than the union of all markers across all labels in all references, which should mitigate the problems with noise and speed. This approach is also appealing as it builds off exactly the same results from the assignments to individual references, reducing the scope for discrepancies in the combined results. For starters, we would now use the top label after within-reference fine-tuning, getting around the uncomfortable situation that we find ourselves in with the current algorithm.
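To spell out the "relevant set of markers" idea, here is a small sketch, assuming each reference's markers are stored as the usual nested pairwise list (markers[[labelX]][[labelY]] giving the genes that distinguish X from Y); the function and object names are illustrative, not the eventual API:

```r
# top.labels: the label chosen for one cell by each reference, e.g.
#             c(A = "A1", B = "B2").
# markers:    a list with one element per reference; markers[[r]][[x]][[y]]
#             holds the genes distinguishing label x from label y within
#             reference r.
relevant_markers <- function(top.labels, markers) {
    per.ref <- mapply(function(m, lab) {
        # All genes involved in comparing this reference's top label
        # against every other label *within the same reference*.
        unique(unlist(m[[lab]]))
    }, markers, top.labels, SIMPLIFY = FALSE)

    # Pool across references; this is still far smaller than the union of
    # all markers for all labels in all references.
    unique(unlist(per.ref))
}
```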