Pruned labels doubts - GitHub Issues

MartaBenegas commented 4 months ago

Hi SingleR Team!

I'm quite confused by the pruned.labels output. I expected it to contain the candidate labels that were ultimately not assigned to the cell after fine-tuning.

However, in my dataset the labels and pruned.labels columns appear to be identical.

Additionally, in your vignette it says:

SingleR() will also report the pruned scores automatically in the pruned.labels field where low-quality assignments are replaced with NA.

So, if a label has been pruned, an NA is assigned on the pruned.labels? Shouldn't the NA be present in the labels column? I would expect that, if an assignment is not reliable, no cell label should be present in the labels column.

Moreover, I've seen in your vignette and in this YouTube tutorial that the output should also contain first.labels and pruned.scores columns, but I don't get those columns in the output, even when I explicitly enable fine-tuning and pruning:

> prediction <- SingleR(test=query.counts, ref=se, labels=se@colData[,cellType], de.method="wilcox", prune = TRUE, fine.tune = TRUE)
> pred.annot <- prediction$labels
> prediction
DataFrame with 7800 rows and 4 columns
                                                      scores                 labels delta.next          pruned.labels
                                                    <matrix>            <character>  <numeric>            <character>
SAMEA7876155-ACGCCGAGTTCGAATC 0.220834:0.534054:0.294939:... CD14-positive monocyte  0.0768004 CD14-positive monocyte
SAMEA7876155-ACGTCAACACCAACCG 0.234576:0.500540:0.295685:... CD14-positive monocyte  0.0620378 CD14-positive monocyte
SAMEA7876155-AGATTGCCAGTCTTCC 0.231756:0.535370:0.273136:... CD14-positive monocyte  0.0816436 CD14-positive monocyte
SAMEA7876155-ATAACGCTCCTAAGTG 0.217160:0.543401:0.257181:... CD14-positive monocyte  0.0509150 CD14-positive monocyte
SAMEA7876155-ATAAGAGAGAAACGCC 0.226850:0.535741:0.279136:... CD14-positive monocyte  0.2992072 CD14-positive monocyte
...                                                      ...                    ...        ...                    ...
SAMEA7876156-CTTAACTCAGGATCGA 0.252072:0.297616:0.269206:...           promyelocyte  0.0939441           promyelocyte
SAMEA7876156-CTTAGGAAGGGAAACA 0.285303:0.356261:0.299967:...           promyelocyte  0.0751877           promyelocyte
SAMEA7876156-GTATCTTAGCCTTGAT 0.259048:0.341265:0.302921:...        progenitor cell  0.2192205        progenitor cell
SAMEA7876156-TACTTACTCCTGTACC 0.269160:0.347660:0.282335:...        progenitor cell  0.3904148        progenitor cell
SAMEA7876156-TAGTTGGTCGATCCCT 0.219187:0.300379:0.274991:...           promyelocyte  0.2835777           promyelocyte

Moreover, in your vignette you mention those columns but they do not appear in the example:

Under which circumstances are the columns first.labels and pruned.scores shown or not shown? Additionally, what's the difference between fine-tuning and pruning?

Sorry if I missed something!

j-andrews7 commented 4 months ago

So, if a label has been pruned, an NA is assigned on the pruned.labels?

This is correct.

Shouldn't the NA be present in the labels column? I would expect that, if an assignment is not reliable, no cell label should be present in the labels column.

The labels column contains the best guess. If you want a better idea of ambiguity, use the pruned.labels.
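
As a quick way to see how the two columns relate, the sketch below counts pruned cells and checks that every unpruned cell carries the same value in both columns. The `prediction` object here is a hand-made toy data frame with invented values, not real SingleR output; only the column names `labels` and `pruned.labels` come from the output shown above.

```r
# Toy stand-in for a SingleR() result; all values are invented for illustration.
prediction <- data.frame(
    labels = c("CD14-positive monocyte", "CD14-positive monocyte",
               "promyelocyte", "progenitor cell"),
    pruned.labels = c("CD14-positive monocyte", NA,
                      "promyelocyte", "progenitor cell"),
    stringsAsFactors = FALSE
)

# A cell is pruned when its entry in pruned.labels is NA.
is.pruned <- is.na(prediction$pruned.labels)
n.pruned <- sum(is.pruned)

# For cells that were not pruned, both columns should agree.
kept.match <- all(prediction$labels[!is.pruned] ==
                  prediction$pruned.labels[!is.pruned])

# Tabulate the pruned labels, keeping NA as its own category.
print(table(prediction$pruned.labels, useNA = "ifany"))
```

The same `is.na()` and `table()` calls should work on the `DataFrame` returned by `SingleR()`, since its columns can be accessed with `$` in the same way.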

Under which circumstances are the columns first.labels and pruned.scores shown or not shown?

The vignette is seemingly just out of date, and these columns are no longer returned. The vignette was superseded by a book, but it's not currently live due to build issues. The function docstrings (e.g. ?SingleR) are reliable sources.

Additionally, what's the difference between fine-tuning and pruning?

Fine-tuning is a way to improve resolution of closely related labels, e.g. subtypes of a broader cell type. Those subtypes will have less separation between them and will all score well when compared to all other labels. When used, fine-tuning will take those labels that score well and perform another round of marker finding and scoring for just those labels to determine the highest scorer among them, which will then be returned as the label.
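
To make the first step of that description concrete, here is a minimal sketch of how the "labels that score well" might be picked for one cell. The scores are invented, and the 0.05 cutoff is an assumption chosen to resemble the `tune.thresh` argument of `SingleR()`. The second step (finding markers among the candidates and re-scoring) is not shown, and this is not the package's actual implementation.

```r
# Invented per-label scores for a single cell.
scores <- c("CD14-positive monocyte" = 0.53,
            "macrophage"             = 0.50,
            "promyelocyte"           = 0.29,
            "progenitor cell"        = 0.22)

# Assumed cutoff: keep every label scoring within this distance of the best.
tune.thresh <- 0.05

# Candidate labels that would go into another round of scoring.
candidates <- names(scores)[scores >= max(scores) - tune.thresh]
print(candidates)

# The label that would be reported without any fine-tuning.
first.guess <- names(scores)[which.max(scores)]
```

In this toy case the two closely related labels survive as candidates, and a further round restricted to them would decide between the two.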

Pruning is used to prevent erroneous labels of cell types not well represented in the reference dataset. For example, if your dataset has a population of macrophages that are not present in the reference being used, they may score relatively equally for a number of the labels present. But in comparison to other cell types in your test set that are represented, they'll have poor separation between their top score and the next best. As such, they get labeled as ambiguous/low confidence. You can read ?pruneScores for more details about the process.
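
The outlier idea can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the code behind `pruneScores()`: the `delta` values are invented, `prune_by_mad()` is a hypothetical helper, and the rule (flag a cell whose delta falls more than three MADs below the median delta of cells with the same label) is only meant to mirror the general approach described in `?pruneScores`.

```r
# Invented per-cell deltas (separation of the assigned label's score from the rest)
# together with the label assigned to each cell.
delta <- c(0.30, 0.28, 0.31, 0.29, 0.05, 0.20, 0.22, 0.21)
label <- c(rep("CD14-positive monocyte", 5), rep("promyelocyte", 3))

# Hypothetical helper: flag cells whose delta is unusually low for their label.
prune_by_mad <- function(delta, label, nmads = 3) {
    # One threshold per label: median minus nmads median absolute deviations.
    threshold <- tapply(delta, label, function(d) median(d) - nmads * mad(d))
    as.vector(delta < threshold[label])
}

to.prune <- prune_by_mad(delta, label)

# Pruned cells get NA; all other cells keep their assigned label.
pruned.labels <- ifelse(to.prune, NA_character_, label)
print(data.frame(label, delta, pruned.labels))
```

Because the threshold is computed relative to other cells with the same label, whether a given cell is pruned depends on the rest of the dataset, which is one reason the result is best read as a relative indication of confidence.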

LTLA commented 4 months ago

Shouldn't the NA be present in the labels column? I would expect that, if an assignment is not reliable, no cell label should be present in the labels column.

Also some historical context: the original version of SingleR didn't do any pruning in the reported labels, and to avoid introducing a change in results for active users, we kept the unpruned labels in labels. In addition, I would say the pruning itself is... just okay. It's our best guess for what is a "bad" assignment, but it's hard to say it with much certainty because all of these things are relative. At least I wasn't confident enough in the pruning to force it on everyone else.

LTLA commented 4 months ago

Moreover, in your vignette you mention those columns but they do not appear in the example:

Yes, as @j-andrews7 mentioned, I just forgot to update the vignette when we switched implementations. The new implementation is based on the singlepp C++ library and doesn't provide the pre-fine-tuning labels by default. If these are needed, you could just set fine.tune=FALSE to compute them explicitly.

MartaBenegas commented 4 months ago

@j-andrews7 @LTLA thank you very much for your detailed explanations! They helped a lot.

Just to double-check: then, it is expected that the labels and the pruned.labels columns contain the same cell labels, but the pruned.labels will contain NA in case the label has been pruned, right?

At first, I misunderstood the pruned.labels column and I thought it contained what would be present in the old first.labels column.

dtm2451 commented 4 months ago

Third eyes now =)

Just to double-check: then, it is expected that the labels and the pruned.labels columns contain the same cell labels, but the pruned.labels will contain NA in case the label has been pruned, right?

Correct! pruned.labels is the more confident set, where each value is either NA or the same as in labels.

MartaBenegas commented 4 months ago

Crystal clear! Thanks again :D I'll close the issue now.

SingleR-inc / SingleR

Pruned labels doubts #262