LTLA / batchelor

Clone of the Bioconductor repository for the batchelor package.
https://bioconductor.org/packages/devel/bioc/html/batchelor.html

semi-supervised fastMNN correction #49

Open julien-roux opened 3 months ago

julien-roux commented 3 months ago

(First, thanks Aaron for the development and maintenance of this awesome package!)

After reading this preprint, I was wondering whether a similar semi-supervised correction would be possible with fastMNN().

For example, MNN pairs could be filtered based on the prior annotation of the different batches, on the labels inferred from a SingleR run, or on the matching clusters after a clusterMNN() run. What do you think?

LTLA commented 3 months ago

I'm trying to remember, but a long time ago, we might have had similar thoughts. It would be theoretically easy to implement; just restrict the MNN pair formation to cell populations with the same annotation across batches and proceed with the rest of the algorithm. I could see how this could improve correction performance by avoiding the formation of MNN pairs between the wrong populations.
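As a rough illustration of the idea described above (not batchelor's actual code), the following pure-Python sketch restricts MNN pair formation to cells that share a label across two batches. The function names, the toy coordinates, the labels and the choice of k = 1 are all assumptions made for this example.

```python
from math import dist

def nearest(query, targets, allowed):
    """Index (drawn from `allowed`) of the target closest to `query`; k = 1."""
    return min(allowed, key=lambda j: dist(query, targets[j]))

def label_restricted_mnn(batch1, labels1, batch2, labels2):
    """Return (i, j) MNN pairs, searching only among cells with the same label.

    Hypothetical helper for illustration; cells are coordinate tuples.
    """
    pairs = []
    for label in sorted(set(labels1) & set(labels2)):
        idx1 = [i for i, lab in enumerate(labels1) if lab == label]
        idx2 = [j for j, lab in enumerate(labels2) if lab == label]
        # Nearest neighbour in the other batch, within this label only.
        nn12 = {i: nearest(batch1[i], batch2, idx2) for i in idx1}
        nn21 = {j: nearest(batch2[j], batch1, idx1) for j in idx2}
        # Keep only pairs where the nearest-neighbour relationship is mutual.
        pairs += [(i, j) for i, j in nn12.items() if nn21[j] == i]
    return pairs

# Toy data: the batch effect shifts batch 2 so that its "T" cell lands
# next to the "B" cell of batch 1.
batch1, labels1 = [(0.0, 0.0), (5.0, 5.0)], ["T", "B"]
batch2, labels2 = [(4.5, 4.5), (9.0, 9.0)], ["T", "B"]

restricted = label_restricted_mnn(batch1, labels1, batch2, labels2)
# Giving every cell the same label reproduces an unrestricted search.
unrestricted = label_restricted_mnn(batch1, ["x", "x"], batch2, ["x", "x"])
```

In this toy case the unrestricted search pairs the batch 1 "B" cell with the batch 2 "T" cell, whereas the label-restricted search pairs T with T and B with B. The sketch stops at pair formation; the rest of the correction would proceed as usual.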

In practice, this was less useful than it seemed. People don't usually come into the analysis with existing annotations for the individual batches, at least not for their own experimental data. After all, the whole point of the batch correction step is to get everything on the same coordinate system so that you only have to do clustering and annotation once; if we already had consistent labels for each batch, we would never need to compute corrected values for the rest of our analysis. Other than to generate artworks like UMAP/t-SNE, perhaps, but I don't think those have much scientific value.

I guess that this functionality might have some appeal for secondary analyses of published datasets that have already been annotated. However, this leads to another problem, which is the harmonization of labels across datasets from different authors. Some poor soul has to go through each combination of datasets and decide which labels match up between them; easy enough for the major cell types, but difficult for the more ambiguous subtypes that might have differing terminology/definitions across the community. Making a mistake here would encourage the formation of the wrong MNN pairs - and frankly, if you already know which cell types match up between datasets, you can probably just proceed with the rest of your meta-analysis without computing corrected values (artistic endeavors aside).

In the end, I must have decided against putting in this functionality. Nonetheless, batchelor still contains a vestige of this line of thought, in the form of the restrict= argument to some of the functions. This was put there when I thought cell controls were definitely going to be a thing; it restricts the MNN pair formation to the control subpopulation within each batch, thus encouraging more accurate correction by focusing on the controls that must be the same across batches. Nowadays I think it's fair to say that no one cares about cell controls and that restrict= was not a helpful option.
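To make the restrict= idea concrete, here is a minimal pure-Python sketch of restricting MNN pair formation to a chosen subset of cells in each batch. It is not batchelor's implementation: the function, its `restrict1`/`restrict2` parameters, the one-dimensional toy data and k = 1 are assumptions for illustration only.

```python
from math import dist

def mnn_pairs(batch1, batch2, restrict1=None, restrict2=None):
    """k = 1 MNN pairs between two batches of coordinate tuples.

    restrict1/restrict2 (hypothetical parameters) list the indices of the
    cells allowed to take part in pair formation, e.g. the control cells.
    """
    idx1 = list(range(len(batch1))) if restrict1 is None else list(restrict1)
    idx2 = list(range(len(batch2))) if restrict2 is None else list(restrict2)
    # Nearest allowed neighbour in the other batch, in both directions.
    nn12 = {i: min(idx2, key=lambda j: dist(batch1[i], batch2[j])) for i in idx1}
    nn21 = {j: min(idx1, key=lambda i: dist(batch1[i], batch2[j])) for j in idx2}
    # A pair is kept only if each cell is the other's nearest neighbour.
    return [(i, j) for i, j in nn12.items() if nn21[j] == i]

# Toy data: index 0 in each batch is a control cell, index 1 is not.
batch1 = [(0.0,), (10.0,)]
batch2 = [(3.0,), (9.0,)]

all_pairs = mnn_pairs(batch1, batch2)
control_pairs = mnn_pairs(batch1, batch2, restrict1=[0], restrict2=[0])
```

Without restriction both cells form pairs; with the restriction only the control cells do, so any correction derived from the pairs would be driven by the controls alone.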

julien-roux commented 3 months ago

Yes, I think those are fair points. Thanks for your input!