alanocallaghan / scater

Clone of the Bioconductor repository for the scater package.
https://bioconductor.org/packages/devel/bioc/html/scater.html

Add a `projectReducedDims` function #168

Closed by LTLA 2 years ago

LTLA commented 2 years ago

This recently came up in discussions with a user who wanted something like Seurat's ProjectUMAP. The idea is to map new data onto an existing embedding. Kind of like how snifter does it, but for any target embedding, without requiring algorithm-specific knowledge.

For general use, this is probably not a great idea, mostly because the new data may contain populations that weren't present in the old data, and so they go... who knows where. It's also slightly tedious in that the user effectively has to maintain two analyses side by side, one using only the old data and one using the new data, rather than a single analysis combining both datasets.

Nonetheless, a projection can be useful in specific cases where preservation of the existing embedding is non-negotiable. By that, I mean embeddings that have already been used in publications, where one doesn't want a new fight with the reviewers.

To this end, a quick and dirty projection function might look like:

# Completely untested
library(BiocNeighbors)

projectReducedDims <- function(old.points, new.points, old.embedding) {
    # Find the nearest old cell for each new cell in the reference space.
    res <- queryKNN(X = old.points, query = new.points, k = 1)
    # Place each new cell at its nearest neighbour's embedding coordinates.
    new.embedding <- old.embedding[res$index, , drop = FALSE]
    new.embedding
}

Basically, just plonk each new cell at the embedding location of its nearest neighbor in the old dataset, where neighbors are defined according to some low-dimensional space. Users can decide what space they want to use here; for a quick-and-dirty projection, a raw PCA might suffice, but for something more "correct", you could use the MNN-corrected PCs from batchelor.
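As a hedged usage sketch (the object names old.sce and new.sce are hypothetical; this assumes a shared "PCA" space and an existing "UMAP" in the old object):

# Hypothetical usage: 'old.sce' and 'new.sce' are SingleCellExperiment objects
# with a shared "PCA" space; the old object carries the published "UMAP".
library(SingleCellExperiment)

proj <- projectReducedDims(
    old.points = reducedDim(old.sce, "PCA"),
    new.points = reducedDim(new.sce, "PCA"),
    old.embedding = reducedDim(old.sce, "UMAP")
)
reducedDim(new.sce, "UMAP") <- proj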

A more sophisticated approach might take some kind of (weighted) mean across multiple nearest neighbors, rather than inserting each cell directly at its closest neighbor. This will probably add some jitter that makes it look more realistic. ¯\_(ツ)_/¯

alanocallaghan commented 2 years ago

Seems like a bad idea, but also something people would certainly use. I like the weighted average idea; something like:

library(BiocNeighbors)

projectReducedDim <- function(old.points, new.points, old.embedding, k = 2) {
    res <- queryKNN(X = old.points, query = new.points, k = k)
    # Inverse-distance weights, normalised to sum to 1 per cell.
    weight <- 1 / res$distance
    weight <- weight / rowSums(weight)
    # Weighted sum of the k neighbours' coordinates, dimension by dimension.
    # (rowSums, not rowMeans: the weights already sum to 1.)
    new.embedding <- sapply(seq_len(ncol(old.embedding)), function(i) {
        rowSums(
            sapply(seq_len(ncol(res$index)), function(j) {
                old.embedding[res$index[, j], i] * weight[, j]
            })
        )
    })
    new.embedding
}

I realise there's surely a more elegant way to do the nested loops.
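For instance (untested, same inputs and weights as above), the whole thing might collapse to a sum of rescaled coordinate matrices:

# Untested sketch: replaces the nested sapply, assuming 'res' and 'weight' as above.
new.embedding <- Reduce(`+`, lapply(seq_len(k), function(j) {
    old.embedding[res$index[, j], , drop = FALSE] * weight[, j]
}))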

I could also wrap the snifter stuff into scater without a terrible amount of effort.

LTLA commented 2 years ago

Check out https://github.com/LTLA/batchelor/blob/master/R/utils_tricube.R for a tricube-weighted average based on the nearest neighbors. Watch out for problems with distances of zero if you're going to use inverse weights.
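For illustration, a rough sketch of tricube weighting (not batchelor's actual code; the per-cell bandwidth choice below is just an assumption):

# Rough tricube sketch, NOT the batchelor implementation.
# 'distances' is the n-by-k matrix returned as res$distance by queryKNN.
tricubeWeights <- function(distances, ndist = 3) {
    # Per-cell bandwidth: ndist times the median neighbour distance, guarded
    # against being zero when all neighbours sit exactly on the query.
    h <- pmax(ndist * apply(distances, 1, median), 1e-8)
    # Tricube kernel: maximal weight at distance 0, zero beyond the bandwidth,
    # so zero distances are harmless here (unlike with inverse weights).
    w <- (1 - pmin(distances / h, 1)^3)^3
    w / rowSums(w)
}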

alanocallaghan commented 2 years ago

Ooh, perfect. Might just ::: that unless that's deeply frowned on

LTLA commented 2 years ago

Probably best to just copy it over, to avoid an explicit dependency on batchelor. It's not too large, and it should be easy to drag the few unit tests across just in case.

alanocallaghan commented 2 years ago

Resolved by the commit(s) above, but feel free to submit feedback/gripes here.