brianhie / geosketch

Geometry-preserving random sampling
http://geosketch.csail.mit.edu
MIT License

Determining optimal sampling rate #6

Closed gdagstn closed 5 years ago

gdagstn commented 5 years ago

Hi,

I have been playing around with geosketch with the aim of speeding up the creation of large single cell reference panoramas with scanorama.

I am using the first 20 components from PCA (in the Seurat implementation after SCTransform) to define the transcriptomic space and so far I'm quite happy with the results.

Here you can see a test I'm running on the Lake et al. human snRNA-seq brain dataset. It's small compared to the Saunders et al. mouse brain, but it's one of the datasets I will integrate; it may not make much sense to sketch an already small dataset, but I just wanted to practice on it.

[GIFs: movie_umap and movie_barplot — UMAP embedding and per-cell-type barplot across sketch sizes]

First question: is there an advantage in using geosketch on a relatively high-dimensional reduction such as the first 20 components in PCA, over using it on the 2 dimensions of the UMAP reduction (as returned by the Seurat implementation)?

Second question: eventually I am going to integrate a sizeable number of datasets, and I was wondering how to determine the "sweet spot" in the trade-off between a small sampling rate (fewer cells -> faster downstream batch correction/stitching) and representativity (enough cells per cell type, enough "marker" genes expressed at comparable levels, reproducibility of clusters, etc.).

I reckon that there are some possible empirical solutions, each of them making a bunch of assumptions:

1) determine a sampling rate such that the smallest cluster still retains a minimum, arbitrary number of cells (see the sketch after this list)

2) choose an arbitrary cutoff on the partial Hausdorff distance [1]: I tried sweeping the sampling rate from 10% to 99% with 10 iterations and q = 1e-3 (set higher than your suggestion to account for the smaller number of cells); the results on the partial HD are here:

[Figure: Lake_partialHD_10reps — partial Hausdorff distance across sampling rates, 10 repetitions]

I was somehow expecting the curve to flatten out at higher sampling percentages, but I guess this does not happen because of the small sample size?

3) same as in 2), but using BAMI or any other measure of consistency between clusterings, as in your paper

4) same as in 2), but using consistency of marker finding between full and sketched datasets (may be biased towards high sampling rates)

5) same as in 2), but using consistency of per-cluster and per-dataset dispersion estimation (may be biased towards high sampling rates)
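
For option 1, a minimal empirical sketch (assuming the same reticulate geosketch$gs call used in the code further down this thread, PC embeddings in lake.pca, cluster identities in Idents(lake.seurat), and an arbitrary threshold min.cells) could look like this:

library(reticulate)
geosketch <- import("geosketch")

min.cells    <- 20                         # arbitrary per-cluster minimum
sketch.sizes <- seq(300, 3000, by = 100)
idents       <- Idents(lake.seurat)

smallest.kept <- sapply(sketch.sizes, function(n) {
    idx <- as.numeric(geosketch$gs(lake.pca[, 1:20], as.integer(n)))
    min(table(idents[idx]))                # size of the rarest cluster in the sketch
})

# smallest sketch size whose rarest cluster still has at least min.cells cells
sketch.sizes[which(smallest.kept >= min.cells)[1]]

Since geosketch is randomized, averaging over a few repetitions per size (as in the partial HD sweep in footnote [1]) would make this estimate more stable.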

Do you have any other suggestions? I know you applied many metrics to determine how low one could go in the sampling rate without losing too much information, but it seems that the performance of the sketching procedure varies across datasets and sampling rates (Fig. 3D of your paper), perhaps according to sample complexity and/or sample size.

Many thanks in advance and thanks for the very nice tool.

[1]: This is my R implementation of the partial HD, written by editing the pracma::hausdorff_dist function:


partial_hd <- function(X, S, q)
{
    stopifnot(is.numeric(X), is.numeric(S), is.numeric(q))
    if (is.vector(X))
        X <- matrix(X, ncol = 1)
    if (is.vector(S))
        S <- matrix(S, ncol = 1)
    if (ncol(X) != ncol(S))
        stop("'X' and 'S' must have the same number of columns.")
    if (q > 1 || q < 0)
        stop("'q' must be between 0 and 1 inclusive.")
    # rank of the directed distance to report: the (1 - q) quantile of the
    # minimum distances from each point of X to its nearest point in S
    K <- max(1, floor((1 - q) * nrow(X)))
    D <- pracma::distmat(X, S)
    min_XS <- sort(apply(D, 1, min), decreasing = FALSE)
    return(min_XS[K])
}
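
For reference, the sweep described in point 2 could be run roughly as follows (a sketch only, reusing the lake.pca matrix and the reticulate geosketch import from the code further down; the rates, repetitions, and q are the values mentioned above):

rates <- c(seq(0.1, 0.9, by = 0.1), 0.99)
reps  <- 10
q     <- 1e-3

phd <- sapply(rates, function(r) {
    n <- as.integer(round(r * nrow(lake.pca)))
    replicate(reps, {
        idx <- as.numeric(geosketch$gs(lake.pca[, 1:20], n))
        partial_hd(lake.pca[, 1:20], lake.pca[idx, 1:20], q)
    })
})

# one row per repetition, one column per sampling rate
matplot(rates, t(phd), pch = 16, col = "grey70",
        xlab = "sampling rate", ylab = "partial Hausdorff distance")
lines(rates, colMeans(phd), lwd = 2)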

The function's nomenclature assumes that you want to compute the distance from the full set X to the sketch S, whereas in the Huttenlocher paper the partial HD is computed from S to X. It should not make a big difference, although the opposite direction may need a different value of q when |X| >> |S|.

brianhie commented 5 years ago

Hi @gdagstn,

First off, those GIFs are super cool. Can I reuse them when giving talks or presenting geosketch, with appropriate credit to you of course?

First question:

I'd recommend sticking with PCA, since UMAP, t-SNE, etc. introduce substantial density distortion. A nice property of running geosketch on scaled PCs, which is a consequence of using the same hypercube side length across all dimensions, is that the contribution of many of the lower-variance PCs is automatically ignored by the algorithm. So even if you sketch using the top 20 or 100 PCs, the "effective" number of PCs will actually be much smaller (empirically we've seen this to be around 5, or even fewer, on scRNA-seq datasets).
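
A rough way to see this on your own data (just a sketch; X.pcs here is a hypothetical stand-in for your cells-by-PCs matrix): with a common box side across all dimensions, a PC whose total range is much smaller than that of the leading PCs spans very few grid intervals and therefore contributes little to the covering.

# relative spread of each PC; values near zero mark PCs the grid barely "sees"
pc.range <- apply(X.pcs, 2, function(x) diff(range(x)))
round(pc.range / max(pc.range), 2)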

Second question:

Choosing the "right" sketch size is a challenging task, largely because it's difficult to say what is "right" in a general sense. After reasoning about this, we decided to make the sketch size a parameter set by the user, hoping it would be motivated by some downstream application or even by external resource constraints. As you mentioned, even more formally motivated analyses (e.g., the Hausdorff distance or a Chernoff bound on some distributional statistic) have underlying assumptions that may or may not be best for specific applications.

All of your empirical solutions to finding a good sketch size are definitely reasonable. Another way to think about the sketch size, if your end goal is integration, is to pick a sketch size that does not diminish the "quality" of an integrative transformation too much (or perhaps even improves the "quality"). Again, choosing the right metrics for quantifying integration quality is also somewhat of an art, but at least you'll have some way to relate the sketch size parameter to your intended application.

Another thing to consider is choosing a sketch size that fits within your computational resource budget. I'm not sure how much compute you have access to relative to the size of the integration you want to accomplish, but you can, for example, set a resource cap and choose the sketch size that way (e.g., so that integration takes no longer than 24 hours).
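
One crude way to operationalize this (a sketch under loose assumptions: run_integration() is a hypothetical stand-in for whatever integration call you actually use, e.g. Scanorama via reticulate, X.pcs is your PC matrix, and the candidate sizes are arbitrary):

library(reticulate)
geosketch <- import("geosketch")

budget.hours <- 24
best.n <- NA
for (n in c(5000, 10000, 20000, 50000)) {
    idx     <- as.numeric(geosketch$gs(X.pcs, as.integer(n)))
    elapsed <- system.time(run_integration(idx))["elapsed"] / 3600
    if (elapsed > budget.hours) break
    best.n <- n   # largest sketch size that still fits within the budget
}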

Great to hear from you and glad the tool is helpful!

gdagstn commented 5 years ago

Hi Brian, I'm glad you like the GIFs; sure, feel free to use them, and thanks for the credit. Should you want to reproduce them, this is the code I used:


library(ggplot2)
library(Seurat)
library(scater)
library(reticulate)
library(colorspace)

geosketch <- import("geosketch")

# function to zero-pad numbers so that the file names sort correctly in the shell

zeropad <- function(numbers){
    width <- nchar(as.character(max(numbers))) + 1
    for (i in seq_along(numbers)) {
        pad <- paste0(rep(0, width - nchar(as.character(numbers[i]))), collapse = "")
        numbers[i] <- paste0(pad, numbers[i])
    }
    return(numbers)
}

#lake is the .RDS SCEset downloaded from: https://hemberg-lab.github.io/scRNA.seq.datasets/human/brain/
lake <- readRDS("../../publicdata/Hemberg_scRNA_datasets/human_brain/lake.Rds")

lake.seurat <- as.Seurat(lake, data = "logcounts", counts = "normcounts")
lake.seurat <- SCTransform(lake.seurat)
lake.seurat <- RunPCA(lake.seurat)
lake.seurat <- RunUMAP(lake.seurat, dims = 1:20)
Idents(lake.seurat) <- colData(lake)$cell_type1

lake.pca  <- Embeddings(lake.seurat, "pca")
lake.umap <- Embeddings(lake.seurat, "umap")

sketch.size <- rev(seq(3000,300,by=-100))

filenames <- zeropad(3001 - sketch.size)

#UMAP plots

for(i in 1:length(sketch.size)){
    sketch.indices <- geosketch$gs(lake.pca[, 1:20], as.integer(sketch.size[i]))
    lake.red <- lake.seurat[, as.numeric(sketch.indices)]
    p <- DimPlot(lake.red) + labs(title = paste0(sketch.size[i], " cells sampled"))
    ggsave(filename = paste0("./lake/umap/", filenames[i], ".png"), plot = p, device = "png")
}

#cell type barplots

for(i in 1:length(sketch.size)){
    # sketch on the 2-D UMAP embedding for the cell type barplots
    sketch.indices <- geosketch$gs(lake.umap, as.integer(sketch.size[i]))
    lake.red <- lake.seurat[, as.numeric(sketch.indices)]
    png(file = paste0("./lake/barplot/", filenames[i], ".png"), width = 800, height = 300)
    par(mar = c(6, 4, 4, 2))
    barplot(table(Idents(lake.red)), las = 2,
            col = colorspace::qualitative_hcl(n = length(table(Idents(lake.red)))),
            border = NA, ylab = "# cells",
            main = paste0(sketch.size[i], " cells sampled"),
            ylim = c(0, max(table(Idents(lake.seurat)))))
    dev.off()
}
# this is done in the shell using ImageMagick

convert -delay 20 ./lake/umap/*.png -loop 0 ./lake/movie_umap.gif
convert -delay 20 ./lake/barplot/*.png -loop 0 ./lake/movie_barplot.gif

As for the rest of your answers, I appreciate them a lot and will definitely investigate further into how different sketch sizes translate into efficient/good integrations. I have access to a pretty large HPC cluster in Singapore, so I have no unreasonable cap on resources; I just want to get timely results so I know where and when to adjust my parameters.

Thanks a lot again!