brianhie / geosketch

Geometry-preserving random sampling
http://geosketch.csail.mit.edu
MIT License
75 stars 11 forks source link

Bug, Non-random output - same observation sampled repeatedly in subsequent calls to gs() with same sketch size #12

Closed dburkhardt closed 3 years ago

dburkhardt commented 3 years ago

Expected behavior During subsequent runs, I would expect to find different random seeds being used making it rare to see the same cell in multiple samples from adata.

Observed behavior For several sketch sizes, there is always one index of adata returned in subsequent calls to gs(). The index changes depending on the sketch size, but is always the same when calling gs() with a given sketch size. In this dataset, every sketch of size N=4 includes observation 25419. Similarly, every sketch of size N=10 includes observation 17479.

Steps to reproduce:

Load data

from geosketch import gs
import numpy as np
import scprep
import scanpy as sc
import os
download_path = os.path.expanduser("~/scratch")
download_path = os.path.join(download_path,'Wagner2018.processed.h5ad')
URL = 'https://ndownloader.figshare.com/files/25687247?private_link=f194ae7d6bcec9bd11a3'
scprep.io.download.download_url(URL, download_path)
adata = sc.read_h5ad(download_path)
sc.tl.pca(adata)

Draw samples of size four. Count number of times each index is observed

n_obs_subsample = 4

indices = []
for i in range(10):
    indices.append(gs(adata.obsm['X_pca'], n_obs_subsample, replace=False))

indices = np.hstack(indices)
indices, counts = np.unique(indices, return_counts=True)

returns

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  3,  1,  1,  1,  1,  2,  1,  1, 10])

This is true even when explicitly setting different seeds:

image

or

image
brianhie commented 3 years ago

This may be expected behavior, but I can take a closer look soon.

Geometric sketching will preferentially sample a cell if it is not “close” to any other cells, or in a less dense region. If there is an outlier cell very far from any other cells, it can be sampled rather robustly across multiple random seeds.

In general, it’s this deviation from perfect uniform sampling that we were trying to achieve with this algorithm.

dburkhardt commented 3 years ago

Thanks for the quick follow up! I can understand that it's not going to be the same probability as a uniform random, but it seems strange that there would always be a single index that appears once for a given sketch size. If it were expected, wouldn't the index that's repeated for N=4 also be present in N=10?

I would believe it's a weird pattern present in this dataset. This is the Cas9 knockout data from Wagner et al. (2018).

For example, here is the output of results where the sketch size is 20. I'm getting a lot of observations from that disconnected cluster in the upper right of the plot that I would like. I do acknowledge that this might be due to a higher than expect variance in that population, but I'd like to see more representation from the other 25 clusters in the dataset.

Clusters (from the publication) geosketch clusters

Sketches N=20 geosketch n20

To quantify it, over 10 rounds, on average 45% of the observations in the sketch come from the "hatching gland" sample.

Cluster name and the average number of times a cell from that cluster are seen in a sketch of size N=20

differentiating neurons :  0
differentiating neurons - rohon beard :  1.0
endoderm :  0
epidermal - foxi3a :  1.0
epidermal - otic placode :  1.0
epidermal - pfn1 :  1.25
epidermal anterior :  1.4444444444444444
germline :  0
hatching gland :  9.0 <---------- This is way more than I would expect
lateral line primordium :  0
mesoderm - blood island :  1.0
mesoderm - endothelial :  0
mesoderm - heart field :  0
mesoderm - lateral plate - cxcl12a :  0
mesoderm - pharyngeal arch :  0
mesoderm - pronephric duct :  0
neural -  floorplate :  0
neural - diencephalon  :  0
neural - hindbrain :  0
neural - midbrain :  1.0
neural - telencephalon :  0
neural crest :  0
neural crest - crestin :  0
notocord :  1.0
optic primordium :  0
periderm :  3.0
tailbud - PSM :  3.4
tailbud - spinal cord :  1.0
brianhie commented 3 years ago

With different sketch sizes, the "plaid covering" boundaries could be different leading to different cells being more upweighted. I would be interested in two things: trying a lot more samples (~1 million) at more sketch sizes, aside from 4 and 10.

FWIW, I can't reproduce this issue when subsampling 4 samples from a larger set of 1000 samples from a 10-dimensional, isotropic Gaussian. So the issue could be due to a pathology with this particular dataset.

Regarding the evenness of samples in that dataset, the visualizations actually make sense to me. I'm assuming that the 2-D coordinates are from a nonlinear embedding approach like PHATE, so there is probably some density distortion especially near the fringes of the plot. If you want a more "even" volume sample that comes across in a PHATE visualization, you could try geometric sketching while sampling from PHATE-space (say, the top 100 dimensions) rather than PCA-space. Also, a sketch size of 20 is a pretty small sample (smaller than anything we considered in the paper), and I would be interested to see how the cell type compositions change with sketch size.

dburkhardt commented 3 years ago

Interesting, I don't have this same bug of repeated indices for this data when I try to run geosketch on a 2 dimensional PHATE embedding.

For the application I have, I need to pick a number of reference or landmark points where the number of references is much smaller than the number of points in the data, so I need a method to select these that works well when the sample being drawn is small.

I tried binning on PHATE dimensions (you're right, that's the embedding here). I've tried PHATE run to a few different number of dimensions (PHATE uses MDS in a final step so it needs to be regenerated each time for each embedding).

I guess it should be unsurprising that the sketches look the most uniformly sampled on the PHATE plot when we use 2 components. It's encouraging though that this also produces more uniform sampling of the clusters, which were labelled using a kNN label projection scheme independently of PHATE.

2 PHATE components geosketch_on_phate-2 n20

10 PHATE components geosketch_on_phate-10 n20

100 PHATE components geosketch_on_phate-100 n20

brianhie commented 3 years ago

Very cool!! Uniformity area-wise at 2 components is a good sign, and it still looks good to me at 100 components. There should be some sketching paper that looks at nonlinear dimensionality reduction + geometric/covering approaches.