[WIP, Exploration] Determine whether interactive-speed subset/embed workflows in cellxgene might be feasible by clarifying if components of the UMAP algorithm can be precomputed

ambrosejcarr commented 2 years ago

Background

Most single cell datasets have high dimensionality. For scRNA-seq, each cell is an observation, and the measurements, or dimensions that characterize each cell are all the genes. To visualize the similarity of observations on a 2-dimensional screen, the data are projected using dimensionality reduction algorithms. The most commonly used algorithms for dimensionality reduction in single-cell analysis are tSNE and UMAP. Both algorithms project high dimensional data into 2 dimensions by favoring faithfully representing "close" distances but only roughly approximating "distant" cells.

Intuitively, this means that cells placed near one another can be assumed to be similar, but that assumptions cannot be made about the distances of cells placed far away. These characteristics work well for single cell workflows that attempt to find either discrete cell types or continuous developmental changes in expression, since both can be visually detected in a projection that maintains local distances.

However, another characteristic is that the projection is influenced, to an extent determined by selected hyper-parameters, by the following data characteristics:

The number of cells in each "type". large populations may dominate projected space under some parameter selections.
The global distance matrix. If there are a number of very distinct populations, sub-structure within populations may be lost because the algorithm balances displaying sub-structure with representing global distances.

These characteristics of low dimensional projection can obscure local structure in cases where cell types are unevenly distributed or are globally highly distinct.

Story

When a scientist is exploring a heterogenous dataset composed of several distinct types of cells, they want to isolate and re-project a subset of cells which appear similar, to determine whether there is interesting sub-structure (e.g. cell sub-types or continuous expression differences) that was being obscured by the global dataset.

Workflow

A scientist selects a subset of observations from the full dataset
A scientist clicks "re-project" which generates a new embedding of the subset

Challenge

Computing UMAP is a compute-intensive task where only some aspects can be easily parallelized. Can we make this fast enough to execute at interactive speed?

Exploration (WIP)

ambrosejcarr commented 2 years ago

@bkmartinjr notes that small reembed queries of ~ 100k cells may already work within the interactive time barrier on the desktop version - determining whether we can limit to small queries may eliminate the need for precomputation.

I will pivot to a rough assessment of projected query sizes by exploring the current abundances of each node in the ontology tree (both types and "accumulator nodes") and extrapolating to their abundances given a larger data corpus.

I will examine our data and evaluate what nodes are logical targets for embedding or reembedding. If we would need to subsample, I will assume for now that the user goal is to learn the underlying manifold, and so subsampling that does not disrupt the manifold would be accepted by users.

metakuni commented 8 months ago

@ambrosejcarr : Is this still relevant or can we close this?

brianraymor commented 2 months ago

@sidneymbell @niknak33 - per Kuni's question to @ambrosejcarr - is this still relevant or can it be closed?

atarashansky commented 2 months ago

@ambrosejcarr Idea:

build a precomputed knn search index from the full data and when subsetting, query the subset against that index
then use the coordinates from the subset of the full UMAP embedding as initial coordinates for stochastic gradient descent, which will decrease the time until convergence
for more speedups, consider using gpu-accelerated umap

chanzuckerberg / single-cell-explorer