Open ambrosejcarr opened 2 years ago
@bkmartinjr notes that small reembed queries of ~ 100k cells may already work within the interactive time barrier on the desktop version - determining whether we can limit to small queries may eliminate the need for precomputation.
I will pivot to a rough assessment of projected query sizes by exploring the current abundances of each node in the ontology tree (both types and "accumulator nodes") and extrapolating to their abundances given a larger data corpus.
I will examine our data and evaluate what nodes are logical targets for embedding or reembedding. If we would need to subsample, I will assume for now that the user goal is to learn the underlying manifold, and so subsampling that does not disrupt the manifold would be accepted by users.
@ambrosejcarr : Is this still relevant or can we close this?
@sidneymbell @niknak33 - per Kuni's question to @ambrosejcarr - is this still relevant or can it be closed?
@ambrosejcarr Idea:
Background
Most single cell datasets have high dimensionality. For scRNA-seq, each cell is an observation, and the measurements, or dimensions that characterize each cell are all the genes. To visualize the similarity of observations on a 2-dimensional screen, the data are projected using dimensionality reduction algorithms. The most commonly used algorithms for dimensionality reduction in single-cell analysis are tSNE and UMAP. Both algorithms project high dimensional data into 2 dimensions by favoring faithfully representing "close" distances but only roughly approximating "distant" cells.
Intuitively, this means that cells placed near one another can be assumed to be similar, but that assumptions cannot be made about the distances of cells placed far away. These characteristics work well for single cell workflows that attempt to find either discrete cell types or continuous developmental changes in expression, since both can be visually detected in a projection that maintains local distances.
However, another characteristic is that the projection is influenced, to an extent determined by selected hyper-parameters, by the following data characteristics:
These characteristics of low dimensional projection can obscure local structure in cases where cell types are unevenly distributed or are globally highly distinct.
Story
When a scientist is exploring a heterogenous dataset composed of several distinct types of cells, they want to isolate and re-project a subset of cells which appear similar, to determine whether there is interesting sub-structure (e.g. cell sub-types or continuous expression differences) that was being obscured by the global dataset.
Workflow
Challenge
Computing UMAP is a compute-intensive task where only some aspects can be easily parallelized. Can we make this fast enough to execute at interactive speed?
Exploration (WIP)