Reclustering of summarized data

Definitions:

Observations: typically a cell or group of cells
Features: typically a gene, protein, or group of genes or proteins.

Assumptions

Assume that cellxgene has implemented some form of data summary tool that maps observations and features into distance space. See #632 as an example.
Visualization can be any of: (observations x observations), (features x features), or (observations x features)
Assume users are able to sub-select a set of observations or features and wish to visualize the sub-selection.

Problem statement

Visualizing a subset of summarized data may make it appear disordered, because the selected observations or features have not been regrouped according to their distances to one another. Users wish to maintain a visualization where the most similar genes and the most similar observations are close together. This is similar in theory to the needs expressed in #280.

Potential Solutions:

This problem is typically solved with hierarchical clustering. Hierarchical clustering can be decomposed into two steps:

distance calculation
agglomeration

Distance calculation is very slow, while agglomeration is very fast. If users precompute the distances between observations and between features (one matrix for each axis) and store them in the h5ad file, the distance calculation can be skipped and only the agglomeration step needs to run on the sub-selection.
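The two-step split can be sketched as follows. This is a minimal illustration using SciPy rather than anything cellxgene implements: the helper name `order_subset`, the toy data, and the choice of average linkage with Euclidean distance are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist, squareform


def order_subset(full_dist, subset, method="average"):
    """Order a sub-selection using only a precomputed square distance matrix.

    Hypothetical helper for illustration; not part of cellxgene.
    """
    subset = np.asarray(subset)
    # Slice the stored distances; no distance is recomputed here.
    sub_dist = full_dist[np.ix_(subset, subset)]
    # linkage() expects the condensed (upper-triangle) form.
    tree = linkage(squareform(sub_dist), method=method)
    # leaves_list() gives the left-to-right leaf order of the dendrogram.
    return subset[leaves_list(tree)]


# Toy data: two well-separated groups stored in interleaved order.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(3, 5))
group_b = rng.normal(10.0, 0.1, size=(3, 5))
data = np.vstack([group_a, group_b])[[0, 3, 1, 4, 2, 5]]

# Slow step, done once ahead of time.
full_dist = squareform(pdist(data, metric="euclidean"))

# Fast step, repeated for every sub-selection.
order = order_subset(full_dist, [0, 1, 2, 3, 5])
print(order)
```

In the toy data the even-numbered rows come from one group and the odd-numbered rows from the other, so the returned order should place the even indices next to each other and the odd indices next to each other.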

If users want to make new, arbitrary observation groupings that are not present in the metadata, the distances can be averaged across the observations in each group at lower computational cost than recomputing the distance matrix. See meandist.R for an example. A brief search suggests that most approaches scale linearly with observations, so there may be a number of observations beyond which this can't be done interactively.
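One way to average precomputed distances over user-defined groups is sketched below. It is not a port of meandist.R, whose contents are not shown here; the function name `mean_group_distances` and the decision to zero the diagonal (so the result can be used directly as a distance matrix) are assumptions for illustration.

```python
import numpy as np


def mean_group_distances(dist, labels):
    """Average pairwise distances into a group-by-group matrix.

    Hypothetical helper: dist is a square observation-by-observation
    distance matrix, labels assigns each observation to a group.
    """
    labels = np.asarray(labels)
    groups = np.unique(labels)
    # Indicator matrix: one row per group, one column per observation.
    member = (labels[None, :] == groups[:, None]).astype(float)
    # Sum of distances between every pair of groups.
    sums = member @ dist @ member.T
    # Number of observation pairs contributing to each sum.
    counts = member.sum(axis=1)
    mean = sums / np.outer(counts, counts)
    # Assumption: treat each group as having zero distance to itself.
    np.fill_diagonal(mean, 0.0)
    return groups, mean


# Four observations on a line, grouped into "a" and "b".
positions = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(positions[:, None] - positions[None, :])
groups, group_dist = mean_group_distances(dist, ["a", "a", "b", "b"])
print(groups)
print(group_dist)
```

The resulting group-level matrix is small, so it can be passed to an agglomeration step without touching the original expression data again.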

Out of scope:

Biclustering, which orders data in the view based on both column distances and row distances at the same time. Most formulations of this are NP-complete. This issue considers only sequential ordering: row distances and then column distances, or vice versa.
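For contrast with biclustering, the sequential ordering considered here can be sketched as two independent hierarchical clusterings, one per axis. The function name `sequential_order`, the toy matrix, and the linkage settings are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage


def sequential_order(matrix, method="average", metric="euclidean"):
    """Order rows, then columns, each by its own hierarchical clustering."""
    # Rows are clustered using distances between row vectors.
    row_order = leaves_list(linkage(matrix, method=method, metric=metric))
    # Columns are clustered separately, using distances between column vectors.
    col_order = leaves_list(linkage(matrix.T, method=method, metric=metric))
    return row_order, col_order


# Toy observations x features matrix with two blocks, then shuffled.
block = np.array([
    [5.0, 5.0, 5.0, 0.0, 0.0, 0.0],
    [5.0, 5.0, 5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0, 5.0, 5.0],
    [0.0, 0.0, 0.0, 5.0, 5.0, 5.0],
])
matrix = block[[0, 2, 1, 3]][:, [3, 0, 4, 1, 5, 2]]

row_order, col_order = sequential_order(matrix)
# Reordered view in which similar rows and similar columns sit together.
ordered = matrix[np.ix_(row_order, col_order)]
print(ordered)
```

Each axis is ordered without reference to the other, which is what keeps this approach cheap compared with a joint optimization over rows and columns.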

cc @colinmegill

Hello, a contributing thought: one such distance space is the latent space of scVI (https://scvi-tools.org), an autoencoder model that maps each cell in the gene count matrix to a (typically) 10-dimensional real-valued vector. The latent space is therefore readily amenable to classical clustering techniques like k-means or DBSCAN, which work well with it and can be calculated very quickly even for a 1M-cell dataset. Personally, I think clustering would be cool, but it is not an urgent feature. I think everyone would prefer to have more DE methods to choose from before having clustering available. This is because clustering is typically done only once or a few times during initial exploration, usually by someone comfortable with programming and statistical methods, whereas DE is done again and again and is of great interest to biologists.
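As a rough illustration of clustering in a low-dimensional latent space, here is a sketch that runs k-means on a synthetic 10-dimensional array. The array `latent` is a made-up stand-in, not real scVI output, and no scVI API is called; SciPy's `kmeans2` is used only as one readily available k-means implementation.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# Stand-in for a latent space: 200 "cells" x 10 dimensions,
# drawn from two synthetic, well-separated populations.
latent = np.vstack([
    rng.normal(0.0, 0.1, size=(100, 10)),
    rng.normal(5.0, 0.1, size=(100, 10)),
])

# k-means with k-means++ initialization; returns (centroids, per-cell labels).
centroids, labels = kmeans2(latent, 2, minit="++")
print(np.bincount(labels))
```

With real data, the cluster labels could then be treated like any other categorical observation annotation.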
