chanzuckerberg / cellxgene

An interactive explorer for single-cell transcriptomics data
https://chanzuckerberg.github.io/cellxgene/
MIT License

Reclustering of summarized data #1207

Open ambrosejcarr opened 4 years ago

ambrosejcarr commented 4 years ago


Definitions:

Assumptions

Problem statement

Visualizing a subset of summarized data may make it appear disordered, because the subset has not been regrouped according to distance to the nearest cells. Users wish to maintain a visualization in which the most similar genes and the most similar observations sit close together. This is similar in principle to the needs expressed in #280.

Potential Solutions:

This problem is typically solved with hierarchical clustering. Hierarchical clustering can be decomposed into two steps: distance calculation, followed by agglomeration.

Distance calculation is very slow, while agglomeration is very fast. If users precompute distances between observations and between features (one matrix for each axis) and include that data in the h5ad file, the distance calculation step can be skipped.
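A minimal sketch of that split, assuming SciPy is available and that the precomputed matrix is square and symmetric. The function name `order_from_precomputed` and the toy data are illustrative only; in an h5ad file the matrix could plausibly live in a pairwise slot such as `obsp` or `varp`, but that storage choice is an assumption, not something decided in this issue.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist, squareform


def order_from_precomputed(dist_square, method="average"):
    """Return a display ordering from a precomputed square distance matrix.

    Only the agglomeration step runs here; the distances are taken as given.
    """
    # linkage() expects the condensed (upper-triangle) form of the matrix.
    condensed = squareform(dist_square, checks=False)
    tree = linkage(condensed, method=method)
    # Leaves of the dendrogram, left to right: similar items end up adjacent.
    return leaves_list(tree)


# Toy data: two well separated groups of observations, interleaved on purpose.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(3, 5))
group_b = rng.normal(5.0, 0.1, size=(3, 5))
X = np.vstack([group_a, group_b])[[0, 3, 1, 4, 2, 5]]

# The slow step, which would be done offline and shipped with the data.
obs_dist = squareform(pdist(X, metric="euclidean"))

# The fast step, which could run at view time on any subset.
order = order_from_precomputed(obs_dist)
print(order)
```

Subsetting then amounts to slicing the stored matrix with the selected indices (for example `obs_dist[np.ix_(idx, idx)]`) before calling the function, so no distances are recomputed.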

If users want to create new, arbitrary observation groupings that are not present in the metadata, the distances can be averaged across the observations in each group at lower computational cost than recomputing the distance matrix. See meandist.R for an example. A brief search suggests that most approaches scale linearly with the number of observations, so beyond some dataset size they can no longer be run interactively.
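One way the averaging could look, as a sketch only: it is not a port of meandist.R, and the function name `mean_group_distances` and the choice to zero the diagonal are assumptions made for illustration. The distance between two groups is taken as the mean of all pairwise distances between their members, read from the precomputed matrix.

```python
import numpy as np


def mean_group_distances(dist, labels):
    """Average a precomputed observation distance matrix over groups.

    dist   : (n, n) symmetric array of pairwise observation distances
    labels : length-n sequence assigning each observation to a group
    Returns the sorted group names and a (g, g) group distance matrix.
    """
    labels = np.asarray(labels)
    groups, codes = np.unique(labels, return_inverse=True)

    # One-hot membership matrix: onehot[i, k] == 1 if observation i is in group k.
    onehot = np.zeros((labels.size, groups.size))
    onehot[np.arange(labels.size), codes] = 1.0

    # Sum of distances for every pair of groups, then divide by the pair counts.
    sums = onehot.T @ dist @ onehot
    counts = onehot.sum(axis=0)
    mean = sums / np.outer(counts, counts)

    # Assumption: a group is at distance 0 from itself, as clustering code expects.
    np.fill_diagonal(mean, 0.0)
    return groups, mean


# Four observations on a line at positions 0, 1, 10 and 11.
positions = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(positions[:, None] - positions[None, :])

groups, group_dist = mean_group_distances(dist, ["a", "a", "b", "b"])
print(groups)
print(group_dist)
```

The resulting group-level matrix is small (one row per group), so it can be passed to the agglomeration step without touching the expression matrix again.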

Out of scope:

cc @colinmegill

Munfred commented 3 years ago

Hello, a contributing thought: one such distance space is the latent space of scVI (https://scvi-tools.org), an autoencoder model that maps each cell in the gene count matrix to a (typically) 10-dimensional real-valued vector. The latent space is therefore readily amenable to classical clustering techniques like k-means or DBSCAN, which work well with it and can be computed very quickly even for a 1M cell dataset.

Personally I think clustering would be cool, but it is not an urgent feature. I think everyone would prefer to have more DE methods to choose from before clustering becomes available. This is because clustering is typically done only once or a few times during initial exploration, usually by someone comfortable with programming and statistical methods, whereas DE is done again and again and is of great interest to biologists.