gagolews / genieclust

Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection - in Python and R
https://genieclust.gagolewski.com

genieclust is very sensitive to dimensionality of the input matrix #71

Open gokceneraslan opened 3 years ago

gokceneraslan commented 3 years ago

Hi,

I learned about this nice method recently and have been exploring it since then. I have a question about some weird results I am getting with high-dimensional gene expression datasets.

[image: UMAP view coloured by Genie labels computed on the full 2700 × 1000 matrix; most points fall into a single cluster]

Here I am clustering a matrix of size 2700 × 1000 using default parameters (e.g., Genie(n_clusters=3)) and visualizing the results in a 2D UMAP view. I think it's very clear that there are 3 big clusters in this dataset (which is also in concordance with known biology); however, Genie fails to cluster the dataset properly and assigns most points to a single cluster.
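For what it's worth, the imbalance is easy to verify by tabulating the label counts (a quick check; `labels` is produced by the code at the end of this post):

```python
import numpy as np

# Count how many points landed in each of the 3 requested clusters;
# on the 1000-dimensional input, a single cluster dominates.
print(np.unique(labels, return_counts=True))
```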

When I reduce the dimensionality to 50 using PCA, the results make more sense:

[image: UMAP view coloured by Genie labels computed on the 50-dimensional PCA representation; the 3 expected clusters are recovered]

So I was wondering whether there is a better way to set the parameters of the method so that the clustering results make sense regardless of the dimensionality of the input matrix. What do you think? Interestingly, the original dimensionality of gene expression datasets is around 20,000, which is roughly the number of protein-coding genes, so the initial 2700 × 1000 matrix has actually already been subsetted to 1,000 genes.
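To make the question concrete, here is the kind of parameter sweep I have in mind (just a sketch; `gini_threshold` is, as far as I understand, the method's main knob, and `X_hidim` comes from the code below):

```python
import numpy as np
import genieclust

# Sweep the Gini-index threshold (the default is 0.3) on the raw
# 1000-dimensional matrix: is there a setting that recovers the
# 3 clusters without a prior dimensionality reduction?
for gt in (0.1, 0.3, 0.5, 0.7, 1.0):
    g = genieclust.Genie(n_clusters=3, gini_threshold=gt)
    labels = g.fit_predict(X_hidim)
    print(gt, np.unique(labels, return_counts=True))
```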

In addition, here I am plotting the graph-based clustering result using the Louvain method (which is the default clustering method in our field):

[image: UMAP view coloured by Louvain cluster labels; three well-separated clusters]

This result looks very clear and is also consistent with the biology, so I think it'd be cool to use it as some sort of ground truth.

Code for reproducing the plots is given below. It requires the scanpy and louvain Python packages.

```python
#### Generate the input matrix ####
# pip install scanpy
import scanpy as sc

sc.set_figure_params(dpi=100)

ad = sc.datasets.pbmc3k()
sc.pp.filter_genes(ad, min_cells=10)
ad.layers['counts'] = ad.X.copy()
sc.pp.normalize_total(ad, target_sum=10000)
sc.pp.log1p(ad)
sc.pp.highly_variable_genes(ad, n_top_genes=1000, flavor='seurat_v3',
                            subset=True, layer='counts')
sc.pp.scale(ad, max_value=8)
sc.pp.pca(ad)
sc.pp.neighbors(ad)
sc.tl.umap(ad)

# pip install louvain
sc.tl.louvain(ad, resolution=0.2)

X_hidim = ad.X
X_lodim = ad.obsm['X_pca']

#### Clustering ####
import genieclust

g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(X_hidim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')

g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(X_lodim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')

sc.pl.umap(ad, color='louvain')
```
gagolews commented 3 years ago

Thanks! You've raised some interesting points!

I would say that Genie does a pretty good job of identifying the 3 blobs visible on that 2-dimensional projection of the input data set when it operates on the 50-dimensional projection generated by PCA. If these 3 blobs are what you are after, I'd recommend applying some dimensionality reduction technique (PCA, t-SNE, or whatever) prior to using Genie, and then performing the clustering in the projected space.

However, that 2-dimensional projection is just one of many (uncountably many, in fact) possible ones. There is rarely "the one and only / true clustering", especially when we deal with high-dimensional data. Of course, sometimes a "weird" result like the one above will turn out pretty much useless. On the other hand, it may also happen that what you've generated is a new, ground-breaking discovery. In other words, in terms of the original sample, are you sure that the clustering Genie has output doesn't reveal anything interesting about the "hidden" structure of this dataset?
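If you'd like to quantify how far that raw-space partition is from the Louvain one you treat as a reference, here is a quick sketch (using scikit-learn's adjusted Rand index, though any partition-similarity score will do):

```python
from sklearn.metrics import adjusted_rand_score
import genieclust

# Recompute Genie's labels on the raw 1000-dimensional matrix and
# measure their agreement with the Louvain partition; the adjusted
# Rand index is ~0.0 for unrelated partitions, 1.0 for identical ones.
labels_raw = genieclust.Genie(n_clusters=3).fit_predict(X_hidim)
print(adjusted_rand_score(ad.obs['louvain'], labels_raw))
```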

Keep in mind that Genie constructs a minimum spanning tree based on (by default) the Euclidean distances between each pair of points, so some data preprocessing (e.g., with PCA) is indeed quite frequently necessary. In particular, here is what the MST (computed in the original space w.r.t. the Euclidean distance) looks like when drawn on that particular 2D projection:

```python
mst = genieclust.internal.mst_from_distance(X_hidim)
genieclust.plots.plot_segments(mst[1], ad.obsm["X_umap"])
```

[image: the Euclidean MST of the 1000-dimensional data drawn over the 2D UMAP projection]

It may be the case that these points are not that well separable in the 1000-dimensional space after all; hence, Genie fails to separate these 3 blobs from each other. It just works with what it is fed.
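One way to check whether that is plausible (a rough diagnostic, not part of genieclust) is to compare how spread out the pairwise Euclidean distances are in the two spaces, reusing `X_hidim` and `X_lodim` from your snippet:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Relative spread of the pairwise Euclidean distances: values close
# to 0 mean all points are roughly equidistant ("distance
# concentration"), which makes distance-based separation hard.
for name, X in [("1000-dim", np.asarray(X_hidim)), ("50-dim PCA", X_lodim)]:
    d = pdist(X)
    print(name, (d.max() - d.min()) / d.mean())
```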

However, we can choose a different metric - cosine similarity:

```python
g = genieclust.Genie(n_clusters=3, affinity='cosine')
labels = g.fit_predict(X_hidim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')
```

which yields

[image: UMAP view coloured by Genie labels with affinity='cosine' on the 1000-dimensional matrix]

Also, we can cluster w.r.t. the cosine similarity with noise-point detection turned on, this time asking for 4 clusters.
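Spelled out as a runnable snippet (reusing `ad` and `X_hidim` from the code above):

```python
# M=25 turns on noise-point detection (distances are replaced by
# mutual reachability distances, as in HDBSCAN*); postprocess="all"
# then assigns every detected noise point to its nearest cluster.
g = genieclust.Genie(n_clusters=4, affinity='cosine', M=25,
                     postprocess="all")
labels = g.fit_predict(X_hidim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')
```

which gives: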

[image: UMAP view coloured by the 4-cluster cosine result with noise-point detection]

To sum up, Genie is a stand-alone method that assumes it is being run on a somehow pre-"carved" dataset. If it is applied to raw data, the results can be "unrefined" as well (or "novel", if by any chance we have just made a new discovery). The remedy can be something as easy as the code block below. (The Louvain algorithm does something similar under the hood: it, too, is applied on a transformed distance matrix.)

```python
from sklearn.decomposition import PCA

g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(PCA(20).fit_transform(X_hidim))  # <----
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')
```

[image: UMAP view coloured by Genie labels computed on a 20-dimensional PCA projection]