syouligan closed this issue 4 years ago
Hi @syouligan , no problem! More than happy to help.
You're correct that in the paper we clustered on the (10-dimensional) PHATE embedding. However, the distances in the PHATE embedding approach the distances on the PHATE potential as the number of dimensions goes to infinity, and since we're no longer visualizing (but instead clustering), we can use the potential itself rather than its low-dimensional approximation.
This behaviour is undocumented in R, but you can access the potential by running
phate.out <- phateR::phate(data)
phate.potential <- phate.out$operator$diff_potential
and then, as in the paper, run k-means on this object. This is mathematically similar to spectral clustering, but with the additional benefits of PHATE over Laplacian Eigenmaps.
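A minimal sketch of this step in Python using scikit-learn's k-means (the `potential` matrix here is a random stand-in for the `diff_potential` object extracted above — substitute the real potential from your PHATE operator):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the PHATE potential (cells x features); in practice this is
# the diff_potential matrix taken from the fitted PHATE operator.
rng = np.random.default_rng(0)
potential = rng.normal(size=(200, 50))

# k-means directly on the potential; k=3 is chosen arbitrarily here
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(potential)
print(labels.shape)  # one cluster label per cell
```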
I'm not sure what will happen if you use more sophisticated clustering algorithms on it such as the community detection you've shown above, but it's likely to give similar results.
Hi @scottgigante
Great, thanks for this. One of the challenges I find with k-means is pre-defining the number of clusters, k. Do you use/recommend a Bayesian Information Criterion (BIC) heuristic or something similar to select k?
Thanks
Hi @syouligan , you're right that this is a tricky problem. We often use the silhouette score, but choosing k is a genuinely hard problem with no universal solution, and the right answer depends somewhat on your application.
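For what it's worth, a common way to apply the silhouette score is to sweep a range of k and keep the value that maximizes it. A sketch in Python with scikit-learn, on synthetic blob data standing in for the potential matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known group structure, standing in for the PHATE potential
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Silhouette score for each candidate k
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```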
Awesome, thanks mate. Are there any packages you can recommend that can calculate silhouette score on a sparse matrix?
To be honest, I typically do this kind of thing in Python. However, if memory is an issue you might want to avoid the silhouette score, as it requires computing pairwise distances. I'm not sure whether AIC/BIC would avoid this.
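One workaround if memory is the concern: scikit-learn's `silhouette_score` accepts a `sample_size` argument, which estimates the score on a random subsample instead of computing the full pairwise-distance matrix. A quick sketch (again on synthetic stand-in data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=2000, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# sample_size restricts the pairwise-distance computation to a random subsample,
# trading exactness for memory
score = silhouette_score(X, labels, sample_size=500, random_state=0)
print(score)
```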
Ok cool, thanks for your help.
Hi there
Sorry for spamming you with questions, just figuring out the best way to integrate this epic tool into our pipelines.
In the PHATE paper (Fig. 6) it appears that clustering was performed using the PHATE embeddings as input. Is this correct? If so, I was wondering how the number of embedding dimensions was chosen. The analysis I am considering is: