syouligan closed this issue 4 years ago
Hi @syouligan , no problem! More than happy to help.
You're correct that in the paper we clustered on the (10-dimensional) PHATE embedding. However, the distances in the PHATE embedding approach the distances on the PHATE potential as the number of dimensions goes to infinity, and since we're no longer visualizing (but instead clustering), we can use the potential itself rather than its low-dimensional approximation.
This behaviour is undocumented in R, but you can access the potential by running
phate.out <- phateR::phate(data)
phate.potential <- phate.out$operator$diff_potential
and then, as in the paper, run k-means on this object. This is mathematically similar to spectral clustering, but with the additional benefits of PHATE over Laplacian Eigenmaps.
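A minimal sketch of this step in Python using scikit-learn's k-means (the `potential` matrix here is a random stand-in for the `diff_potential` object extracted above — substitute the real potential from your PHATE operator):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the PHATE potential (cells x features); in practice this is
# the diff_potential matrix taken from the fitted PHATE operator.
rng = np.random.default_rng(0)
potential = rng.normal(size=(200, 50))

# k-means directly on the potential; k=3 is chosen arbitrarily here
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(potential)
print(labels.shape)  # one cluster label per cell
```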
I'm not sure what will happen if you use more sophisticated clustering algorithms on it such as the community detection you've shown above, but it's likely to give similar results.
Hi @scottgigante
Great, thanks for this. One of the challenges I find with k-means is pre-defining the number of clusters, k. Do you use/recommend a Bayesian Information Criterion (BIC) heuristic or something similar to select k?
Thanks
Hi @syouligan , you're right that this is a tricky problem. We often use the silhouette score, but choosing k is a genuinely hard problem with no universal solution, and the right answer depends somewhat on your application.
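For what it's worth, a common way to apply the silhouette score is to sweep a range of k and keep the value that maximizes it. A sketch in Python with scikit-learn, on synthetic blob data standing in for the potential matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known group structure, standing in for the PHATE potential
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Silhouette score for each candidate k
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```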
Awesome, thanks mate. Are there any packages you can recommend that can calculate silhouette score on a sparse matrix?
To be honest, I typically do this kind of thing in Python. However, if memory is an issue you might want to avoid the silhouette score, as it requires computing pairwise distances. I'm not sure whether AIC/BIC would avoid this.
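One workaround if memory is the concern: scikit-learn's `silhouette_score` accepts a `sample_size` argument, which estimates the score on a random subsample instead of computing the full pairwise-distance matrix. A quick sketch (again on synthetic stand-in data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=2000, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# sample_size restricts the pairwise-distance computation to a random subsample,
# trading exactness for memory
score = silhouette_score(X, labels, sample_size=500, random_state=0)
print(score)
```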
Ok cool, thanks for your help.
Hi there
Sorry for spamming you with questions, just figuring out the best way to integrate this epic tool into our pipelines.
In the PHATE paper (Fig. 6) it appears that clustering was performed using the PHATE embeddings as input. Is this correct? If so, I was wondering how the number of embedding dimensions was chosen. The analysis I am considering is: