LTLA / bluster

Clone of the Bioconductor repository for the bluster package.
https://bioconductor.org/packages/devel/bioc/html/bluster.html

Distance/dissimilarity measures: extension #13

Closed (antagomir closed this issue 1 year ago)

antagomir commented 1 year ago

The bluster package currently relies on stats::dist for distance calculations during clustering.

The limitation is that stats::dist covers only a relatively small set of dissimilarity indices. For instance, it lacks many indices that are commonly used in ecological analyses and are available through, e.g., vegan::vegdist. Making more dissimilarity indices available would help the bluster package support other applications of the SummarizedExperiment family, such as the microbiome research that we are working on. Users would benefit from direct access to indices that are already implemented elsewhere.

Suggested solution:

Let the user supply the distance function as an argument. This would concern multiple functions.

In the context of clusterRows and hierarchical clustering, for instance, the call would then look something like:

clusterRows(sce, distfun=stats::dist, HclustParam(metric="euclidean"))

clusterRows(sce, distfun=vegan::vegdist, HclustParam(metric="bray"))

etc.

LTLA commented 1 year ago

Seems reasonable, though the distance function would be a parameter of HclustParam, not clusterRows, given that not all clustering methods would easily support custom distance calculations (e.g., k-means wouldn't care).

Happy to take a PR, if you can demonstrate an MVP with HclustParam.
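
To make the suggestion above concrete, here is a minimal base-R sketch of the idea that the distance function lives in the hierarchical clustering parameters rather than in the top-level call. It is not bluster's implementation: hclust_param, cluster_rows_sketch and bray_curtis are hypothetical names invented for this illustration, and the only assumption made about the distance function is that it accepts a matrix plus a method argument, as both stats::dist and vegan::vegdist do.

```r
# Hypothetical parameter bundle (NOT bluster's HclustParam): the distance
# function and metric travel together with the other hclust settings.
hclust_param <- function(dist.fun = stats::dist, metric = "euclidean",
                         linkage = "complete", k = 2) {
    stopifnot(is.function(dist.fun))
    list(dist.fun = dist.fun, metric = metric, linkage = linkage, k = k)
}

# Hypothetical stand-in for a clusterRows-style function: it only forwards
# the matrix and the parameter bundle, and knows nothing about distances.
cluster_rows_sketch <- function(x, param) {
    d <- param$dist.fun(x, method = param$metric)
    tree <- stats::hclust(stats::as.dist(d), method = param$linkage)
    factor(stats::cutree(tree, k = param$k))
}

# Hand-written Bray-Curtis dissimilarity so the sketch needs no extra
# packages. The 'method' argument is accepted only to match the calling
# convention above and is otherwise ignored.
bray_curtis <- function(x, method = "bray") {
    x <- as.matrix(x)
    n <- nrow(x)
    d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
    for (i in seq_len(n)) {
        for (j in seq_len(n)) {
            d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
        }
    }
    stats::as.dist(d)
}

# Toy count matrix with two obvious groups of rows.
counts <- rbind(
    a = c(10, 8, 1, 0),
    b = c(9, 9, 0, 1),
    c = c(11, 7, 1, 1),
    d = c(0, 1, 12, 9),
    e = c(1, 0, 10, 11),
    f = c(1, 1, 9, 10)
)

euclid_clusters <- cluster_rows_sketch(counts, hclust_param())
bray_clusters <- cluster_rows_sketch(
    counts, hclust_param(dist.fun = bray_curtis, metric = "bray")
)
```

If vegan is installed, vegan::vegdist could be passed as dist.fun in place of the hand-written bray_curtis. A method such as k-means would simply have no dist.fun in its parameter bundle, which is the reason for attaching the argument to the parameter object.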

antagomir commented 1 year ago

Great. We will have a look and see how it goes.

antagomir commented 1 year ago

Done. By @BananaCancer