Distance/dissimilarity measures: extension

antagomir commented 1 year ago

The bluster package currently relies on stats::dist for distance calculations in the clustering process.

A limitation of this is that stats::dist covers only a relatively small set of dissimilarity indices. For instance, it lacks many dissimilarity indices that are commonly used in ecological analyses and are available through, e.g., vegan::vegdist. Extending the set of available dissimilarity indices would help the bluster package support other applications of the SummarizedExperiment family, for instance the microbiome research that we are working on. Giving users direct access to dissimilarity indices that are already implemented elsewhere would spare them from computing distances outside of bluster.

Suggested solution:

Add support for vegan::vegdist in the bluster package
Implement this so that the user could define the distance function as a function argument. This way one could avoid adding new dependencies in the bluster package.

This would concern multiple functions.

The process would then look, for instance in the context of clusterRows and hierarchical clustering, something like:

clusterRows(sce, distfun=stats::dist, HclustParam(metric="euclidean"))

clusterRows(sce, distfun=vegan::vegdist, HclustParam(metric="bray"))

etc.
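
To make the idea concrete, here is a minimal base-R sketch of the pattern, assuming the distance function accepts a matrix plus a `method` argument and returns a `dist` object (true for both stats::dist and vegan::vegdist). The wrapper name `cluster_with_distfun` and its arguments are hypothetical and are not part of bluster.

```r
# Hypothetical wrapper (not part of bluster): the distance function is an argument.
cluster_with_distfun <- function(x, distfun = stats::dist,
                                 metric = "euclidean", k = 2) {
  d <- distfun(x, method = metric)             # expected to return a 'dist' object
  tree <- stats::hclust(d, method = "average") # average-linkage hierarchical clustering
  stats::cutree(tree, k = k)                   # one integer label per row
}

# Toy count matrix: rows are samples, columns are features.
counts <- rbind(a1 = c(10, 0, 1), a2 = c(9, 1, 0),
                b1 = c(0, 8, 7),  b2 = c(1, 9, 8))

# Default: Euclidean distance from stats::dist.
euclid_labels <- cluster_with_distfun(counts)

# Bray-Curtis from vegan, only if that package happens to be installed.
if (requireNamespace("vegan", quietly = TRUE)) {
  bray_labels <- cluster_with_distfun(counts, distfun = vegan::vegdist,
                                      metric = "bray")
}
```

Because the function is passed in by the caller, the package doing the clustering would not need vegan as a dependency.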

LTLA commented 1 year ago

Seems reasonable, though the distance function would be a parameter of HclustParam, not clusterRows, given that not all clustering methods would easily support custom distance calculations (e.g., k-means wouldn't care).

Happy to take a PR, if you can demonstrate an MVP with HclustParam.
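
A rough sketch of that design in plain S3, assuming the distance function is stored on the hierarchical-clustering parameter object while the k-means parameters carry no distance function at all. All names here (`hclust_param`, `kmeans_param`, `cluster_rows`, the `dist.fun` field) are hypothetical stand-ins for illustration, not bluster's actual classes or arguments.

```r
# Hypothetical parameter constructors; these are NOT bluster's real classes.
hclust_param <- function(dist.fun = stats::dist, metric = "euclidean", k = 2) {
  structure(list(dist.fun = dist.fun, metric = metric, k = k),
            class = "hclust_param")
}
kmeans_param <- function(centers = 2) {
  structure(list(centers = centers), class = "kmeans_param")
}

# The generic dispatches on the parameter object, not on the data.
cluster_rows <- function(x, param) UseMethod("cluster_rows", param)

# Hierarchical clustering uses the distance function stored in its parameters.
cluster_rows.hclust_param <- function(x, param) {
  d <- param$dist.fun(x, method = param$metric)
  stats::cutree(stats::hclust(d), k = param$k)
}

# k-means works on the raw matrix and never looks for a distance function.
cluster_rows.kmeans_param <- function(x, param) {
  stats::kmeans(x, centers = param$centers)$cluster
}

counts <- rbind(a1 = c(10, 0, 1), a2 = c(9, 1, 0),
                b1 = c(0, 8, 7),  b2 = c(1, 9, 8))

# A user-supplied dissimilarity: 1 - Pearson correlation between rows.
cor_dist <- function(x, ...) stats::as.dist(1 - stats::cor(t(x)))

h_default <- cluster_rows(counts, hclust_param())
h_custom  <- cluster_rows(counts, hclust_param(dist.fun = cor_dist))

set.seed(1)  # k-means initialisation is random
km <- cluster_rows(counts, kmeans_param(centers = 2))
```

Keeping the function on the parameter object means only the methods that actually compute distances expose the option.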

antagomir commented 1 year ago

Great. We will have a look and see how it goes.

antagomir commented 1 year ago

Done. By @BananaCancer

LTLA / bluster

Distance/dissimilarity measures: extension #13