amnh / PCG

𝙋𝙝𝙮𝙡𝙤𝙜𝙚𝙣𝙚𝙩𝙞𝙘 𝘾𝙤𝙢𝙥𝙤𝙣𝙚𝙣𝙩 𝙂𝙧𝙖𝙥𝙝 ⸺ Haskell program and libraries for general phylogenetic graph search
28 stars 1 forks source link

Write a k-medians clustering algorithm #142

Open Boarders opened 5 years ago

Boarders commented 5 years ago

Currently our clustering pass uses the algorithms in AI.Clustering.Hierarchical. As it stands there is no easy way to make use of the k-means algorithm within this library.

We should write a k-medians algorithm which is linear time and has good convergence properties (it should also expose a good amount of parallelism). This will entail writing a k-medians function for each of our different character types i.e. functions that looks something like:

kMedian :: (Foldable t, Traversable t) => t CharacterType -> CharacterType

In the case of a discrete character this will function will look like an inclusion-exclusion formula and we should be able to write similar things for other character types.

This would then be added as a Clustering option (currently asking for median clustering gives an error explaining that this is not yet implemented).

Boarders commented 5 years ago

This commit gives a version of K-medians using essentially Lloyds's algorithm for refinement and RGC (centroids of random subsamples) for initialisation. It is abstract enough that it would work as both an implementation of k-medians or k-means. This also exposes some parallelism in how assignment is done for each cluster, but finding the correct way to chunk the input will require experimentation.

Still left on this issue is to define a k-medians function on CharacterSequences.