gagolews / genieclust

Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection - in Python and R
https://genieclust.gagolewski.com

providing initial cluster counts #78

Open tritolol opened 1 year ago

tritolol commented 1 year ago

Hi, thanks for this awesome package.

Is there a way to provide initial counts values before starting to fit? I know that normally in agglomerative clustering, each observation is considered as a cluster and therefore each count value starts at 1. However, I would like to experiment with different initial distributions. Is this possible with your current implementation?

gagolews commented 1 year ago

Yeah-nah, I have kind of done this in my recent paper, Clustering with minimum spanning trees: How good can it be? (https://arxiv.org/abs/2303.05679), using some code from genieclust, but with custom modifications.

There is now also an implementation of the GIc algorithm available (https://genieclust.gagolewski.com/genieclust_gic.html); it starts from the intersection of the partitions returned by Genie with different thresholds. A similar idea, but it still does not allow an arbitrary starting partition.
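For intuition, the "intersection of partitions" that such an approach starts from can be sketched in a few lines of plain Python (a minimal illustration, not the package's actual code; `partition_intersection` is a hypothetical name): two points share a cluster in the intersection if and only if they share a cluster in every input partition.

```python
def partition_intersection(labels_a, labels_b):
    # Two points end up in the same cluster of the intersection
    # iff they have the same label in *both* input partitions.
    mapping = {}   # (label_a, label_b) pair -> new cluster id
    out = []
    for pair in zip(labels_a, labels_b):
        if pair not in mapping:
            mapping[pair] = len(mapping)
        out.append(mapping[pair])
    return out

# e.g. two partitions of 4 points:
# partition_intersection([0, 0, 1, 1], [0, 1, 1, 1]) gives [0, 1, 2, 2]
```

The intersection is always a refinement of both inputs, so it can serve as a finer-grained starting partition for further merging.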

Anyway, there is no public API in the package that would allow this; the only way is to modify src/c_genie.h, but good luck with that.

For your own experiments, I think it would be easier to implement Genie from scratch in Python (the algorithm is quite simple if you do not care about its time complexity) and then modify it accordingly. All the necessary bits are available: genieclust.inequity.gini_index, genieclust.internal.DisjointSets or genieclust.internal.GiniDisjointSets, and genieclust.internal.mst_from_distance.
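To make the suggestion concrete, here is a from-scratch sketch of the Genie idea in pure Python, with the exact modification discussed in this thread: an `initial_counts` parameter replacing the usual all-ones starting sizes. This is my own simplified illustration, not the package's optimized implementation; `gini_index`, `mst_edges`, `genie`, and `initial_counts` are illustrative names, not genieclust API, and the quadratic-time Prim/scan loops deliberately ignore complexity.

```python
from math import dist

def gini_index(counts):
    # Normalized Gini index of positive cluster sizes:
    # 0 for perfectly balanced sizes, close to 1 for extreme inequality.
    c = sorted(counts, reverse=True)
    n = len(c)
    if n <= 1:
        return 0.0
    num = sum((n + 1 - 2 * (i + 1)) * ci for i, ci in enumerate(c))
    return num / ((n - 1) * sum(c))

def mst_edges(points):
    # Prim's algorithm on the complete Euclidean graph;
    # returns the n-1 MST edges as (weight, i, j), sorted by weight.
    n = len(points)
    in_tree = [False] * n
    best_d = [float("inf")] * n
    best_from = [0] * n
    in_tree[0] = True
    for j in range(1, n):
        best_d[j] = dist(points[0], points[j])
    edges = []
    for _ in range(n - 1):
        v = min((j for j in range(n) if not in_tree[j]),
                key=lambda j: best_d[j])
        edges.append((best_d[v], best_from[v], v))
        in_tree[v] = True
        for j in range(n):
            if not in_tree[j]:
                d = dist(points[v], points[j])
                if d < best_d[j]:
                    best_d[j] = d
                    best_from[j] = v
    return sorted(edges)

def genie(points, n_clusters, gini_threshold=0.3, initial_counts=None):
    n = len(points)
    parent = list(range(n))
    # The modification discussed in this thread:
    # cluster sizes need not all start at 1.
    count = list(initial_counts) if initial_counts else [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = mst_edges(points)
    used = [False] * len(edges)
    k = n
    while k > n_clusters:
        roots = {find(i) for i in range(n)}
        chosen = None
        if gini_index([count[r] for r in roots]) > gini_threshold:
            # Genie correction: merge the cheapest MST edge incident
            # to a cluster of currently minimal size.
            m = min(count[r] for r in roots)
            for idx, (w, a, b) in enumerate(edges):
                if used[idx]:
                    continue
                ra, rb = find(a), find(b)
                if ra == rb:
                    used[idx] = True
                    continue
                if count[ra] == m or count[rb] == m:
                    chosen = idx
                    break
        if chosen is None:
            # Otherwise: a standard single-linkage step (cheapest edge).
            for idx, (w, a, b) in enumerate(edges):
                if not used[idx] and find(a) != find(b):
                    chosen = idx
                    break
        used[chosen] = True
        _, a, b = edges[chosen]
        ra, rb = find(a), find(b)
        parent[ra] = rb
        count[rb] += count[ra]
        k -= 1
    return [find(i) for i in range(n)]
```

Swapping in the package's building blocks (the MST from genieclust.internal.mst_from_distance, the Gini index from genieclust.inequity.gini_index, a GiniDisjointSets for the merging state) would give the same structure with far better performance.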

Let me know how motivated/confident/adventurous you feel about the above; maybe I can help with something.

tritolol commented 1 year ago

Thanks for this helpful response.

I think I have what I need now by using this (dumb) workaround: I simply repeat each row of my feature matrix according to an integer vector of desired initial counts. Then I run Genie on this expanded matrix instead and ignore the zero-height merges at the bottom of the linkage tree.

Here is my code using PyTorch:

```python
import torch
import genieclust

embeddings = ...    # (n, d) feature matrix
distribution = ...  # 1-D tensor of positive ints, length n
# repeat row i of the feature matrix distribution[i] times
emb_repeated = torch.repeat_interleave(embeddings, distribution, dim=0)
g = genieclust.Genie(compute_full_tree=True).fit(emb_repeated)
```
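For reference, the same row-repetition trick needs no PyTorch: NumPy's `np.repeat(embeddings, distribution, axis=0)` does it, and so does plain Python (sketched below; `repeat_rows` is a hypothetical helper name). The duplicated rows sit at distance zero from each other, which is why they are merged first and their merges can be skipped when reading the linkage tree.

```python
def repeat_rows(rows, counts):
    # Plain-Python equivalent of torch.repeat_interleave(..., dim=0):
    # row i appears counts[i] times in the output.
    out = []
    for row, c in zip(rows, counts):
        out.extend([row] * c)
    return out

# e.g. repeat_rows([[1, 2], [3, 4]], [2, 1])
# gives [[1, 2], [1, 2], [3, 4]]
```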

Let me know if this makes sense.