gagolews / genieclust

Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection - in Python and R
https://genieclust.gagolewski.com

distance_threshold #77

Closed leonardlin closed 1 year ago

leonardlin commented 1 year ago

Hi, I've been testing genieclust as a replacement for sklearn.cluster.AgglomerativeClustering. I love that the API is so similar, almost a drop-in replacement for sklearn.cluster.*

I currently use AgglomerativeClustering because it supports clustering by distance_threshold.

Is there a way to achieve clustering by distance_threshold with genieclust? Ideally, I don't know the number of clusters ahead of time, but I do know that the members of each cluster should be similar to a certain degree.
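
For context, this is roughly what I do with sklearn today (the threshold value is only illustrative, and X stands for my data matrix):

from sklearn.cluster import AgglomerativeClustering

# n_clusters=None together with distance_threshold cuts the tree at a fixed
# cophenetic distance, so the number of clusters falls out automatically
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0.4)
labels = model.fit_predict(X)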

Thanks in advance for any help.

gagolews commented 1 year ago

Yes, you can achieve that by passing compute_full_tree=True to the constructor.

This way, the whole "linkage matrix" can be generated, together with the corresponding distances.

See Dendrograms in https://genieclust.gagolewski.com/weave/basics.html and the description of children_ and distances_ in https://genieclust.gagolewski.com/genieclust_genie.html#genieclust.Genie

Some code as a starting point for your experiments:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import clustbench

# Load example data 
data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
benchmark = clustbench.load_dataset("wut", "x2", url=data_url)
X = benchmark.data

# Fit with compute_full_tree=True
import genieclust
g = genieclust.Genie(compute_full_tree=True)
g.fit(X)

# Plot the dendrogram
import scipy.cluster
linkage_matrix = np.column_stack([g.children_, g.distances_, g.counts_])
scipy.cluster.hierarchy.dendrogram(linkage_matrix,
    show_leaf_counts=False, no_labels=True)
plt.show()

# Index of the first merge whose distance exceeds a prescribed threshold:
w = np.argmax(g.distances_ > 0.3)
w

# The corresponding number of clusters:
k = X.shape[0]-w
k

# Determine the k-partition:
g.set_params(n_clusters=k)
c = g.fit_predict(X)

# Plot it:
genieclust.plots.plot_scatter(X, labels=c)
plt.show()
gagolews commented 1 year ago

Let me know if I can give you any more details/hints about the above.

leonardlin commented 1 year ago

@gagolews thanks for your feedback. I tried your approach but the resulting clusters didn't look sensible.

The resulting g.distances_ are all relatively low. I'm using the cosinesimil affinity.

I assume that g.distances_ refers to the distances between clusters on a scale from 0 to 1. It's not exactly the same as distance_threshold in AgglomerativeClustering, but I guess it's similar.
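
In case it helps, this is roughly how the distances can be inspected to pick a cut-off on the data's own scale rather than assuming 0..1 (just a sketch; g and X as in the snippet above, and the 0.95 quantile is an arbitrary choice):

import numpy as np
import matplotlib.pyplot as plt

# Look at the empirical distribution of the merge distances
# instead of assuming they lie on a fixed 0..1 scale
plt.hist(g.distances_, bins=50)
plt.xlabel("merge distance")
plt.show()

# Cut at a high quantile of the observed distances, then proceed as before:
thr = np.quantile(g.distances_, 0.95)
w = np.argmax(g.distances_ > thr)  # first merge above the cut-off
k = X.shape[0] - w                 # implied number of clusters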

I'll post here if I find something that works.

leonardlin commented 1 year ago

Quick update from my side, for documentation; the ticket can be closed afterwards.

I'm using this function: https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.community_detection to estimate the number of clusters first.

Then I use that cluster count to run Genie:

import sentence_transformers
import genieclust

# Determine the cluster count with community_detection
# (corpus_embeddings and threshold come from my pipeline)
clusters = sentence_transformers.util.community_detection(
    corpus_embeddings, min_community_size=1, threshold=threshold)
cluster_count = len(clusters)

# Clustering with Genie using that count
clustering_model = genieclust.Genie(n_clusters=cluster_count,
    affinity='cosinesimil', exact=False)
cluster_assignment = clustering_model.fit_predict(corpus_embeddings)

This still scales better than AgglomerativeClustering (with exact=False) and provides better clusters than community_detection itself.

community_detection has a very good performance/quality ratio. It's very useful for getting a preview or an estimate of the clusters.