Closed · leonardlin closed this issue 1 year ago
Yes, you can achieve that by passing compute_full_tree=True
to the constructor.
This way, the whole "linkage matrix" can be generated, together with the corresponding distances.
See Dendrograms in https://genieclust.gagolewski.com/weave/basics.html and the description of children_
and distances_
in https://genieclust.gagolewski.com/genieclust_genie.html#genieclust.Genie
Some code as a starting point for your experiments:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import clustbench
# Load example data
data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
benchmark = clustbench.load_dataset("wut", "x2", url=data_url)
X = benchmark.data
# Fit with compute_full_tree=True
import genieclust
g = genieclust.Genie(compute_full_tree=True)
g.fit(X)
# Plot the dendrogram
import scipy.cluster
linkage_matrix = np.column_stack([g.children_, g.distances_, g.counts_])
scipy.cluster.hierarchy.dendrogram(linkage_matrix,
    show_leaf_counts=False, no_labels=True)
plt.show()
# Iteration (merge step) with a prescribed distance:
w = np.argmax(g.distances_ > 0.3)
w
# How many clusters?
k = X.shape[0]-w
k
# Determine the k-partition:
g.set_params(n_clusters=k)
c = g.fit_predict(X)
# Plot it:
genieclust.plots.plot_scatter(X, labels=c)
plt.show()
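One caveat with the threshold step above: np.argmax(g.distances_ > 0.3) returns 0 when no merge distance exceeds the threshold, which would silently request n clusters (all singletons) instead of one. A small guard helps; here is a minimal pure-NumPy sketch of the threshold-to-k conversion (the helper name is mine, not part of genieclust):

```python
import numpy as np

def n_clusters_for_threshold(distances, n_samples, threshold):
    """Map a merge-distance threshold to a cluster count.

    `distances` is the length-(n_samples - 1) array of merge distances
    (e.g. g.distances_ from a model fitted with compute_full_tree=True);
    we stop merging at the first merge whose distance exceeds `threshold`.
    """
    above = np.nonzero(distances > threshold)[0]
    if len(above) == 0:
        return 1  # every merge is below the threshold: a single cluster
    return n_samples - above[0]

# Example with synthetic, increasing merge distances (n_samples = 6):
d = np.array([0.1, 0.2, 0.25, 0.4, 0.9])
print(n_clusters_for_threshold(d, 6, 0.3))   # -> 3 (stop before the 0.4 merge)
print(n_clusters_for_threshold(d, 6, 1.0))   # -> 1 (no merge exceeds 1.0)
```

The resulting k can then be passed to g.set_params(n_clusters=k) exactly as in the snippet above.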
Let me know if I can give you any more details/hints about the above.
@gagolews thanks for your feedback. I tried your approach, but the resulting clusters didn't look sensible:
the resulting g.distances_ values are all relatively low. I'm using affinity cosinesimil.
I assume that g.distances_ refers to the distances between clusters, on a scale between 0 and 1. It's not exactly the same as distance_threshold in AgglomerativeClustering, but I guess it is similar.
I'll post here if I find something that works.
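A side note on the distance scale mentioned above: cosine distance between arbitrary unit vectors ranges over [0, 2], not [0, 1]. On L2-normalized embeddings, the squared Euclidean distance equals exactly 2 · (1 − cosine similarity), so one possible workaround (an assumption on my part, not something confirmed in this thread) is to normalize the embeddings and cluster with the default Euclidean metric instead:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 16))

# L2-normalize each row so Euclidean distance is monotone in cosine distance
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

a, b = emb[0], emb[1]
cos_sim = a @ b
euc = np.linalg.norm(a - b)

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
assert np.isclose(euc**2, 2 * (1 - cos_sim))
```

Because the mapping is monotone, any clustering driven by distance order (including a merge-distance threshold, rescaled accordingly) behaves the same on normalized data under either metric.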
A quick update from my side, for documentation; the ticket can be closed afterwards.
I'm using this function: https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.community_detection to estimate the number of clusters first.
Then I use that cluster count to run Genie.
import sentence_transformers
import genieclust

# determine the cluster count with a fast community-detection pass
clusters = sentence_transformers.util.community_detection(
    corpus_embeddings, min_community_size=1, threshold=threshold)
cluster_count = len(clusters)

# clustering
clustering_model = genieclust.Genie(
    n_clusters=cluster_count, affinity='cosinesimil', exact=False)
cluster_assignment = clustering_model.fit_predict(corpus_embeddings)
This still scales better than AgglomerativeClustering (with exact=False) and produces better clusters than community_detection itself.
community_detection has a very good performance/quality ratio; it is very useful for getting a preview or an estimate of the number of clusters.
Hi, I've been testing genieclust as a replacement for sklearn.cluster.AgglomerativeClustering. I love that the API is so similar, almost a drop-in replacement for sklearn.cluster.*
I currently use AgglomerativeClustering because
Is there a way to achieve clustering by distance_threshold with genieclust? Ideally, I don't know the number of clusters ahead of time, but I know that each cluster should be similar to a certain degree.
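For context, the sklearn pattern referred to here: with n_clusters=None and a distance_threshold, AgglomerativeClustering cuts the dendrogram wherever a merge would exceed that distance, and the number of clusters falls out of the cut. A minimal sketch with synthetic blobs (the data and threshold are illustrative, not from this thread):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# two tight, well-separated blobs in 2D
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])

# n_clusters=None + distance_threshold: the cluster count is inferred
# from where the merge tree is cut, not specified up front
model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0)
labels = model.fit_predict(X)
print(model.n_clusters_)  # -> 2 for these well-separated blobs
```

The answer at the top of this thread shows how to emulate this with genieclust via compute_full_tree=True and the distances_ attribute.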
Thanks in advance for any help.