gagolews / genieclust

Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection - in Python and R
https://genieclust.gagolewski.com

`ValueError: k >= n` when k = 2 and n = 3 (exact=False) #80

Closed sergeyf closed 1 year ago

sergeyf commented 1 year ago

Hello,

Thanks for the great package! Here is an example where fitting fails with exact=False: there are enough samples (n = 3) for the requested number of clusters (k = 2), but the model complains that there are not. The same code works fine with exact=True.

import numpy as np
import genieclust

X = np.zeros((3, 768))
k = 2
g = genieclust.Genie(n_clusters=k, gini_threshold=0.01, exact=False)
labels = g.fit_predict(X)

Error trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[328], line 7
      5 k = 2
      6 g = genieclust.Genie(n_clusters=k, gini_threshold=0.01, exact=False)
----> 7 labels = g.fit_predict(X)

File .../lib/python3.8/site-packages/genieclust/genie.py:548, in GenieBase.fit_predict(self, X, y)
    520 def fit_predict(self, X, y=None):
    521     """
    522     Perform cluster analysis of a dataset and return the predicted labels.
    523 
   (...)
    546 
    547     """
--> 548     self.fit(X)
    549     return self.labels_

File .../lib/python3.8/site-packages/genieclust/genie.py:1051, in Genie.fit(self, X, y)
    972 """
    973 Perform cluster analysis of a dataset.
    974 
   (...)
   1047 
   1048 """
   1049 cur_state = self._check_params()  # re-check, they might have changed
-> 1051 cur_state = self._get_mst(X, cur_state)
   1053 if cur_state["verbose"]:
   1054     print("[genieclust] Determining clusters with Genie++.", file=sys.stderr)

File .../lib/python3.8/site-packages/genieclust/genie.py:511, in GenieBase._get_mst(self, X, cur_state)
    509     cur_state = self._get_mst_exact(X, cur_state)
    510 else:
--> 511     cur_state = self._get_mst_approx(X, cur_state)
    513 # this might be an "intrinsic" dimensionality:
    514 self.n_features_  = cur_state["n_features"]

File .../lib/python3.8/site-packages/genieclust/genie.py:484, in GenieBase._get_mst_approx(self, X, cur_state)
    480     d_core = internal.get_d_core(nn_dist, nn_ind, cur_state["M"])
    483 if mst_dist is None or mst_ind is None:
--> 484     mst_dist, mst_ind = internal.mst_from_nn(
    485         nn_dist,
    486         nn_ind,
    487         d_core,
    488         stop_disconnected=False,
    489         verbose=cur_state["verbose"])
    490     # We can have a forest here...
    492 self.n_samples_   = n_samples

File .../lib/python3.8/site-packages/genieclust/internal.pyx:294, in genieclust.internal.__pyx_fuse_0mst_from_nn()

File .../lib/python3.8/site-packages/genieclust/internal.pyx:381, in genieclust.internal.mst_from_nn()

ValueError: k >= n

gagolews commented 1 year ago

Thanks for the report, the fix is on the way.
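
Until a fixed release is available, one possible workaround is to retry in exact mode whenever the approximate path raises `ValueError`. This relies only on the observation in the report that the same input works with exact=True. The sketch below is not part of genieclust: `fit_predict_with_fallback` and `make_genie` are hypothetical helper names introduced here for illustration, and the only genieclust call used is the `Genie(n_clusters=..., gini_threshold=..., exact=...)` constructor from the report.

```python
def fit_predict_with_fallback(X, make_model):
    """Try approximate mode first; retry in exact mode on ValueError.

    make_model(exact=...) is a caller-supplied factory (hypothetical,
    not a genieclust API) returning an object with fit_predict(X).
    """
    try:
        return make_model(exact=False).fit_predict(X)
    except ValueError:
        # Per the report, exact=True succeeds on the same input.
        return make_model(exact=True).fit_predict(X)


def make_genie(exact, n_clusters=2, gini_threshold=0.01):
    """Build a Genie model with the parameters used in the report."""
    # Imported lazily so the helper above does not require genieclust.
    import genieclust
    return genieclust.Genie(
        n_clusters=n_clusters, gini_threshold=gini_threshold, exact=exact
    )
```

With the data from the report, this would be called as `labels = fit_predict_with_fallback(np.zeros((3, 768)), make_genie)`. Note that catching `ValueError` this broadly also retries on unrelated input errors, so it is best treated as a temporary measure.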