TutteInstitute / fast_hdbscan

A fast multi-core implementation of HDBSCAN for low dimensional Euclidean spaces
BSD 2-Clause "Simplified" License

The number of clusters is less than min_cluster_size. #14

Closed cccxg closed 12 months ago

cccxg commented 12 months ago

This is weird: the number of clusters returned is less than the value I passed as min_cluster_size.

Code:

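import numpy as np
import pandas as pd
from fast_hdbscan import fast_hdbscan

def load_dataset(path: str):
    # Read the CSV; the "Label" column holds the ground-truth class ids.
    df = pd.read_csv(path)
    labels = df["Label"].values
    # Labels are 0-based, so the class count is the largest id plus one.
    n_clusters = np.max(labels) + 1
    # Every remaining column is a feature.
    X = df.drop(columns="Label").values
    return X, labels, n_clusters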

X, tru_labels, nc = load_dataset("./Real-world/iris.csv")
print("the number of clusters is: ", nc)

labels = fast_hdbscan.HDBSCAN(min_cluster_size=nc).fit_predict(X)
print("pred labels are: ", labels)
print("true labels are: ", tru_labels)

Output:

the number of clusters is:  3
pred labels are:  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]
true labels are:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
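
To make the mismatch concrete, here is a minimal sketch that counts the distinct clusters in each label array. It assumes the arrays printed above (50 points labeled 1 and 100 labeled 0 in the prediction; 50 points per class in the ground truth) and assumes noise points, if any, are labeled -1 as in the usual HDBSCAN convention. The helper name `count_clusters` is hypothetical and not part of fast_hdbscan. Keep in mind that `min_cluster_size` is the minimum number of points a group needs in order to count as a cluster, not the number of clusters to find, so a result with fewer than `min_cluster_size` clusters is not by itself a sign of a bug.

```python
import numpy as np

def count_clusters(labels):
    """Count distinct cluster ids, ignoring points labeled -1 (assumed noise)."""
    labels = np.asarray(labels)
    return int(np.unique(labels[labels >= 0]).size)

# Label arrays reconstructed from the output printed above.
pred_labels = np.array([1] * 50 + [0] * 100)
true_labels = np.array([0] * 50 + [1] * 50 + [2] * 50)

n_pred = count_clusters(pred_labels)
n_true = count_clusters(true_labels)
print("predicted clusters:", n_pred)
print("true clusters:", n_true)
```

With these arrays the prediction contains two clusters against three true classes: the first 50 points form one cluster and the remaining 100 are merged into the other.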