TutteInstitute / fast_hdbscan

A fast multi-core implementation of HDBSCAN for low dimensional Euclidean spaces
BSD 2-Clause "Simplified" License

Add HDBSCAN*(BC) implementation. #21

Closed nsakr closed 2 weeks ago

nsakr commented 3 weeks ago

This PR implements the semi-supervised HDBSCAN*(BC) algorithm published in this paper: Castro Gertrudes, J., Zimek, A., Sander, J. et al. A unified view of density-based methods for semi-supervised clustering and classification. Data Min Knowl Disc 33, 1894–1952 (2019).

It can be used as follows:

model = HDBSCAN(semi_supervised=True, ss_algorithm="bc").fit(X, partial_labels)

semi_supervised: False by default. Set to True if you have some labels and want to use fast_hdbscan in semi-supervised mode.

ss_algorithm: None by default. Can be set to "bc", the HDBSCAN*(BC) algorithm, or to "bc_without_vc", short for "HDBSCAN*(BC) without virtual nodes". This lets the user choose whether virtual nodes are considered. Virtual nodes denote pre-labeled singleton noise objects; see Fig. 6 in the paper.

partial_labels in this example is simply an array with one entry per data point. Unlabelled points should be set to -1.
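As a minimal sketch of that convention (the toy label values and the `known` indices below are illustrative assumptions, not taken from the PR), a partial label array can be built by starting from all -1 and revealing a few known labels:

```python
import numpy as np

# Toy ground-truth labels for six points (illustrative values only).
true_labels = np.array([0, 0, 1, 1, 2, 2])

# Start with every point unlabelled (-1), then reveal labels for two points.
partial_labels = np.full(true_labels.shape, -1)
known = np.array([0, 3])  # hypothetical indices of the labelled points
partial_labels[known] = true_labels[known]

# Count how many points carry a label.
n_labelled = int(np.sum(partial_labels != -1))
print(partial_labels, n_labelled)
```

Only the entries at the `known` indices carry a class label; every other entry stays at -1.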

Here's a basic example with the Iris data:

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import random
import numpy as np

from fast_hdbscan import HDBSCAN

iris = datasets.load_iris()

# Reveal the true label for only a fraction of the points; the rest are set to -1.
def sample_labels(labels, frac=0.05, seed=1):
    random.seed(seed)
    k = int(len(labels) * frac)
    indices = random.sample(range(len(labels)), k)
    partial_labels = np.full(len(labels), -1)
    partial_labels[indices] = labels[indices]
    return partial_labels

X = iris.data
y = iris.target
X = StandardScaler().fit_transform(X)

partial_labels = sample_labels(y, frac=0.05)

# Build model
model = HDBSCAN(semi_supervised=True, ss_algorithm="bc").fit(X, partial_labels)
print(f"cluster_labels for Iris: {model.labels_}")

lmcinnes commented 3 weeks ago

It looks like you need a default argument for data_labels in the fast_hdbscan function. A default of None, plus a check that you aren't running in semi-supervised mode with labels equal to None, seems like a natural choice.
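A rough sketch of what such a check might look like. `check_semi_supervised_inputs` is a hypothetical helper, not a function in fast_hdbscan, and the length check and error messages are assumptions added for illustration:

```python
import numpy as np


def check_semi_supervised_inputs(data, data_labels=None, semi_supervised=False):
    """Hypothetical helper: validate the labels argument before clustering."""
    if not semi_supervised:
        # Unsupervised mode: labels are not needed, so None is acceptable.
        return None
    if data_labels is None:
        raise ValueError("semi_supervised=True requires data_labels, got None")
    data_labels = np.asarray(data_labels)
    # Assumed extra check: one label (or -1) per data point.
    if data_labels.shape[0] != len(data):
        raise ValueError("data_labels must have one entry per data point")
    return data_labels
```

With a default of None, existing unsupervised calls keep working unchanged, while a semi-supervised call without labels fails early with a clear message.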