Add HDBSCAN*(BC) implementation.

This PR implements the semi-supervised HDBSCAN*(BC) algorithm described in: Castro Gertrudes, J., Zimek, A., Sander, J. et al. A unified view of density-based methods for semi-supervised clustering and classification. Data Min Knowl Disc 33, 1894–1952 (2019).

It can be used as follows:

model = HDBSCAN(semi_supervised=True, ss_algorithm="bc").fit(X, partial_labels)

semi_supervised: False by default. Set to True if some of your points have known labels and you want to run fast_hdbscan in semi-supervised mode.

ss_algorithm: None by default. Can be set to "bc", which selects the HDBSCAN*(BC) algorithm, or to "bc_without_vc", which is short for "HDBSCAN*(BC) without virtual nodes". This lets the user choose whether virtual nodes are considered. Virtual nodes denote pre-labeled singleton noise objects; see Fig. 6 in the paper.

partial_labels in this example is simply an integer array with one entry per point in X. Labelled points hold their class label; unlabelled points should be set to -1.
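When the known labels come from a handful of annotated points rather than from a full label vector, the array can be built directly. The following is a minimal sketch: the helper name `make_partial_labels` and the example indices are hypothetical and not part of fast_hdbscan, and rejecting negative class labels is an assumption made here because -1 is reserved for unlabelled points.

```python
import numpy as np


def make_partial_labels(n_samples, known):
    """Build a label array where every point is -1 unless listed in `known`.

    `known` maps a point index to its class label.
    Hypothetical helper for illustration; not part of fast_hdbscan.
    """
    partial_labels = np.full(n_samples, -1, dtype=np.int64)
    for index, label in known.items():
        # Assumption: class labels are non-negative, since -1 means "unlabelled".
        if label < 0:
            raise ValueError("class labels must be non-negative; -1 means unlabelled")
        partial_labels[index] = label
    return partial_labels


# Ten points, three of which carry a known class.
partial_labels = make_partial_labels(10, {0: 0, 4: 1, 9: 1})
print(partial_labels)
print("labelled points:", int((partial_labels != -1).sum()))
```

The resulting array can be passed as the second argument to fit in the same way as in the one-line usage shown earlier.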

Here's a basic example with the Iris data:

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import random
import numpy as np

from fast_hdbscan import HDBSCAN

iris = datasets.load_iris()

# Keep the true label for a random fraction of the points; mark the rest as unlabelled (-1).
def sample_labels(labels, frac=0.05, seed=1):
    random.seed(seed)
    k = int(len(labels) * frac)
    indices = random.sample(range(len(labels)), k)
    partial_labels = np.full(len(labels), -1)
    partial_labels[indices] = labels[indices]
    return partial_labels

X = iris.data
y = iris.target
X = StandardScaler().fit_transform(X)

partial_labels = sample_labels(y, frac=0.05)

# Build model
model = HDBSCAN(semi_supervised=True, ss_algorithm="bc").fit(X, partial_labels)
print(f"cluster_labels for Iris: {model.labels_}")
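One quick sanity check on the result is whether pre-labelled points that end up in the same cluster agree on their class. The sketch below is illustrative only: `mixed_clusters` is a hypothetical helper, not part of fast_hdbscan. It assumes noise points are marked -1 in the cluster labels, and it does not assume that cluster ids coincide with class ids. Toy arrays stand in for model.labels_ and partial_labels so the block runs on its own.

```python
import numpy as np


def mixed_clusters(cluster_labels, partial_labels):
    """Return {cluster_id: known classes} for clusters holding more than one known class.

    Hypothetical helper for illustration; not part of fast_hdbscan.
    """
    cluster_labels = np.asarray(cluster_labels)
    partial_labels = np.asarray(partial_labels)
    mixed = {}
    for cluster_id in np.unique(cluster_labels):
        if cluster_id == -1:  # assumed noise marker; skip it
            continue
        in_cluster = cluster_labels == cluster_id
        # Known classes among the pre-labelled points of this cluster.
        known = partial_labels[in_cluster & (partial_labels != -1)]
        classes = np.unique(known)
        if classes.size > 1:
            mixed[int(cluster_id)] = classes.tolist()
    return mixed


# Toy data: cluster 1 contains pre-labelled points of classes 1 and 2.
toy_clusters = np.array([0, 0, 0, 1, 1, 1, -1])
toy_partial = np.array([0, -1, 0, 1, 2, -1, 1])
print(mixed_clusters(toy_clusters, toy_partial))
```

For the Iris example, the call would be mixed_clusters(model.labels_, partial_labels); an empty dict means no cluster mixes known classes.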

TutteInstitute / fast_hdbscan #21