Closed nsakr closed 2 weeks ago
It looks like you need a default argument for data_labels in the fast_hdbscan function. A default of None, plus a check that you aren't running in semi-supervised mode with labels equal to None, seems like a natural choice.
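Something along these lines, as a rough sketch only. The helper name validate_data_labels and its signature are hypothetical and not taken from the actual fast_hdbscan code; the point is just the None default and the guard:

```python
import numpy as np


def validate_data_labels(data_labels=None, semi_supervised=False):
    """Hypothetical helper (not part of fast_hdbscan's API).

    Returns None when no labels are given in unsupervised mode,
    raises if semi-supervised mode is requested without labels,
    and otherwise returns the labels as a NumPy array.
    """
    if data_labels is None:
        if semi_supervised:
            raise ValueError(
                "semi_supervised=True requires data_labels; "
                "mark unlabelled points with -1."
            )
        return None
    return np.asarray(data_labels)


# Unsupervised call: no labels needed, nothing to validate.
print(validate_data_labels())

# Semi-supervised call: labels are required, -1 marks unlabelled points.
print(validate_data_labels([0, -1, 1, -1], semi_supervised=True))
```

With a guard like this, existing unsupervised callers would not need to pass labels at all, and a missing label array in semi-supervised mode fails early with a clear message.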
This PR implements the semi-supervised HDBSCAN*(BC) algorithm published in this paper: Castro Gertrudes, J., Zimek, A., Sander, J. et al. A unified view of density-based methods for semi-supervised clustering and classification. Data Min Knowl Disc 33, 1894–1952 (2019).
It can be used as such:
model = HDBSCAN(semi_supervised=True, ss_algorithm="bc").fit(X, partial_labels)
semi_supervised: False by default. Set to True if you have some labels and want to use fast_hdbscan in semi-supervised mode.
ss_algorithm: None by default. Can be set to "bc", which is the HDBSCAN(BC) algorithm, or "bc_without_vc", which is short for "HDBSCAN(BC) without virtual nodes". This gives the user the option to consider or not consider virtual nodes. Virtual nodes denote pre-labeled singleton noise objects. Refer to Fig. 6 in the paper.
partial_labels in this example is simply an array. Unlabelled points should be set to -1.
Here's a basic example with the Iris data: