TutteInstitute / fast_hdbscan

A fast multi-core implementation of HDBSCAN for low dimensional Euclidean spaces
BSD 2-Clause "Simplified" License

Plotting a condensed tree result using the hdbscan.plots.CondensedTree class #25

Open u3ks opened 1 week ago

u3ks commented 1 week ago

Hi all,

I'm trying to plot the output of fast_hdbscan.cluster_trees.condense_tree using the hdbscan.plots.CondensedTree class. I tried converting the result like so:

ct_raw = np.rec.fromarrays((ct[0], ct[1], ct[2], ct[3]), dtype=[('parent', np.intp), ('child', np.intp), ('lambda_val', float), ('child_size', np.intp)])

I then pass it to the constructor, CondensedTree(ct_raw), but I get an error indicating that some parent nodes in the ct_raw array have no children.
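For reference, the conversion can also be written with plain NumPy field assignment. This is only a minimal sketch: `to_raw_tree` and `RAW_TREE_DTYPE` are hypothetical names, and the field layout is taken from the snippet above rather than from either library's documentation.

```python
import numpy as np

# Field names and types copied from the conversion attempt above.
RAW_TREE_DTYPE = np.dtype([
    ("parent", np.intp),
    ("child", np.intp),
    ("lambda_val", np.float64),
    ("child_size", np.intp),
])


def to_raw_tree(parent, child, lambda_val, child_size):
    """Pack four equal-length arrays into one structured array (hypothetical helper)."""
    parent = np.asarray(parent)
    raw = np.empty(parent.shape[0], dtype=RAW_TREE_DTYPE)
    raw["parent"] = parent
    raw["child"] = child
    raw["lambda_val"] = lambda_val
    # Note: non-integer sizes would be truncated by the integer field.
    raw["child_size"] = child_size
    return raw


# Toy tree: a single root cluster with id 3 holding points 0, 1 and 2.
example = to_raw_tree([3, 3, 3], [0, 1, 2], [0.5, 0.5, 1.0], [1, 1, 1])
```

Building the array field by field makes it obvious that a stray character in a field name (for example a leading space) would fail at assignment time instead of later inside the plotting code.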

Specifically, the .max() call below (from hdbscan.plots.CondensedTree.get_plot_data) raises an exception because it's being called on an empty array:

`for c in range(last_leaf, root - 1, -1):
    cluster_bounds[c] = [0, 0, 0, 0]

    c_children = self._raw_tree[self._raw_tree['parent'] == c]
    current_size = np.sum(c_children['child_size'])
    current_lambda = cluster_y_coords[c]
    cluster_max_size = current_size
    cluster_max_lambda = c_children['lambda_val'].max()`

Do you have any pointers on how to convert between the two representations, or on how to change the get_plot_data function?
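To narrow down which cluster ids trigger the empty-array `.max()`, something like the following could be run on the converted array before plotting. It is a rough diagnostic sketch: `parents_without_children` is a hypothetical helper, and it assumes that `root` and `last_leaf` in the loop above are the smallest and largest values in the `parent` column.

```python
import numpy as np


def parents_without_children(raw_tree):
    """Ids in [min parent, max parent] that never occur in the 'parent' column.

    The plotting loop visits every integer in that range, so each id returned
    here would select zero rows and make `.max()` fail.
    """
    parents = np.unique(raw_tree["parent"])
    expected = np.arange(parents.min(), parents.max() + 1)
    return np.setdiff1d(expected, parents)


# Toy data: cluster ids 4, 5 and 7 have child rows, but 6 has none.
dtype = [("parent", np.intp), ("child", np.intp),
         ("lambda_val", np.float64), ("child_size", np.intp)]
toy = np.array(
    [(4, 5, 0.1, 2), (4, 7, 0.1, 2), (5, 0, 0.3, 1),
     (5, 1, 0.3, 1), (7, 2, 0.4, 1), (7, 3, 0.4, 1)],
    dtype=dtype,
)
missing = parents_without_children(toy)
```

An empty result means every id in the range has at least one child row, so this particular failure should not occur.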

lmcinnes commented 5 days ago

You may have ended up with a condensed forest instead of a condensed tree. That shouldn't really be possible, but perhaps there is a bug that makes it possible? I would need to see the actual tree data to diagnose...
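One quick way to tell a tree from a forest is to count the roots, meaning ids that appear as a parent but never as a child. The sketch below uses only NumPy; `find_roots` is a hypothetical helper name and not part of fast_hdbscan or hdbscan.

```python
import numpy as np


def find_roots(parent, child):
    """Return ids that occur in `parent` but never in `child`."""
    return np.setdiff1d(np.unique(parent), np.unique(child))


# A proper tree: cluster 5 splits into 6 and 7, which hold the points.
tree_roots = find_roots(parent=[5, 5, 6, 6, 7, 7], child=[6, 7, 0, 1, 2, 3])

# A forest: clusters 5 and 6 both hold points but share no common ancestor.
forest_roots = find_roots(parent=[5, 5, 6, 6], child=[0, 1, 2, 3])
```

A single root is what a condensed tree should have; more than one suggests the forest case.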

u3ks commented 4 days ago

Actually, I think I found the issue: I was testing out the new sample weights functionality, and one sample had a weight larger than the specified min_cluster_size.

Maybe throwing a warning for this in the initial tree construction would be beneficial?
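Something along these lines is what I have in mind. It is only a sketch of a standalone check: `check_sample_weights` is a hypothetical function, not existing fast_hdbscan API, and where it would be called during tree construction is left open.

```python
import warnings

import numpy as np


def check_sample_weights(sample_weights, min_cluster_size):
    """Warn if any single sample weight exceeds min_cluster_size.

    Returns the indices of the offending samples (empty if there are none).
    """
    weights = np.asarray(sample_weights, dtype=float)
    heavy = np.flatnonzero(weights > min_cluster_size)
    if heavy.size > 0:
        warnings.warn(
            f"{heavy.size} sample weight(s) exceed min_cluster_size="
            f"{min_cluster_size}; the condensed tree may be malformed.",
            UserWarning,
            stacklevel=2,
        )
    return heavy
```

Returning the indices lets the caller decide whether to clip the weights, drop the samples, or just inspect them.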

lmcinnes commented 3 days ago

Yes, that might be something that would be sensible. The sample weight stuff is pretty new so it isn't well tested yet.