elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0
785 stars 323 forks source link

Extracting hierarchy from DBSCAN #69

Closed neilireson closed 4 years ago

neilireson commented 4 years ago

I use ELKI within my Java code and have been trying to export the cluster hierarchy generated with HDBSCAN, however this just results in a single root cluster with the child cluster all being leaves.

In order to "fix" this I changed the collectChildren method in the HDBSCANHierarchyExtraction class. Replacing collectChildren(temp, clustering, child, clus, flatten); with finalizeCluster(child, clustering, clus, flatten);

This does seem to result in a proper hierarchy, although does return all clusters (including those with fewer than minPts data points). However, my understanding of the code is not enough to know whether this is in any way sensible or correct.

I use the following code to output the hierarchy:

{
        ...create clustering code...

        Relation<NumberVector> coords = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD_2D);
        List<Cluster<DendrogramModel>> topClusters = clustering.getToplevelClusters();
        Hierarchy<Cluster<DendrogramModel>> hierarchy = clustering.getClusterHierarchy();

        for (Cluster<DendrogramModel> cluster : topClusters) {
            System.out.println("---------------------------------");
            outputHierarchy(cluster, hierarchy, coords, "");
        }
}

private static void outputHierarchy(Cluster<DendrogramModel> cluster,
                                    Hierarchy<Cluster<DendrogramModel>> hierarchy,
                                    Relation<NumberVector> coords,
                                    String indent) {
    final DBIDs ids = cluster.getIDs();
    DendrogramModel model = cluster.getModel();
    System.out.format("%s%s: %d : %.3f%n", indent, cluster.getName(), ids.size(), model.getDistance());
    if (!ids.isEmpty()) {
        System.out.print(indent);
        for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance()) {
            System.out.print(Arrays.toString(coords.get(iter).toArray()));
        }
        System.out.println();
    }
    if (hierarchy != null) {
        if (hierarchy.numChildren(cluster) > 0) {
            for (It<Cluster<DendrogramModel>> iter = hierarchy.iterChildren(cluster); iter.valid();
                 iter.advance()) {
                outputHierarchy(iter.get(), hierarchy, coords, indent + "  ");
            }
        }
    }
}
kno10 commented 4 years ago

That fix is not correct - the method does need to switch between a "flattening" and a non-flattening strategy. The HDBSCAN logic is based on a rather complex notion of cluster stability; and there could of course be a bug in how this value is computed. But from a quick look, this is supposed to yield a flat clustering then. Because of this notion of cluster stability, I'm not sure if such a hierarchical result is well defined with the HDBSCAN extractor (maybe the clustering.hierarchical.extraction.SimplifiedHierarchyExtraction is more of what you were looking for?)

neilireson commented 4 years ago

Thanks for that. It seems like extracting what I need will require a little more consideration. I have used the Python HDBSCAN code and that does prune the tree to produce reasonable clusters but still provide a hierarchy. Perhaps I'll try to compare the code, however my knowledge of Python is scant.