gagolews / genieclust

Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection - in Python and R
https://genieclust.gagolewski.com
Other
58 stars 10 forks source link

apply_genie(), apply_gic(): return a hierarchy, not a k-partition #8

Closed gagolews closed 4 years ago

gagolews commented 4 years ago

https://github.com/gagolews/genie/blob/master/src/hclust2_result.h https://github.com/gagolews/genie/blob/master/src/hclust2_result.cpp

merge matrix

?hclust Value in R

 merge: an n-1 by 2 matrix.  Row i of ‘merge’ describes the merging
          of clusters at step i of the clustering.  If an element j in
          the row is negative, then observation -j was merged at this
          stage.  If j is positive then the merge was with the cluster
          formed at the (earlier) stage j of the algorithm.  Thus
          negative entries in ‘merge’ indicate agglomerations of
          singletons, and positive entries indicate agglomerations of
          non-singletons.

  height: a set of n-1 real values (non-decreasing for ultrametric
          trees).  The clustering _height_: that is, the value of the
          criterion associated with the clustering ‘method’ for the
          particular agglomeration.

   order: a vector giving the permutation of the original observations
          suitable for plotting, in the sense that a cluster plot using
          this ordering and matrix ‘merge’ will not have crossings of
          the branches.
gagolews commented 4 years ago

https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html

https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage

 A :math:`(n-1)` by 4 matrix ``Z`` is returned. At the
    :math:`i`-th iteration, clusters with indices ``Z[i, 0]`` and
    ``Z[i, 1]`` are combined to form cluster :math:`n + i`. A
    cluster with an index less than :math:`n` corresponds to one of
    the :math:`n` original observations. The distance between
    clusters ``Z[i, 0]`` and ``Z[i, 1]`` is given by ``Z[i, 2]``. The
    fourth value ``Z[i, 3]`` represents the number of original
    observations in the newly formed cluster.
gagolews commented 4 years ago

sklearn has https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering

Attributes

n_clusters_      int

    The number of clusters found by the algorithm. If distance_threshold=None, it will be equal to the given n_clusters.

labels_    ndarray of shape (n_samples)

    cluster labels for each point

n_leaves_     int

    Number of leaves in the hierarchical tree.

n_connected_components_    int

    The estimated number of connected components in the graph.

children_   array-like of shape (n_samples-1, 2)

    The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node n_samples + i

distances_    array of shape (n_samples-1, 1) ????

 Not documented

see https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html for conversion to scipy

gagolews commented 4 years ago

Just like in genie::grup::HClustResult, when calling ds.merge(i1, i2), the following info can be stored:

links has n-1 rows and 2 columns, height is of length n-1

negative links[iter,:] when requested an incomplete hierarchy (e.g., K-cut for specific K)

negative heights == non-standard merges, e.g.:

gagolews commented 4 years ago

Now: links[iter]==j -- index of merged mst_i[j]