DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

Extend the InterclusterDistance visualizer #582

Open bbengfort opened 6 years ago

bbengfort commented 6 years ago

The InterclusterDistance visualizer is our newest cluster visualization, and while it's been implemented completely, there are still a few updates I'd like to make to it:

Notes on colors

Right now the facecolor of the clusters is hard-coded to #2e719344 and the edgecolor is hard-coded to #2e719399. Note the 44 and 99 at the end of the colors, respectively: these set the opacity. The edge is more opaque than the face so that overlapping clusters remain visible.

I would like to support the user specifying a color for all clusters or a colormap/colors for each cluster as well as the ability to specify the face opacity. If the user specifies these things, then we have to compute the relative alpha (opacity) for both the edge and the face to maintain the currently hardcoded behavior.
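As a rough sketch of the relative alpha computation, the helper below derives the face and edge colors from a single base color and a face opacity. The name cluster_colors and its parameters are hypothetical, and keeping the edge/face opacity ratio fixed at 0x99 / 0x44 (2.25) is only one assumed way to "maintain the currently hardcoded behavior"; the edge opacity is capped at fully opaque.

```python
def cluster_colors(color="#2e7193", face_alpha=0x44 / 255, edge_ratio=0x99 / 0x44):
    """Return (facecolor, edgecolor) as '#rrggbbaa' strings.

    Hypothetical helper: the edge opacity is the face opacity scaled by
    edge_ratio (assumed to be the ratio of the hard-coded 99 and 44),
    clipped so it never exceeds fully opaque.
    """
    rgb = color.lstrip("#").lower()
    if len(rgb) != 6:
        raise ValueError("expected a color of the form '#rrggbb'")
    int(rgb, 16)  # raises ValueError if the digits are not hexadecimal
    if not 0.0 <= face_alpha <= 1.0:
        raise ValueError("face_alpha must be between 0 and 1")

    edge_alpha = min(1.0, face_alpha * edge_ratio)

    def alpha_hex(alpha):
        # Convert an opacity in [0, 1] to a two-digit hex channel.
        return format(round(alpha * 255), "02x")

    return f"#{rgb}{alpha_hex(face_alpha)}", f"#{rgb}{alpha_hex(edge_alpha)}"
```

With the defaults this reproduces the two hard-coded values. A per-cluster colormap could call the same helper once per cluster color.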

Notes on supported algorithms

Right now we use the cluster_centers_ attribute of the model to embed the centers into two-dimensional space and the labels_ attribute to score/size the clusters. Unfortunately, not all clustering algorithms have these attributes, so we need to extend the cluster_center_ property on the visualizer to either find a different attribute or compute the cluster centers somehow. Below is a listing of various clustering algorithms and their attributes.

We would like to ensure support for the following clustering algorithms:

AgglomerativeClustering (Ward and Average)
 - children_
 - labels_
 - n_components_
 - n_leaves_

Birch
 - dummy_leaf_
 - fit_
 - labels_
 - partial_fit_
 - root_
 - subcluster_centers_
 - subcluster_labels_

FeatureAgglomeration
 - children_
 - labels_
 - n_components_
 - n_leaves_

decomposition.LatentDirichletAllocation
 - bound_
 - components_
 - doc_topic_prior_
 - exp_dirichlet_component_
 - n_batch_iter_
 - n_iter_
 - random_state_
 - topic_word_prior_
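For the estimators above that expose labels_ but no cluster_centers_ (AgglomerativeClustering and Birch in this listing), one possible fallback is to take the mean of each cluster's members as its center. This is a minimal sketch under that assumption, not what the visualizer currently does. The function name centroids_from_labels is hypothetical. It would not apply directly to FeatureAgglomeration, whose labels_ index features rather than samples, or to LatentDirichletAllocation, which has no labels_ in the listing.

```python
import numpy as np


def centroids_from_labels(X, labels):
    """Compute one center per cluster as the mean of its member samples.

    Hypothetical helper. X is an (n_samples, n_features) array and labels
    is the matching (n_samples,) array of cluster labels.
    Returns (unique_labels, centers), with centers[i] belonging to
    unique_labels[i].
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    if X.shape[0] != labels.shape[0]:
        raise ValueError("X and labels must have the same number of samples")

    unique_labels = np.unique(labels)
    centers = np.vstack([X[labels == k].mean(axis=0) for k in unique_labels])
    return unique_labels, centers
```

The training data X has to be available when the centers are requested, so the visualizer would need to compute them during fit rather than lazily afterwards.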

It would be great if we could also support the following clustering algorithms, but it's not clear whether that's possible, because they have no obvious centers or labels:

DBSCAN
 - components_
 - core_sample_indices_
 - labels_

mixture.GaussianMixture
 - converged_
 - covariances_
 - lower_bound_
 - means_
 - n_iter_
 - precisions_
 - precisions_cholesky_
 - weights_

SpectralClustering
 - affinity_matrix_
 - labels_
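For sizing the clusters of these estimators, here is a hedged sketch of two options. Both function names are hypothetical. The first counts members from labels_ and skips the -1 label that DBSCAN assigns to noise samples. The second scales the mixture weights_ of a GaussianMixture (which sum to 1) by the number of samples, on the assumption that weights approximate the membership share. Whether either score is the right one for the visualizer is an open question.

```python
import numpy as np

NOISE_LABEL = -1  # DBSCAN assigns -1 to samples it treats as noise


def sizes_from_labels(labels):
    """Count the members of each cluster, ignoring noise samples.

    Assumes labels are integers 0..k-1, plus NOISE_LABEL for noise.
    """
    labels = np.asarray(labels)
    return np.bincount(labels[labels != NOISE_LABEL])


def sizes_from_weights(weights, n_samples):
    """Approximate cluster sizes from mixture weights that sum to 1."""
    return np.asarray(weights, dtype=float) * n_samples
```

For a GaussianMixture, the means_ attribute looks like the natural candidate for the centers, and predict(X) could supply labels if an exact member count is preferred over the weights.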

We already have support for the following clustering algorithms (using the cluster_centers_ attribute for embedding and the labels_ attribute for scoring):

AffinityPropagation
 - affinity_matrix_
 - cluster_centers_
 - cluster_centers_indices_
 - labels_
 - n_iter_

KMeans
 - cluster_centers_
 - inertia_
 - labels_
 - n_iter_

MiniBatchKMeans
 - cluster_centers_
 - counts_
 - inertia_
 - init_size_
 - labels_
 - n_iter_

MeanShift
 - cluster_centers_
 - labels_
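
To illustrate the embedding step for these supported estimators, here is a small sketch that projects cluster_centers_ into two dimensions. It uses a PCA projection computed with an SVD purely as a stand-in; it is not necessarily the embedding the visualizer uses. The name embed_centers_2d is hypothetical, and the sketch assumes at least two clusters and at least two features.

```python
import numpy as np


def embed_centers_2d(centers):
    """Project cluster centers onto their first two principal components.

    Hypothetical stand-in for the visualizer's embedding step.
    centers is a (n_clusters, n_features) array such as cluster_centers_.
    """
    centers = np.asarray(centers, dtype=float)
    centered = centers - centers.mean(axis=0)
    # The rows of vt are the principal directions; np.linalg.svd returns
    # them ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

The resulting (n_clusters, 2) array gives the position of each cluster on the plot, and the membership counts from labels_ determine how large each one is drawn.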

jaywalkingbackwards commented 5 years ago

Greetings! Can I use it via Anaconda? I can't install version 0.9 of Yellowbrick in order to use InterclusterDistance. If I can't, please tell me if there is another easy way to find the intercluster distance of scikit-learn's k-means. Thanks!

lwgray commented 5 years ago

@jaywalkingbackwards we haven’t deployed v0.9 to anaconda yet. It is one of our highest priorities. I am not aware of a different way to find intercluster distance. Have you taken a look at our code? https://github.com/DistrictDataLabs/yellowbrick/blob/develop/yellowbrick/cluster/icdm.py

@bbengfort or @rebeccabilbro any comments?

bbengfort commented 5 years ago

@jaywalkingbackwards version 0.9 has been released to conda - if you update your Yellowbrick install you should have access to ICDM now!