DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

Add pairwise distance metrics to scoring metrics in KElbowVisualizer #1238

Closed lwgray closed 2 years ago

lwgray commented 2 years ago

PR in response to a Stack Overflow question: https://stackoverflow.com/questions/69608173/yellowbrick-is-it-possible-to-pass-in-different-pairwise-distance-metrics-for-s

Summary

Sklearn defines a large number of pairwise distance metrics for something like silhouette score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

For example, silhouette score can be computed with any of these distance metrics: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']
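As a minimal sketch of what that looks like in scikit-learn directly: `silhouette_score` takes a `metric` argument, so the same clustering can be scored under several distance metrics. The `make_blobs` data and the choice of 4 clusters are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data, assumed for illustration only.
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Score the same labels under different pairwise distance metrics.
scores = {
    name: silhouette_score(X, labels, metric=name)
    for name in ("euclidean", "manhattan", "cosine")
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The clustering is identical in each case; only the distance used to score it changes.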

However, KElbowVisualizer only lets you select silhouette as the scoring metric, with no way to choose the distance metric behind it:

KElbowVisualizer(KMeans(), k=(4, 12), metric='silhouette')

In that case it uses the silhouette score's default distance metric, 'euclidean'. I wanted to make it possible to run KElbowVisualizer with a distance metric other than the default.
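For comparison, here is a rough sketch of doing the same sweep by hand with scikit-learn alone. `silhouette_by_k` is a hypothetical helper written for this example, not part of yellowbrick, and the `make_blobs` data is assumed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, distance_metric="euclidean"):
    """Hypothetical helper: silhouette score per k under a chosen metric."""
    results = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        results[k] = silhouette_score(X, labels, metric=distance_metric)
    return results


# Toy data, assumed for illustration only.
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

# Same k range as the KElbowVisualizer call above: 4 through 11.
manhattan_scores = silhouette_by_k(X, range(4, 12), distance_metric="manhattan")
best_k = max(manhattan_scores, key=manhattan_scores.get)
print(best_k, manhattan_scores[best_k])
```

Note that KMeans still clusters with Euclidean distance here; only the scoring step uses the alternative metric.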

Changes

  1. I added the ability to specify pairwise distance metrics for our scoring functions (the distance_metric argument in the sample below)

Sample Code and Plot

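from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer

# Synthetic data stands in for X, which the original snippet left undefined.
X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

model = KMeans(random_state=0)
# distance_metric is the argument added by this PR.
visualizer = KElbowVisualizer(model, k=5, metric="distortion",
                              distance_metric='manhattan', timings=False,
                              locate_elbow=False)
visualizer.fit(X)
visualizer.finalize()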

[image: sample plot produced by the code above]
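To give a sense of what a distance metric means for the distortion score, here is a hedged sketch that assumes distortion is the sum of squared distances from each point to the mean of its cluster. The `distortion` helper below is hypothetical and written with scikit-learn only; it is not the code in yellowbrick/cluster/elbow.py.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances


def distortion(X, labels, distance_metric="euclidean"):
    """Hypothetical helper: sum of squared point-to-centroid distances."""
    total = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        # Cluster mean as a single-row 2D array, as pairwise_distances expects.
        center = members.mean(axis=0, keepdims=True)
        d = pairwise_distances(members, center, metric=distance_metric)
        total += float((d ** 2).sum())
    return total


# Toy data, assumed for illustration only.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

euclidean_distortion = distortion(X, labels)
manhattan_distortion = distortion(X, labels, distance_metric="manhattan")
print(euclidean_distortion, manhattan_distortion)
```

Because the Manhattan distance between two points is never smaller than the Euclidean distance, the Manhattan value is always at least as large, so the two curves are not directly comparable in magnitude.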

codecov[bot] commented 2 years ago

Codecov Report

Merging #1238 (0bfea0f) into develop (092c0ca) will increase coverage by 0.01%. The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #1238      +/-   ##
===========================================
+ Coverage    90.48%   90.49%   +0.01%     
===========================================
  Files           92       92              
  Lines         5200     5206       +6     
===========================================
+ Hits          4705     4711       +6     
  Misses         495      495              
Impacted Files                  Coverage Δ
yellowbrick/cluster/elbow.py    97.84% <100.00%> (+0.09%) :arrow_up:

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 092c0ca...0bfea0f.