Let `KElbowVisualizer` use all the distance metrics supported by sklearn

DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.

http://www.scikit-yb.org/

Apache License 2.0

4.29k stars 559 forks source link

Let `KElbowVisualizer` use all the distance metrics supported by sklearn #1296

Closed stergion closed 1 year ago

stergion commented 1 year ago

Describe the solution you'd like Is there a reason KElbowVisualizer can use only the distance metrics in DISTANCE_METRICS?

The distortion_score relies on sklearn.metrics.pairwise.pairwise_distances(), that supports distance metrics not included in DISTANCE_METRICS. For the silhouette score KElbowVisualizer uses sklearn.metrics.cluster._unsupervised.silhouette_score(), that also supports metrics not included in DISTANCE_METRICS.

I don't see a reason why there is this limitation.

bbengfort commented 1 year ago

@stergion it actually is supposed to use any callable that the user supplies; the docstring says:

distance_metric : str or callable, default='euclidean' The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn's metrics.pairwise.pairwise_distances. If X is the distance array itself, use metric="precomputed".

So I think we need to have the DISTANCE_METRICS check only happen if the input is a string. I'll open a quick PR for a fix shortly.

bbengfort commented 1 year ago

@stergion PR open if you'd like to take a look.

bbengfort commented 1 year ago

@stergion and it should be able to use any sklearn distance metric; the DISTANCE_METRICS constant is out of date with the new version of scikit-learn; luckily sklearn now provides a sklearn.metrics.DistanceMetric.get_metric function in v1.0 that will make it easier for us to check if the distance metric is valid.

To answer why we check if the metric is valid or not - we're trying to prevent the user from running a potentially lengthy fit process before the exception is raised by scikit-learn. But you're right, we need a generic way to allow scikit-learn to change without breaking Yellowbrick, and I think this new metrics functionality is the right answer.

stergion commented 1 year ago

@bbengfort thank you for your time, you answered all my questions and solved the problem.

bbengfort commented 1 year ago

@stergion thank you so much for using Yellowbrick and for all your contributions to the library!