DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

Improvements to K-Elbow Visualizer #515

Open bbengfort opened 6 years ago

bbengfort commented 6 years ago

The following tasks from #91 are left over to finalize the implementation of the KElbowVisualizer and improve it.

Note to contributors: items in the checklist below don't need to be completed in a single PR; if you see one that catches your eye, feel free to pick it off the list!

Stretch goals:

The stretch goals can be converted to their own issues if not completed in this feature request.

lacanlale commented 6 years ago

Hi, quick question about the documentation: should the changes be made to the .rst file AND the README, or strictly just the .rst?

rebeccabilbro commented 6 years ago

Hi there @lacanlale - thanks for checking out Scikit-Yellowbrick! Yes, for enhancing the documentation of the KElbowVisualizer, the updates should go in the .rst files that generate the Sphinx docs (in this case, yellowbrick/docs/api/cluster/elbow.rst). As you can see, the current version of the elbow visualizer doc page is pretty thin on describing the use cases and workflow for doing k-selection, so there's a lot of potential for fleshing out the page with some more instructive paragraphs! You may also want to check out the section on k-elbow in the examples.ipynb document, which might have some useful content that could be repurposed here.

lacanlale commented 6 years ago

Hi thank you for the response, I will be taking time to look into it!

lacanlale commented 6 years ago

Hey I wanted to look into the inertia metric but I'm confused as to what to look for and the relevancy of what's linked. Can I have some clarification please?

bbengfort commented 6 years ago

@lacanlale sorry for the confusion, I'm happy to clarify with the story so far. Right now our elbow method uses three metrics: distortion score, Calinski-Harabaz, and silhouette. Of these, distortion scores are most commonly used to create elbows (in fact, there is some question about whether or not the other two metrics create elbows at all, but they do create sharp peaks, which are easy to visually identify).

scikit-learn's KMeans implementation uses inertia (a distortion-like metric) to optimize its clusters, and stores those values in the estimator as inertia_, meaning that other implementations of the elbow visualizer simply use this property.

Here's the catch, though: the inertia_ property is only available on KMeans. As a result, we simply implemented our own distortion score metric. However, inertia is different enough from distortion that it might be useful, and it improves the performance of the visualizer because no extra computation is needed, so we'd like to add it in as a (non-default) metric:

  1. Add this metric to the list of valid metrics
  2. Compute this metric by simply grabbing it off of each KMeans estimator from inertia_
  3. Raise an exception if the clustering estimator does not have the inertia_ property
  4. Add documentation to describe when inertia should be used over distortion and what the difference is.
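Steps 2 and 3 above could be sketched roughly as follows. This is only an illustration, not the visualizer's actual code: the `inertia_score` function and the `MetricUnavailableError` exception are hypothetical names, and the two small classes stand in for a fitted KMeans and for some other clusterer that has no `inertia_`.

```python
class MetricUnavailableError(Exception):
    """Hypothetical exception type; the real visualizer may use a different one."""


def inertia_score(estimator):
    """Read the inertia off an already fitted estimator (hypothetical helper).

    No extra computation is done: the value is simply looked up on the
    estimator, and an exception is raised if the property is missing.
    """
    try:
        return estimator.inertia_
    except AttributeError:
        name = type(estimator).__name__
        raise MetricUnavailableError(
            f"{name} has no inertia_ property; use another metric instead"
        ) from None


# Stand-ins for illustration only (assumptions, not scikit-learn classes).
class FittedKMeansLike:
    inertia_ = 42.5


class ClustererWithoutInertia:
    pass


score = inertia_score(FittedKMeansLike())

try:
    inertia_score(ClustererWithoutInertia())
    raised = False
except MetricUnavailableError:
    raised = True
```

Catching `AttributeError` rather than checking the estimator's type keeps the helper usable with any estimator that happens to expose `inertia_`.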

Hopefully that helps and provides some additional background. Let me know if you have any additional questions!

Kautumn06 commented 6 years ago

Hi @lacanlale, thank you for contributing to Yellowbrick! KMeans tries to separate the samples into n groups of equal variance, minimizing inertia, which is also known as the within-cluster sum-of-squares (WCSS). I highly recommend scikit-learn's documentation, and I'm also tagging @mattharrison, who had originally commented on #91 about the possibility of adding inertia_ as a metric to the KElbowVisualizer, so he might be willing to chime in about how he uses it in his own work or explains it to his students.
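To make the within-cluster sum-of-squares concrete, here is a small pure-Python sketch that adds up the squared distance from each sample to the center of its assigned cluster. The `wcss` function name and the toy points, labels, and centers are made up for illustration; in practice a fitted KMeans already stores this quantity.

```python
def wcss(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Toy data (assumed for illustration): two clusters on a line.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
labels = [0, 0, 1, 1]
centers = [(1.0, 0.0), (11.0, 0.0)]

# Every point sits exactly 1 unit from its center, so the total is 4.0.
result = wcss(points, labels, centers)
```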

lacanlale commented 6 years ago

Thanks for the responses (both of you!) This clarifies a lot, and I'll definitely be looking into it!

vivienneinus commented 5 years ago

Hi, I found a dataset that works better for the Elbow Method, regarding the stretch goal "Find example dataset with clear elbow curve demonstration to use in documentation". The dataset is created with make_blobs: 500 points with 8 features and 4 centroids.

[Image: images_calinski_harabaz]
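A dataset like the one described could be generated along these lines. This is a sketch under assumptions: the `random_state` value, the range of k, and the `n_init` setting are not given in the comment, and the loop reads scikit-learn's `inertia_` directly rather than using the visualizer.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 500 points, 8 features, 4 centroids, as described above.
# random_state=42 is an assumption so the example is repeatable.
X, _ = make_blobs(n_samples=500, n_features=8, centers=4, random_state=42)

# Fit KMeans for a range of k (range chosen for illustration) and
# record the inertia each fitted model stores.
inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Plotting these values against k should show the curve flattening once k reaches the number of generated centroids.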

GrahamStein commented 5 years ago

Unless anyone is close to completion, I'm at the PyCon 2019 Sprints and I'm going to take a look at adding in inertia_.

lwgray commented 5 years ago

@GrahamStein were you still interested in tackling this issue?

GrahamStein commented 5 years ago

I have it done, just need to clean it up a bit. Sadly I'm booked out working on life/house stuff for the remainder of the week.
