Open bbengfort opened 6 years ago
Hi quick question about the documentation. Should the changes be made to the .rst file AND the README? Or strictly just the .rst?
Hi there @lacanlale - thanks for checking out Scikit-Yellowbrick!
Yes, for enhancing the documentation of the KElbowVisualizer, the updates should go in the .rst files that generate the Sphinx docs (in this case, yellowbrick/docs/api/cluster/elbow.rst). As you can see, the current version of the elbow visualizer doc page is pretty thin on description of the use cases and workflow for doing k-selection, so there's a lot of potential for fleshing out the page with some more instructive paragraphs! You may also want to check out the section on k-elbow in the examples.ipynb document, which might have some useful content that could be repurposed here.
Hi thank you for the response, I will be taking time to look into it!
Hey I wanted to look into the inertia metric but I'm confused as to what to look for and the relevancy of what's linked. Can I have some clarification please?
@lacanlale sorry for the confusion, I'm happy to clarify with the story so far. Right now our elbow method uses 3 metrics: distortion score, Calinski-Harabasz, and silhouette. Of these, distortion scores are most commonly used to create elbows (in fact, there is some question about whether or not the other two metrics create elbows at all, but they do create sharp peaks, which are easy to identify visually).
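For anyone following along, here is a minimal sketch of how the three kinds of scores can be computed for a range of k using scikit-learn alone. The synthetic data, the range of k, and the `scores` dictionary are assumptions for illustration. KMeans' `inertia_` stands in for a distortion-like score here; this is not the visualizer's own implementation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data: an assumption for illustration only.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        # Distortion-like score: lower is better, tends to form an elbow.
        "inertia": model.inertia_,
        # Higher is better for both of these, so they tend to form peaks.
        "silhouette": silhouette_score(X, model.labels_),
        "calinski_harabasz": calinski_harabasz_score(X, model.labels_),
    }

for k, row in scores.items():
    print(k, {name: round(value, 3) for name, value in row.items()})
```

Plotting each column against k should show the difference in shape: a curve that flattens out for inertia, and a maximum for the other two.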
scikit-learn's KMeans implementation uses inertia (a distortion-like metric) to optimize its clusters, and stores that value in the estimator as inertia_, meaning that other implementations of the elbow visualizer simply use this property.
Here's the catch, though - this is only available on KMeans. As a result, we simply implemented our own distortion score metric. However, inertia is different enough that it might be useful, and it improves the performance of the visualizer since an extra computation is not needed, so we'd like to add it in as a (non-default) metric:
inertia_
inertia_
property
Hopefully that helps and provides some additional background. Let me know if you have any additional questions!
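To make the trade-off concrete, here is a rough sketch of the "use inertia_ when the estimator has it, otherwise compute a score ourselves" idea. The helper `cluster_score` is hypothetical and is not Yellowbrick's API; AgglomerativeClustering is used only as an example of a clusterer that exposes no inertia_ attribute.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs


def cluster_score(estimator, X):
    """Hypothetical helper: reuse a stored inertia_ if present,
    otherwise compute a distortion-like score from the labels."""
    if hasattr(estimator, "inertia_"):
        # No extra computation needed: the value was stored during fit.
        return float(estimator.inertia_)
    total = 0.0
    for label in np.unique(estimator.labels_):
        members = X[estimator.labels_ == label]
        # Sum of squared distances to the cluster mean.
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)


# Synthetic data: an assumption for illustration only.
X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
agglo = AgglomerativeClustering(n_clusters=3).fit(X)

print("KMeans:", cluster_score(kmeans, X))
print("Agglomerative:", cluster_score(agglo, X))
```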
Hi @lacanlale, thank you for contributing to Yellowbrick! KMeans tries to separate the samples into n groups of equal variance, minimizing inertia, which is also known as within-cluster sum-of-squares (WCSS). I highly recommend scikit-learn's documentation and I'm also tagging @mattharrison, who had originally commented on #91 about the possibility of adding inertia_ as a metric to the KElbowVisualizer, so he might be willing to chime in about how he uses it in his own work or explains it to his students.
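In case it helps to see the WCSS definition in code, the sketch below recomputes it by hand and compares it with the value KMeans stores. The data and the variable names (`sq_dists`, `wcss`) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: an assumption for illustration only.
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Squared distance from every sample to every center,
# shape (n_samples, n_clusters).
diffs = X[:, None, :] - model.cluster_centers_[None, :, :]
sq_dists = (diffs ** 2).sum(axis=2)

# WCSS: each sample contributes its squared distance to the closest center.
wcss = sq_dists.min(axis=1).sum()

print("manual WCSS:", wcss)
print("model.inertia_:", model.inertia_)
```

The two printed numbers should agree up to floating-point error.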
Thanks for the responses (both of you!) This clarifies a lot, and I'll definitely be looking into it!
Hi, regarding the stretch goal "Find example dataset with clear elbow curve demonstration to use in documentation": I found a dataset that works better for the Elbow Method. It is created with make_blobs and has 500 points with 8 features and 4 centroids.
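A dataset like the one described can be sketched roughly as follows. The `random_state`, the range of k, and the use of KMeans' `inertia_` as the plotted score are assumptions; the comment above only gives the make_blobs sizes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 500 points, 8 features, 4 centroids, as described above.
# random_state is an assumption, added only for reproducibility.
X, y = make_blobs(n_samples=500, n_features=8, centers=4, random_state=42)

inertias = {}
for k in range(2, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

With well-separated blobs, the drop in inertia should be large up to k=4 and small afterwards, which is the bend the elbow plot is meant to show.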
Unless anyone is close to completion, I'm at the PyCon 2019 Sprints and I'm going to take a look at adding in inertia_.
@GrahamStein were you still interested in tackling this issue?
I have it done, just need to clean it up a bit. Sadly I'm booked out working on life/house stuff for the remainder of the week.
On Sat, May 11, 2019 at 1:53 PM Larry Gray notifications@github.com wrote:
@GrahamStein https://github.com/GrahamStein were you still interested in tackling this issue?
The following tasks from #91 are left over to finalize the implementation of the KElbowVisualizer and improve it.
Note to contributors: items in the below checklist don't need to be completed in a single PR; if you see one that catches your eye, feel free to pick it off the list!
k
Stretch goals:
n_jobs argument (Note: this one is expert-level! See discussion here for more details)
The stretch goals can be converted to their own issues if not completed in this feature request.