DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0
4.29k stars 557 forks source link

Improvements to Silhouette Visualizer #516

Closed bbengfort closed 5 years ago

bbengfort commented 6 years ago

The following improvements to the Silhouette Visualizer are left over from #91:

Note to contributors: items in the below checklist don't need to be completed in a single PR; if you see one that catches your eye, feel to pick it off the list!

gokriznastic commented 6 years ago

@bbengfort for the third item in the checklist i.e.

Instead of hard fixing the limits of the X-axis from -1.0 to 1.0; be more flexible so that the visualizer has a better display (or give the user the option of setting the limits).

it would be counter-intuitive to provide that flexibility because Silhouette Scores by mathematical definition range from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

Please correct me if I did not understand this improvement as you intended.

rebeccabilbro commented 6 years ago

Hi @gokriznastic and thanks for your interest in Yellowbrick!

What @bbengfort is suggesting here is not to change the scale along which silhouette scores are computed (which is indeed from -1 to 1), but to change the way that our SilhouetteScore visualizer scales the axes of the plot. In other words, if the visualizer computes the silhouette scores for our clusters, and they fall between -0.7 and 0.6, it should adjust the axes of the plot accordingly. One strategy would be to do something like...

ax.xlim([min(self.silhouette_samples_), max(self.silhouette_samples_)])

...in L198 (though some additional padding on either side might be necessary, so this might take a bit of plotting experimentation). Would you be interested in working on this?

gokriznastic commented 6 years ago

Oh okay! Thank you for the clarification @rebeccabilbro

I am totally new to open source contributions but I would definitely like to give this a shot! I'll do some plotting and look at what works best for this visualizer.

Also, for the second item in checklist i.e. finding a real world example, do you need some real world dataset with well-defined clusters, to see how the Silhouette Visualizer works on it?

Thank you again.

rebeccabilbro commented 6 years ago

Awesome, so glad that you're game to work on this one @gokriznastic!

Yes, for the second item on the checklist, we could really use a real world dataset that has nicely defined clusters -- it can be something from the UCI machine learning repository or from a project you've worked on, so long as it can be shared openly with all users. Ideas welcome!

gokriznastic commented 6 years ago

@rebeccabilbro thank you for the chance!

The first example that comes from the top of my mind would be the Iris Flower Dataset. It can be found here: https://archive.ics.uci.edu/ml/datasets/iris

Here is an example silhouette visualization I came across while working with it previously: unnamed-chunk-15-1

Although this particular dataset has its own challenges due to the fact that only 1 out of 3 clusters are linearly seperable. What I'll do is see how the dataset performs on Yellowbrick's visualizer. Also, I'll keep looking for other well clustered datasets and let you guys know the results.

gokriznastic commented 6 years ago

Hey @rebeccabilbro @bbengfort being new to open-source am really facing some problems in testing the code. What is the most efficient way to test the changes I just made in my local branch? Like for example I edited yellowbrick/cluster/silhouette.py . Now how do I test if the new visualizer is working as I expect it to, before making a pull request.

rebeccabilbro commented 6 years ago

Hi @gokriznastic, great question! There are a couple ways to go about experimenting with new prototypes of YB visualizers, but my preferred way is to do it is by pasting the code into a Jupyter notebook. Here's an example of a notebook I was experimenting with when I was trying to fix a bug in our ROCAUC visualizer a couple months ago. You could do something similar, starting with

import os
import sys 

# Modify the path 
sys.path.append("..")

import yellowbrick as yb
import matplotlib.pyplot as plt 

and then pasting in your new version of the SilhouetteVisualizer class. Then you can test it out with a dataset to see if it's plotting the way you expect it to. Hope this helps and let me know if I can answer any other questions!

gokriznastic commented 6 years ago

@rebeccabilbro thanks a lot! this worked. I have added an example of real-world dataset for clustering in my PR #639 Please have a look!

Also, I tried your suggestion of replacing the following line: https://github.com/DistrictDataLabs/yellowbrick/blob/eac4e54574f9f09633331a2b82abfc45b04822a5/yellowbrick/cluster/silhouette.py#L198

by ax.xlim([min(self.silhouette_samples_), max(self.silhouette_samples_)]) but that creates some further visualization problems that need some more experimentation to solve. Though I'm positive it can be figured out.

gokriznastic commented 6 years ago

@rebeccabilbro I have the PR ready for item 3 on checklist. In my opinion instead of hard fixing, letting the visualizer decide the range gives more flexibility indeed; but giving the user the option of setting the limits doesn't seem right as Silhouette Scores are constrained by certain values. I would like to know your opinions on this.

Also, I have added appropriate comments wherever I've changed the code, but in case you have any query regarding why something has been done please ask.

Edit: seems like my PR is failing tests, can you tell me how do I fix such issues and check where the code went wrong?

rebeccabilbro commented 6 years ago

Ok, thank you @gokriznastic! I will take a look and get back to you in the next few days!

gokriznastic commented 5 years ago

@bbengfort what kind of label due you want for the red axvline? Should it display the average Silhouette score?

Also, you can tick the 2nd and 3rd item from the checklist if you like.

bbengfort commented 5 years ago

@gokriznastic glad to hear you're willing to keep going with the updates to this visualizer. The label should indicate the value of the SilhouetteVisualizer.silhouette_score_ property, which is what is plotted by the red axvline: L178. This may be as simple as just adding a label="" keyword argument to the axvline; it'll just be a matter of how we communicate this score. Note that if you need to add math to the matplotlib figure, you can format the label with $ as in $S_i=0.2$ will render the latex math. Not sure if this is necessary or not, but perhaps there is a symbol for the mean silhouette score.

As for tick box 2 - I appreciate the Iris data set, but I was hoping for something a bit more unique to us.

Also, let's not forget the user specified limits to the figure!

Thanks again!

gokriznastic commented 5 years ago

@bbengfort thanks for the explanation. I'll try working on these one by one in my next PR. And yes I remember the user defined limits to the figure.

Thanks again!

mgarod commented 5 years ago

I'm looking into parts of this issue for pycon sprints.

Update: Cluster labels to the y-axis PR: https://github.com/DistrictDataLabs/yellowbrick/pull/833

mgarod commented 5 years ago

Update: Color customization PR: https://github.com/DistrictDataLabs/yellowbrick/pull/837

mgarod commented 5 years ago

Ran into a problem with tests on the color customization PR. https://github.com/DistrictDataLabs/yellowbrick/pull/837#issuecomment-489758466

The tests have failed because this PR is now using matplotlib's version of the Set1 color map, and no longer yellowbrick's set1 color map found in styles/palettes.py.

Need clarification if this is a matter of simply adjusting the tests to match the new logic, or if this is a breaking change that should be corrected.

Is the codebase moving away from palette.py and towards colors.py? I assumed we were with this TODO https://github.com/DistrictDataLabs/yellowbrick/blob/develop/yellowbrick/cluster/silhouette.py#L151

Update: A fix has been found and pushed to the PR branch.

mgarod commented 5 years ago

Highlighting an issue raised with @lwgray for PR #833 moving labels to y-axis, copied from that PR:

The tests do not show axis labels currently: current_test_image

These changes make the test image appear to have removed the cluster label entirely PR_image

ANSWER: Exclusion of using finalize() in the tests is intended as can be seen in tests/base.py

mgarod commented 5 years ago

Update: PR to add legend describing red dashed line #839

mgarod commented 5 years ago

Bullet point Add a legend/annotation that describes the average clustering coefficient (e.g. label the red axvline) has been addressed with #839 merged

badri-thinker commented 4 years ago

Hi @gokriznastic and thanks for your interest in Yellowbrick!

What @bbengfort is suggesting here is not to change the scale along which silhouette scores are computed (which is indeed from -1 to 1), but to change the way that our SilhouetteScore visualizer scales the axes of the plot. In other words, if the visualizer computes the silhouette scores for our clusters, and they fall between -0.7 and 0.6, it should adjust the axes of the plot accordingly. One strategy would be to do something like...

ax.xlim([min(self.silhouette_samples_), max(self.silhouette_samples_)])

...in L198 (though some additional padding on either side might be necessary, so this might take a bit of plotting experimentation). Would you be interested in working on this?

Is there anyway to add cluster labels onto the silhouette plot? I have named clusters but would like to plot silhouette scores for each embedding and compare

badri-thinker commented 4 years ago

is there any way to add cluster labels onto silhouette plot

bbengfort commented 4 years ago

@badri-thinker there are a couple of ways to add cluster labels manually by accessing the viz.ax property directly. You can either use viz.ax.set_yticklabels, ensuring that there is a y tick in the center of every cluster (so a bit of computation is required here); or you can use text annotations to add text labels directly to the plot wherever you would like.

Does that help? If you need more specific advice, I recommend posting a question on StackOverflow, tagged with Yellowbrick and some example code.