Closed bbengfort closed 5 years ago
@bbengfort for the third item in the checklist i.e.
Instead of hard fixing the limits of the X-axis from -1.0 to 1.0; be more flexible so that the visualizer has a better display (or give the user the option of setting the limits).
it would be counter-intuitive to provide that flexibility because Silhouette Scores by mathematical definition range from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
Please correct me if I did not understand this improvement as you intended.
Hi @gokriznastic and thanks for your interest in Yellowbrick!
What @bbengfort is suggesting here is not to change the scale along which silhouette scores are computed (which is indeed from -1 to 1), but to change the way that our SilhouetteScore
visualizer scales the axes of the plot. In other words, if the visualizer computes the silhouette scores for our clusters, and they fall between -0.7 and 0.6, it should adjust the axes of the plot accordingly. One strategy would be to do something like...
ax.xlim([min(self.silhouette_samples_), max(self.silhouette_samples_)])
...in L198 (though some additional padding on either side might be necessary, so this might take a bit of plotting experimentation). Would you be interested in working on this?
Oh okay! Thank you for the clarification @rebeccabilbro
I am totally new to open source contributions but I would definitely like to give this a shot! I'll do some plotting and look at what works best for this visualizer.
Also, for the second item in checklist i.e. finding a real world example, do you need some real world dataset with well-defined clusters, to see how the Silhouette Visualizer works on it?
Thank you again.
Awesome, so glad that you're game to work on this one @gokriznastic!
Yes, for the second item on the checklist, we could really use a real world dataset that has nicely defined clusters -- it can be something from the UCI machine learning repository or from a project you've worked on, so long as it can be shared openly with all users. Ideas welcome!
@rebeccabilbro thank you for the chance!
The first example that comes from the top of my mind would be the Iris Flower Dataset
.
It can be found here: https://archive.ics.uci.edu/ml/datasets/iris
Here is an example silhouette visualization I came across while working with it previously:
Although this particular dataset has its own challenges due to the fact that only 1 out of 3 clusters are linearly seperable. What I'll do is see how the dataset performs on Yellowbrick's visualizer. Also, I'll keep looking for other well clustered datasets and let you guys know the results.
Hey @rebeccabilbro @bbengfort being new to open-source am really facing some problems in testing the code. What is the most efficient way to test the changes I just made in my local branch?
Like for example I edited yellowbrick/cluster/silhouette.py
. Now how do I test if the new visualizer is working as I expect it to, before making a pull request.
Hi @gokriznastic, great question! There are a couple ways to go about experimenting with new prototypes of YB visualizers, but my preferred way is to do it is by pasting the code into a Jupyter notebook. Here's an example of a notebook I was experimenting with when I was trying to fix a bug in our ROCAUC visualizer a couple months ago. You could do something similar, starting with
import os
import sys
# Modify the path
sys.path.append("..")
import yellowbrick as yb
import matplotlib.pyplot as plt
and then pasting in your new version of the SilhouetteVisualizer
class. Then you can test it out with a dataset to see if it's plotting the way you expect it to. Hope this helps and let me know if I can answer any other questions!
@rebeccabilbro thanks a lot! this worked. I have added an example of real-world dataset for clustering in my PR #639 Please have a look!
Also, I tried your suggestion of replacing the following line: https://github.com/DistrictDataLabs/yellowbrick/blob/eac4e54574f9f09633331a2b82abfc45b04822a5/yellowbrick/cluster/silhouette.py#L198
by ax.xlim([min(self.silhouette_samples_), max(self.silhouette_samples_)])
but that creates some further visualization problems that need some more experimentation to solve. Though I'm positive it can be figured out.
@rebeccabilbro I have the PR ready for item 3 on checklist. In my opinion instead of hard fixing, letting the visualizer decide the range gives more flexibility indeed; but giving the user the option of setting the limits doesn't seem right as Silhouette Scores are constrained by certain values. I would like to know your opinions on this.
Also, I have added appropriate comments wherever I've changed the code, but in case you have any query regarding why something has been done please ask.
Edit: seems like my PR is failing tests, can you tell me how do I fix such issues and check where the code went wrong?
Ok, thank you @gokriznastic! I will take a look and get back to you in the next few days!
@bbengfort what kind of label due you want for the red axvline? Should it display the average Silhouette score?
Also, you can tick the 2nd and 3rd item from the checklist if you like.
@gokriznastic glad to hear you're willing to keep going with the updates to this visualizer. The label should indicate the value of the SilhouetteVisualizer.silhouette_score_
property, which is what is plotted by the red axvline: L178. This may be as simple as just adding a label=""
keyword argument to the axvline; it'll just be a matter of how we communicate this score. Note that if you need to add math to the matplotlib figure, you can format the label with $ as in $S_i=0.2$
will render the latex math. Not sure if this is necessary or not, but perhaps there is a symbol for the mean silhouette score.
As for tick box 2 - I appreciate the Iris data set, but I was hoping for something a bit more unique to us.
Also, let's not forget the user specified limits to the figure!
Thanks again!
@bbengfort thanks for the explanation. I'll try working on these one by one in my next PR. And yes I remember the user defined limits to the figure.
Thanks again!
I'm looking into parts of this issue for pycon sprints.
Update: Cluster labels to the y-axis PR: https://github.com/DistrictDataLabs/yellowbrick/pull/833
Update: Color customization PR: https://github.com/DistrictDataLabs/yellowbrick/pull/837
Ran into a problem with tests on the color customization PR. https://github.com/DistrictDataLabs/yellowbrick/pull/837#issuecomment-489758466
The tests have failed because this PR is now using matplotlib's version of the Set1 color map, and no longer yellowbrick's set1 color map found in styles/palettes.py.
Need clarification if this is a matter of simply adjusting the tests to match the new logic, or if this is a breaking change that should be corrected.
Is the codebase moving away from palette.py
and towards colors.py
?
I assumed we were with this TODO
https://github.com/DistrictDataLabs/yellowbrick/blob/develop/yellowbrick/cluster/silhouette.py#L151
Update: A fix has been found and pushed to the PR branch.
Highlighting an issue raised with @lwgray for PR #833 moving labels to y-axis, copied from that PR:
The tests do not show axis labels currently:
These changes make the test image appear to have removed the cluster label entirely
ANSWER: Exclusion of using finalize()
in the tests is intended as can be seen in tests/base.py
Update: PR to add legend describing red dashed line #839
Bullet point Add a legend/annotation that describes the average clustering coefficient (e.g. label the red axvline)
has been addressed with #839 merged
Hi @gokriznastic and thanks for your interest in Yellowbrick!
What @bbengfort is suggesting here is not to change the scale along which silhouette scores are computed (which is indeed from -1 to 1), but to change the way that our
SilhouetteScore
visualizer scales the axes of the plot. In other words, if the visualizer computes the silhouette scores for our clusters, and they fall between -0.7 and 0.6, it should adjust the axes of the plot accordingly. One strategy would be to do something like...ax.xlim([min(self.silhouette_samples_), max(self.silhouette_samples_)])
...in L198 (though some additional padding on either side might be necessary, so this might take a bit of plotting experimentation). Would you be interested in working on this?
Is there anyway to add cluster labels onto the silhouette plot? I have named clusters but would like to plot silhouette scores for each embedding and compare
is there any way to add cluster labels onto silhouette plot
@badri-thinker there are a couple of ways to add cluster labels manually by accessing the viz.ax
property directly. You can either use viz.ax.set_yticklabels
, ensuring that there is a y tick in the center of every cluster (so a bit of computation is required here); or you can use text annotations to add text labels directly to the plot wherever you would like.
Does that help? If you need more specific advice, I recommend posting a question on StackOverflow, tagged with Yellowbrick and some example code.
The following improvements to the Silhouette Visualizer are left over from #91:
Note to contributors: items in the below checklist don't need to be completed in a single PR; if you see one that catches your eye, feel to pick it off the list!
yellowbrick.datasets
module - perhaps this should be a separate issue?).