Open lwgray opened 6 years ago
If no one else is working on this, I'd be happy to start making some of the updates that Larry suggested.
@DistrictDataLabs/team-oz-maintainers
I started to update examples.ipynb today and I noticed that it is already quite long, and that does not even include all of the additional visualizers that I need to add. So I think it might be worth considering breaking apart the examples into separate notebooks, such as classification, regression, clustering, feature analysis, model selection, and text modeling.
Once these notebooks are created, we could then always create a "Greatest Hits" type of notebook that contains some of our favorites from each group (and "Greatest Hits, Vol. 2" if we have a hard time selecting our favorites).
Of course, if anyone has any suggestions or different ideas, please just let me know! @DistrictDataLabs/team-oz
@Kautumn06 that would be fine with me, just keep in mind that we have Binder set up, and we'd want to configure it to point to all of these notebooks.
@Kautumn06 Thanks for tackling this 😄
@Kautumn06 @lwgray - are you guys still working on this? @Kautumn06 if you are working on sub-notebooks I'm happy to start on a few!
@joelvanveluwen Thank you for your interest! Yes, I am still working on breaking the original into smaller, more focused notebooks.
I just saw that you offered to work on the Model Selection Tutorial (which is great!), so why don't you check back in with me after that's finished and then, if you're still interested, you can help me with these?
Minor update to examples.ipynb that was raised in this comment: https://github.com/DistrictDataLabs/yellowbrick/pull/479#issuecomment-405290878
Quick update—the notebook of regression examples is finished and was merged earlier this week. Also, I should have the clustering notebook completed by this weekend.
c1a939d only partially closed 470. reopening.
NOTE: The suggestion above, to clean up the aesthetics of the example notebooks by using this warning suppression code, works, but only in some cases.
import warnings
warnings.filterwarnings('ignore')
The matplotlib warning described in issue #803 is not suppressed by this snippet. Nor is it suppressed if you use any of these arguments in place of 'ignore': "error", "ignore", "always", "default", "module", or "once".
Unless someone else is gonna dive into this, I plan on cleaning up the example notebooks:
@Kautumn06 I have spent some time trying to pull out the Text visualizer example. There are a number of complexities.
As it stands, the first two TSNEVisualizer examples show the same labels (books, cinema, cooking, gaming, sports). The first example should show all of them. The second example should only show three of them, since it focuses on just three classes (sports, cinema, gaming).
There appears to be some logic in the TSNEVisualizer to create labels, but I think the logic is faulty: I don't think it correlates the classes to the labels in the right way. I'm still trying to decipher it.
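from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer
# load_corpus: corpus-loading helper, assumed to be defined as in the docs example
corpus = load_corpus('hobbies')
tfidf = TfidfVectorizer()
docs = tfidf.fit_transform(corpus.data)
labels = corpus.target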
# Create the visualizer and draw the vectors
tsne = TSNEVisualizer()
tsne.fit(docs, labels)
tsne.poof()
versus
# Only visualize the sports, cinema, and gaming classes
tsne = TSNEVisualizer(classes=['sports', 'cinema', 'gaming'])
tsne.fit(docs, labels)
tsne.poof()
Also... the documentation for the TSNEVisualizer doesn't appear to even mention the option of providing classes.
...
decompose_by : int, default: 50
    Specify the number of components for preliminary decomposition, by
    default this is 50; the more components, the slower TSNE will be.
labels : list of strings
    The names of the classes in the target, used to create a legend.
    Labels must match names of classes in sorted order.
colors : list or tuple of colors
    Specify the colors for each individual class
...
This is getting complicated fast.
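One way to sidestep the question of how classes and labels interact, offered only as a sketch: subset the raw documents to the wanted classes before vectorizing, so the visualizer never sees the other targets. The helper name filter_by_class and the toy data are hypothetical, and this does not describe how TSNEVisualizer handles classes internally.

```python
def filter_by_class(documents, targets, keep):
    """Return only the documents (and their targets) whose target is in keep.

    Hypothetical helper for illustration; not part of Yellowbrick.
    """
    keep = set(keep)
    pairs = [(doc, target) for doc, target in zip(documents, targets) if target in keep]
    kept_docs = [doc for doc, _ in pairs]
    kept_targets = [target for _, target in pairs]
    return kept_docs, kept_targets


# Toy stand-ins for corpus.data and corpus.target
documents = ["a novel", "a film", "a recipe", "a console", "a match"]
targets = ["books", "cinema", "cooking", "gaming", "sports"]

subset_docs, subset_targets = filter_by_class(
    documents, targets, keep=["sports", "cinema", "gaming"]
)
print(subset_docs)
print(subset_targets)
```

The filtered lists could then be vectorized and passed to fit in place of the full corpus.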
Thank you @chalmerlowe for coming to our sprint and for helping us with this issue! I'll take a look at this tonight since I really want to make sure that our example notebooks are up-to-date and helpful for new Yellowbrick users. Thanks again for jumping in and helping us make progress on this issue!
@Kautumn06 I would say that the primary issue is that the documentation does not mention classes, NOR does the visualizer appear to correctly handle the situation where labels OR classes OR both are provided. There appears to be some code to try to handle this situation, but a cursory examination left me unconvinced that it does all the right things.
Hi @chalmerlowe, I think you may find it helpful to refer to the development version of our documentation, since we've updated the sample code in our docs since our last release. The TSNEVisualizer example you mentioned had confused others as well, since labels actually stores y (the target), not the labels (which the fit method can derive from y). However, the example was fixed in PR #684 by replacing it with the following:
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer
from yellowbrick.datasets import load_hobbies
# Load the data and create document vectors
corpus = load_hobbies()
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus.data)
y = corpus.target
# Create the visualizer and draw the vectors
tsne = TSNEVisualizer()
tsne.fit(X, y)
tsne.poof()
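To illustrate the distinction between the target and the labels, here is a sketch of deriving legend labels from y. It relies only on the docstring's statement that labels must match the class names in sorted order; whether fit computes them exactly this way is an assumption, and derive_labels is a hypothetical name.

```python
def derive_labels(y):
    """Return the distinct class names found in y, in sorted order.

    Hypothetical helper; not Yellowbrick's implementation.
    """
    return sorted(set(y))


# Toy target vector standing in for corpus.target
y = ["sports", "cinema", "gaming", "cinema", "books", "sports", "cooking"]
labels = derive_labels(y)
print(labels)  # ['books', 'cinema', 'cooking', 'gaming', 'sports']
```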
As for suppressing the warnings, I took a closer look at the earlier suggestion and the reason it didn't work is that the user would need to add that code to their ~/.ipython/profile_default/startup/disable-warnings.py file, rather than run it directly in a Jupyter notebook cell. However, I don't think we would want to ask users to do this, so I will keep looking for a better solution.
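A notebook-local alternative, as a sketch rather than a tested fix for #803: scope the suppression to a single call with the standard library's warnings.catch_warnings, and optionally raise the level of a named logger for the duration of the call. The second part matters because warnings filters only affect messages issued through the warnings module; a message emitted through logging is untouched by them. Whether that explains the matplotlib message in #803 is an assumption. The helper run_quietly, the function noisy_draw, and the logger name demo.plotting are all hypothetical.

```python
import logging
import warnings


def run_quietly(func, *args, logger_name=None, **kwargs):
    """Call func with warnings ignored; optionally mute one logger below ERROR.

    Hypothetical helper: filters and the logger level are restored afterwards.
    """
    logger = logging.getLogger(logger_name) if logger_name is not None else None
    previous_level = logger.level if logger is not None else None
    try:
        if logger is not None:
            logger.setLevel(logging.ERROR)
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            return func(*args, **kwargs)
    finally:
        if logger is not None:
            logger.setLevel(previous_level)


def noisy_draw():
    """Stand-in for a visualizer call that is noisy in two different ways."""
    warnings.warn("demo warning", UserWarning)
    logging.getLogger("demo.plotting").warning("demo log message")
    return "drawn"


result = run_quietly(noisy_draw, logger_name="demo.plotting")
print(result)
```

Because the suppression is limited to one call, warnings raised elsewhere in the notebook stay visible.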
The warning from Anscombe's Quartet (anscombe.py, see #803) is connected with the color palette. See the issue for a discussion of the proposed solution, to express the color cycle and/or color palettes in hex. It would definitely take care of the warning. However, it would need to be approved before it is implemented.
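As a sketch of what expressing a palette in hex could look like, the snippet below converts RGB tuples with components in the range 0 to 1 into hex strings. The helper rgb_to_hex and the sample palette are hypothetical, not Yellowbrick's actual palette code; matplotlib also provides matplotlib.colors.to_hex for the same purpose.

```python
def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of floats in [0, 1] to a '#rrggbb' string."""
    # Clamp each component to [0, 1], then scale to an integer in 0..255.
    r, g, b = (round(max(0.0, min(1.0, c)) * 255) for c in rgb)
    return "#{:02x}{:02x}{:02x}".format(r, g, b)


# Hypothetical palette expressed as RGB tuples
palette = [(0.0, 0.0, 1.0), (0.2, 0.4, 0.6), (1.0, 1.0, 1.0)]
hex_palette = [rgb_to_hex(color) for color in palette]
print(hex_palette)  # ['#0000ff', '#336699', '#ffffff']
```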
@Kautumn06 you did a lot of work to break this into multiple notebooks. Do you have a sense of when we could close this issue?
Hi @bbengfort — yes, I did break out several of the examples into smaller notebooks but I'd still leave it open for now since I'd like to update the examples with our datasets module. Thank you!
Hi @bbengfort, I started updating the examples.ipynb notebook this morning with the new datasets module and adding some of our new visualizers, and I'm starting to think that it may be worth considering actually just deleting it altogether once the new individual notebooks are complete.
At this point, it has grown so long that I'm not sure how helpful it would be to new users. In addition, if someone simply tries to run the entire notebook they may get frustrated at how long it takes to complete because of a couple of visualizers, such as PCA and Manifold. So it may be a better use of our time to focus only on the smaller notebooks that target our individual modules.
This is just an idea of course so it may be that some of our other maintainers and core contributors feel differently!
@Kautumn06 I'm definitely open to revamping the examples notebook in general. Because we have the binder demo, I think we can easily organize into smaller notebooks. Do you want to take that on in the fall semester so we can get this issue closed?
Hi @bbengfort, yes, I can work on breaking examples.ipynb out into smaller notebooks, and then once those are complete, we can remove the examples.ipynb notebook.
The pull request (#970) I opened today contains some minor updates to the clustering.ipynb and regression.ipynb notebooks I created last year. I also renamed them to clustering_visualizers and regression_visualizers so that new users who come to our examples directory will be able to quickly tell what they contain. So with those two completed, the following notebooks would still need to be created:
classification_visualizers.ipynb
feature_visualizers.ipynb
model_selection_visualizers.ipynb
target_visualizers.ipynb
text_visualizers.ipynb
Please just let me know if this sounds like a good way forward to closing out this issue.
This sounds like an excellent way to move forward, thank you!
I noticed that the examples.ipynb was out-of-date.
Here are the things that need to be changed/updated.
[ ] Include new visualizers
[x] Remove deprecated visualizers (e.g., ScatterVisualizer)
[ ] When the visualizers are executed, some of them raise warnings. For the aesthetics of the notebook, we need to turn off all warnings. See example code below