DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

Update examples.ipynb to include new Visualizers #470

Open lwgray opened 6 years ago

lwgray commented 6 years ago

I noticed that the examples.ipynb was out-of-date.
Here are the things that need to be changed/updated.

import warnings
warnings.filterwarnings('ignore')

@DistrictDataLabs/team-oz-maintainers

Kautumn06 commented 6 years ago

If no one else is working on this, I'd be happy to start making some of the updates that Larry suggested.

@DistrictDataLabs/team-oz-maintainers

Kautumn06 commented 6 years ago

I started to update examples.ipynb today and I noticed that it is already quite long, and that does not even include all of the additional visualizers that I need to add. So I think it might be worth considering breaking apart the examples into separate notebooks, such as classification, regression, clustering, feature analysis, model selection, and text modeling.

Once these notebooks are created, we could then always create a "Greatest Hits" type of notebook that contains some of our favorites from each group (and "Greatest Hits, Vol. 2" if we have a hard time selecting our favorites).

Of course, if anyone has any suggestions or different ideas, please just let me know! @DistrictDataLabs/team-oz

bbengfort commented 6 years ago

@Kautumn06 that would be fine with me, just keep in mind that we have Binder set up, and we'd want to configure it to point to all of these notebooks.

lwgray commented 6 years ago

@Kautumn06 Thanks for tackling this 😄

joelvanveluwen-zz commented 6 years ago

@Kautumn06 @lwgray - are you guys still working on this? @Kautumn06 if you are working on sub-notebooks I'm happy to start on a few!

Kautumn06 commented 6 years ago

@joelvanveluwen Thank you for your interest! Yes, I am still working on breaking the original into smaller, more focused notebooks.

I just saw that you offered to work on the Model Selection Tutorial (which is great!), so why don't you check back in with me after that's finished and then, if you're still interested, you can help me with these?

ndanielsen commented 6 years ago

Minor update to examples.ipynb that was raised in this comment: https://github.com/DistrictDataLabs/yellowbrick/pull/479#issuecomment-405290878

Kautumn06 commented 6 years ago

Quick update—the notebook of regression examples is finished and was merged earlier this week. Also, I should have the clustering notebook completed by this weekend.

wagner2010 commented 6 years ago

c1a939d only partially closed #470. Reopening.

chalmerlowe commented 5 years ago

NOTE: The suggestion above, to clean up the aesthetics of the example notebooks with this warning suppression code, works only in some cases.

import warnings
warnings.filterwarnings('ignore')

The matplotlib warning described in issue #803 is not suppressed by this snippet. Nor is it suppressed if you use any of the other filter actions in place of "ignore": "error", "always", "default", "module", or "once".
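
One possible explanation, offered here as an assumption rather than something confirmed in this thread, is that the message is emitted through Python's logging module instead of the warnings module; warnings filters have no effect on log records. A minimal sketch of the difference, using a made-up logger name (hypothetical_plotting_lib is not a real library logger):

```python
import io
import logging
import warnings

# Hypothetical logger standing in for a plotting library's internal logger.
logger = logging.getLogger("hypothetical_plotting_lib")
logger.setLevel(logging.WARNING)
logger.propagate = False          # keep the demo output out of the root logger
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))


def emit_both():
    """Emit one message through warnings and one through logging."""
    warnings.warn("sent through warnings.warn", UserWarning)
    logger.warning("sent through logging")


# 1. The warnings filter silences warnings.warn but not the log record.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore")
    emit_both()
log_with_filter_only = stream.getvalue()

# 2. Raising the logger's level silences the log record as well.
stream.seek(0)
stream.truncate(0)
logger.setLevel(logging.ERROR)
with warnings.catch_warnings(record=True) as caught_again:
    warnings.simplefilter("ignore")
    emit_both()
log_with_level_raised = stream.getvalue()
```

If the message in question does come from logging, raising the level of the emitting logger would be the counterpart of the warnings filter; which logger that is would need to be checked against the library's source.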

chalmerlowe commented 5 years ago

Unless someone else is gonna dive into this, I plan on cleaning up the example notebooks:

chalmerlowe commented 5 years ago

@Kautumn06 I have spent some time trying to pull out the Text visualizer example. There are a number of complexities.

  1. the TSNEVisualizer does not appear to handle labels the way we might desire
  2. the matplotlib color mapping issue is blowing up the example with warnings (see my comment above)

As it stands, the first two TSNEVisualizer examples show the same labels (books, cinema, cooking, gaming, sports). The first example should show all of them. The second example should show only three of them, since it focuses on just three classes (sports, cinema, gaming).

There appears to be some logic in the TSNEVisualizer to create labels, but I think the logic is faulty: I don't think it correlates the classes to the labels in the right way. I'm still trying to decipher it.

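from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer

# load_corpus is assumed to be the helper defined in the examples notebook
corpus = load_corpus('hobbies')
tfidf  = TfidfVectorizer()
docs   = tfidf.fit_transform(corpus.data)
labels = corpus.target

# Create the visualizer and draw the vectors
tsne = TSNEVisualizer()
tsne.fit(docs, labels)
tsne.poof()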

versus

# Only visualize the sports, cinema, and gaming classes 
tsne = TSNEVisualizer(classes=['sports', 'cinema', 'gaming'])
tsne.fit(docs, labels)
tsne.poof()

Also... the documentation for the TSNEVisualizer doesn't appear to even mention the option of providing classes.

    ...
    decompose_by : int, default: 50
        Specify the number of components for preliminary decomposition, by
        default this is 50; the more components, the slower TSNE will be.

    labels : list of strings
        The names of the classes in the target, used to create a legend.
        Labels must match names of classes in sorted order.

    colors : list or tuple of colors
        Specify the colors for each individual class
    ...

This is getting complicated fast.
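
The expectation described above (legend entries in sorted order, and only the requested classes shown when a subset is given) can be sketched in plain Python. This is an illustration of the intended behaviour only, not the TSNEVisualizer's actual logic; the target list and the variable names are hypothetical.

```python
# Hypothetical targets: one class name per document.
y = ["sports", "cinema", "books", "gaming", "cooking", "sports", "cinema"]

# With no subset requested, the legend would list every class in sorted order.
all_classes = sorted(set(y))

# With a subset requested, only matching documents should be kept ...
wanted = {"sports", "cinema", "gaming"}
keep = [i for i, label in enumerate(y) if label in wanted]

# ... and the legend should list just those classes, again in sorted order.
subset_classes = sorted({y[i] for i in keep})
```

Under this reading, the second example in the notebook would show a three-entry legend (cinema, gaming, sports) rather than all five hobbies.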

Kautumn06 commented 5 years ago

Thank you @chalmerlowe for coming to our sprint and for helping us with this issue! I'll take a look at this tonight since I really want to make sure that our example notebooks are up-to-date and helpful for new Yellowbrick users. Thanks again for jumping in and helping us make progress on this issue!

chalmerlowe commented 5 years ago

@Kautumn06 I would say that the primary issue is:

Kautumn06 commented 5 years ago

Hi @chalmerlowe, I think you may find it helpful to refer to the development version of our documentation, since we've updated the sample code in our docs since our last release. The TSNEVisualizer example you mentioned had confused others as well, since labels actually stores y (the target), not the labels (which the fit method can derive from y). However, the example was fixed in PR #684 by replacing it with the following:

from sklearn.feature_extraction.text import TfidfVectorizer

from yellowbrick.text import TSNEVisualizer
from yellowbrick.datasets import load_hobbies

# Load the data and create document vectors
corpus = load_hobbies()
tfidf = TfidfVectorizer()

X = tfidf.fit_transform(corpus.data)
y = corpus.target

# Create the visualizer and draw the vectors
tsne = TSNEVisualizer()
tsne.fit(X, y)
tsne.poof()

As for suppressing the warnings, I took a closer look at the earlier suggestion, and the reason it didn't work is that the user would need to add that code to their ~/.ipython/profile_default/startup/disable-warnings.py file rather than run it directly in a Jupyter notebook cell. However, I don't think we would want to ask users to do this, so I will keep looking for a better solution.
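
One option that avoids touching a user's startup files is to scope the filter to the single call that is noisy, using the standard library's warnings.catch_warnings context manager. A minimal sketch follows; noisy_plot_call is a hypothetical stand-in for a visualizer call, and this only helps for messages that are actually raised through the warnings module.

```python
import warnings


def noisy_plot_call():
    """Hypothetical stand-in for a visualizer call that emits a warning."""
    warnings.warn("color argument is ambiguous", UserWarning)
    return "figure"


# Suppress UserWarning for this one call; the previous filters are
# restored automatically when the with-block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    result = noisy_plot_call()

# For comparison: record the same warning instead of hiding it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_plot_call()
```

Because the filter change is undone on exit, warnings raised elsewhere in the notebook stay visible.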

nickpowersys commented 5 years ago

The warning from Anscombe's Quartet (anscombe.py, see #803) is connected with the color palette. See the issue for a discussion of the proposed solution, to express the color cycle and/or color palettes in Hex. It would definitely take care of the warning. However, it would need to be approved before it is implemented.

bbengfort commented 5 years ago

@Kautumn06 you did a lot of work to break this into multiple notebooks. Do you have a sense of when we could close this issue?

Kautumn06 commented 5 years ago

Hi @bbengfort — yes, I did break out several of the examples into smaller notebooks but I'd still leave it open for now since I'd like to update the examples with our datasets module. Thank you!

Kautumn06 commented 5 years ago

Hi @bbengfort — I started updating the examples.ipynb notebook this morning with the new datasets module and adding some of our new visualizers and I'm starting to think that it may be worth considering actually just deleting it altogether once the new individual notebooks are complete.

At this point, it has grown so long that I'm not sure how helpful it would be to new users. In addition, anyone who tries to run the entire notebook may get frustrated at how long it takes to complete because of a couple of slow visualizers, such as PCA and Manifold. So it may be a better use of our time to focus only on the smaller notebooks that target our individual modules.

This is just an idea of course so it may be that some of our other maintainers and core contributors feel differently!

bbengfort commented 5 years ago

@Kautumn06 I'm definitely open to revamping the examples notebook in general. Because we have the binder demo, I think we can easily organize into smaller notebooks. Do you want to take that on in the fall semester so we can get this issue closed?

Kautumn06 commented 5 years ago

Hi @bbengfort — Yes, I can work on breaking examples.ipynb out into smaller notebooks, and then once those are complete, we can remove the examples.ipynb notebook.

The pull request (#970) I opened today contains some minor updates to the clustering.ipynb and regression.ipynb notebooks I created last year. I also renamed them to clustering_visualizers and regression_visualizers so that new users who come to our examples directory can quickly tell what they contain. With those two completed, the notebooks that would still need to be created are:

Please just let me know if this sounds like a good way forward to closing out this issue.

bbengfort commented 5 years ago

This sounds like an excellent way to move forward, thank you!