DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

RadViz fails to plot points if NaNs exist #302

Closed Aylr closed 6 years ago

Aylr commented 6 years ago

Ideas

Steps to Reproduce

  1. Find or create a dataset that has some null/nan values
  2. load a dataframe
  3. create the visualization.
  4. Note that the plot is created with no points. This isn't a great experience for a new user.

import numpy as np
from yellowbrick.features import RadViz

classes = list(df['ML_TargetFLG'].unique())
numeric_features = list(df.select_dtypes(include=[float, 'int32']).columns)

# .values returns the same ndarray as .as_matrix(), which newer pandas releases removed
x = df[numeric_features].values
y = np.where(df['ML_TargetFLG'] == 'Y', 1, np.where(df['ML_TargetFLG'] == 'optin', 2, 0))

vis = RadViz(classes=classes, features=numeric_features)
vis.fit(x, y)
vis.transform(x)
vis.poof()


bbengfort commented 6 years ago

@Aylr thanks for submitting the issue, it was certainly complete and well thought out!

A quick clarifying question: do all rows in your data set have at least one NaN or None value, or is it just some of the rows? I can understand why the computation does not plot rows in the data frame that have a single null value - in fact, I'm surprised that it didn't raise some extremely obtuse exception. However, I would have expected it to plot points that had all their data and simply ignore the rows with missing data (though that's probably even worse in terms of usability).

I've been giving some thought to your suggestions:

  1. Raise an exception - this is probably not the best solution if we can get the visualization to plot some of the points, but I'd like to know what matplotlib and pandas do in this case.
  2. Raise a warning - this is my preferred solution, more on this later.
  3. Automatically impute - there are many strategies for imputation, and Scikit-Learn does provide an Imputer transformer, but I think it is easier to expect the user to apply an imputer prior to the visualization step in the pipeline (in fact, this visualization is a good signal that they may need an imputer!). If we automatically impute, we could accidentally mislead the user into believing something about the data that isn't true.
  4. Miss something in the docs - I don't think we've encountered this situation yet, so no!
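To illustrate point 3, here is a minimal sketch of imputing before visualizing. It fills each NaN with its column mean in plain NumPy, as a stand-in for the mean strategy of scikit-learn's SimpleImputer (the successor to Imputer). The helper name impute_column_means is hypothetical and not part of Yellowbrick or scikit-learn.

```python
import numpy as np

def impute_column_means(X):
    """Hypothetical helper: replace each NaN with the mean of its column."""
    X = np.asarray(X, dtype=float).copy()
    # nanmean ignores NaNs; a column that is entirely NaN would stay NaN
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
X_filled = impute_column_means(X)
```

The filled array could then be passed to the visualizer's fit and transform in place of the original data. Note that mean imputation is only one strategy, which is exactly why doing it silently inside the visualizer could mislead.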

Here are my thoughts about how to deal with this:

  1. Find out what Pandas radviz does in this situation; we want to be similar enough that we don't throw folks off.
  2. Find out what matplotlib does on plt.plot([1, np.nan, 2, np.nan], [1, 2, np.nan, np.nan]): does it plot (1,1) and none of the other points, or does it raise an exception?
  3. Based on 1 and 2, create a strategy that raises a warning to the user when nan values are detected, then either plot a subset of the points or no points at all.
  4. Find out if this issue affects other feature visualizers like parallel coordinates
  5. Update the docs for RadViz and any other feature visualizers affected by the issue to let folks know how to interpret the result or the warning and how to solve, e.g. use an imputer from scikit-learn.

A nice addition to Yellowbrick after this would be to create a MissingValuesVisualizer or something like that, that goes through each column and lists the percentage of np.nan, None, (or supplied missing value). Sort of similar to a boxplot to show the distribution of columns in the data frame, but to highlight underfilled rows. This could also display an additional column - no. of rows missing one or more data elements.
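The numbers such a visualizer would display can be sketched as follows. This is only an assumption about what the computation might look like, not an existing Yellowbrick API; missing_summary is a hypothetical name, and only np.nan is treated as missing here.

```python
import numpy as np

def missing_summary(X):
    """Hypothetical helper: percent of NaN per column, plus the number of
    rows that are missing at least one value."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    pct_per_column = 100.0 * mask.mean(axis=0)
    rows_with_missing = int(mask.any(axis=1).sum())
    return pct_per_column, rows_with_missing

X = np.array([[1.0, np.nan], [np.nan, np.nan], [3.0, 4.0], [5.0, 6.0]])
pct, n_rows = missing_summary(X)
```

Each entry of pct would become one bar, and n_rows (or its percentage of the total) the additional column mentioned above.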

An extra complete solution to this issue would be to publish a blog post, "Finding and Imputing Missing Values with Scikit-Learn and Yellowbrick". It would highlight the issue you discovered here and go through how to use the feature visualizers to interpret the fact that missing values are present, then use various imputer strategies and reflect on how those strategies influence models with visual diagnostics.

Ok, so that's a lot - let me know what you think; we'd love a PR!

Aylr commented 6 years ago

Thanks for the detailed and thoughtful response!

It appears that it is only some of the rows that have null/nan/None values. None of the columns are fully null/nan/None.

I totally agree with your assessment that a warning (number 2 in the first list) is the best UX and most rigorous data-wise. Thanks for the response on each option.

I'll dig into the excellent suggestions 1-3 and 5 to keep the scope contained initially, then 4 can be investigated at another time/issue.

I'm having a hard time imagining the MissingValuesVisualizer and I'd like to see a quick sketch of what you are thinking regarding it. It is a good idea. If you haven't seen or tried the unreal pandas-profiling package, it is pure genius and the value to effort ratio is off the charts. One line generates a highly visual and interactive html report. I think this should be a separate issue for scope containment.

I appreciate your "extra complete" solution challenge, and will take you up on it on my team's blog once a fix is in!

Aylr commented 6 years ago

  1. Using the identical dataframe that causes this issue, pandas radviz plots successfully.
  2. matplotlib plots points that it can and ignores nans.
  3. Since these both appear to plot what they can, I think that this should behave the same. I do think a warning could be useful for new users, though possibly irritating to advanced users. I tend to err on the side of helping out new users.

I'll have an attempt at a PR to plot points that have data (non-null containing rows) and raise a warning.
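A rough sketch of that approach, assuming a standalone helper rather than the code that eventually landed in the PR: drop the rows that contain NaN, warn about how many were dropped, and plot the rest. The name drop_nan_rows and the warning text are hypothetical.

```python
import warnings
import numpy as np

def drop_nan_rows(X, y=None):
    """Hypothetical helper: remove rows of X containing NaN and warn if any
    were removed. The matching entries of y are removed as well."""
    X = np.asarray(X, dtype=float)
    nan_rows = np.isnan(X).any(axis=1)
    n_dropped = int(nan_rows.sum())
    if n_dropped:
        warnings.warn(
            f"{n_dropped} of {X.shape[0]} rows contain NaN and will not be "
            "plotted; consider imputing missing values first.",
            UserWarning,
        )
    keep = ~nan_rows
    if y is None:
        return X[keep]
    return X[keep], np.asarray(y)[keep]

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([0, 1, 2])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X_clean, y_clean = drop_nan_rows(X, y)
```

Using the standard warnings module means advanced users can silence the message with a warnings filter, while new users still see it by default.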

bbengfort commented 6 years ago

Awesome - thanks! I'll take a look at the PR later this week when I get a chance. I wanted to provide some quick test code in this issue since you're using a data set that I'm unfamiliar with:

Dataset generation:

import numpy as np

from functools import partial
from yellowbrick.features import RadViz 
from sklearn.datasets import make_classification

make_classification = partial(
    make_classification, 
    n_features=8, n_informative=8, n_redundant=0, n_repeated=0, n_classes=3, n_clusters_per_class=1
)

X, y = make_classification()

This gives us a quick, small, and fairly decent classification dataset. We can then add missing values as follows:

def make_missing(X, prob=0.1):
    """
    Fills X with np.nan values according to the specified probability. 
    """
    mask = np.random.rand(*X.shape) < prob 
    data = np.ma.array(X, mask=mask)
    return data.filled(np.nan)

Xm = make_missing(X)

Xm looks something like:

array([[-0.55322532, -0.12594172, -0.43628172,  0.2264543 ,  0.71704831,
         0.1440676 ,  2.43757131,  1.86737237],
       [ 0.33596098,  0.82147654,         nan,         nan,  1.38045263,
         1.0826561 ,  1.80021195, -0.8004254 ],
       [ 2.16903466,  1.04991704, -2.26115809,  0.89759197, -0.74898157,
         2.05476412, -2.10742346,         nan],
       [ 2.40884037, -1.05473497,  1.47406077,         nan,  0.24177879,
         4.08996519,  2.36670599, -0.25488042],
       [        nan, -1.64383729, -1.64765048, -0.91609236, -1.49084724,
        -0.15760646, -1.93852782,         nan],
       [ 2.02547757, -0.51120827,  0.83876765, -1.06591735,  2.07140745,
        -0.05072884,  0.37827661,  0.79013925],
       [ 4.27348958, -0.89094496,         nan, -1.55889069,  1.91152567,
         0.46875422,  0.40472012, -1.1402393 ],
       [-1.84336047, -1.30755628, -0.15463315,         nan, -0.31693837,
         2.74729702,  1.41851952,  2.0069317 ],
       [        nan,  2.7322347 , -2.0142525 , -0.57522097,  3.21226618,
         0.06455021,         nan,  1.51251459]])

I'll try to use this to sketch out the MissingValues plot

bbengfort commented 6 years ago

For missing values I was thinking of something simple like:

[Image: bar chart sketch]

But maybe conditioned by class (e.g. a different color bar for each class's percentage of missing values). And maybe add another bar: % of rows missing at least one value.
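The per-class variant could be computed along these lines. Again, this is only a sketch under the assumption that np.nan marks missing values; missing_by_class is a hypothetical name, not a Yellowbrick function.

```python
import numpy as np

def missing_by_class(X, y):
    """Hypothetical helper: for each class label, the percent of NaN values
    in every column among the rows belonging to that class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mask = np.isnan(X)
    return {
        label.item(): 100.0 * mask[y == label].mean(axis=0)
        for label in np.unique(y)
    }

X = np.array([[np.nan, 1.0], [np.nan, 2.0], [3.0, 4.0], [5.0, np.nan]])
y = np.array([0, 0, 1, 1])
by_class = missing_by_class(X, y)
```

Each class's array would supply the heights for one color of bar, one bar per feature.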

Aylr commented 6 years ago

@bbengfort I stumbled across this rad looking module that you should see. https://github.com/ResidentMario/missingno

I'm not sure about your thoughts on using it here, but it is definitely a cool take on missing data!

ndanielsen commented 6 years ago

@Aylr that is an awesome module, thanks for sharing

bbengfort commented 6 years ago

@Aylr that is a very cool module - perhaps we could contact the author and see if they wanted to include a Yellowbrick interface to their module?

bbengfort commented 6 years ago

Fixed in #304