Closed Aylr closed 6 years ago
@Aylr thanks for submitting the issue, it was certainly complete and well thought out!
A quick clarifying question: do all rows in your data set have at least one NaN or None value, or is it just some of the rows? I can understand why the computation does not plot rows in the data frame that have a null value - in fact, I'm surprised that it didn't raise some extremely obtuse exception. However, I would have expected it to plot the points that had all their data and simply ignore the rows with missing data (though that's probably even worse in terms of usability).
I've been giving some thought to your suggestions:
Here are my thoughts on how to deal with this:
plt.plot([1,np.nan, 2, np.nan], [1, 2, np.nan, np.nan])
does it plot (1,1) and none of the other points, or does it raise an exception?
A nice addition to Yellowbrick after this would be to create a MissingValuesVisualizer or something like that, which goes through each column and lists the percentage of np.nan, None (or a supplied missing value). Sort of similar to a boxplot to show the distribution of columns in the data frame, but to highlight underfilled rows. This could also display an additional column: the number of rows missing one or more data elements.
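To make the idea concrete, here is a minimal sketch of the numbers such a visualizer might report. It assumes missing values are encoded as np.nan in a float array; the name `missing_summary` is hypothetical and is not part of Yellowbrick.

```python
import numpy as np


def missing_summary(X):
    """
    Hypothetical helper (not a Yellowbrick API): returns the percentage of
    NaN values in each column and the number of rows with at least one NaN.
    """
    X = np.asarray(X, dtype=float)
    nan_mask = np.isnan(X)
    # Mean of a boolean mask along axis 0 is the fraction missing per column
    pct_by_column = 100.0 * nan_mask.mean(axis=0)
    # A row counts as incomplete if any of its entries is NaN
    rows_with_missing = int(nan_mask.any(axis=1).sum())
    return pct_by_column, rows_with_missing


X = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
    [np.nan, np.nan, 12.0],
])
pct, n_rows = missing_summary(X)
```

The per-column percentages could feed one bar each, with the incomplete-row count shown as the extra column.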
An extra complete solution to this issue would be to publish a blog post, "Finding and Imputing Missing Values with Scikit-Learn and Yellowbrick". It would highlight the issue you discovered here and go through how to use the feature visualizers to interpret the fact that missing values are present, then use various imputer strategies and reflect on how those strategies influence models with visual diagnostics.
Ok, so that's a lot - let me know what you think; we'd love a PR!
Thanks for the detailed and thoughtful response!
It appears that it is only some of the rows that have null/nan/None values. None of the columns are fully null/nan/None.
I totally agree with your assessment that a warning (number 2 in the first list) is the best UX and the most rigorous data-wise. Thanks for the response on each option.
I'll dig into the excellent suggestions 1-3 and 5 to keep the scope contained initially, then 4 can be investigated at another time/issue.
I'm having a hard time imagining the MissingValuesVisualizer and I'd like to see a quick sketch of what you are thinking regarding it. It is a good idea. If you haven't seen or tried the unreal pandas-profiling package, it is pure genius and the value-to-effort ratio is off the charts. One line generates a highly visual and interactive HTML report. I think this should be a separate issue for scope containment.
I appreciate your "extra complete" solution challenge, and will take you up on it on my team's blog once a fix is in!
I'll have an attempt at a PR to plot points that have data (non-null containing rows) and raise a warning.
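A rough sketch of that behavior, assuming NaN-encoded missing values in a numeric array. The helper name `drop_missing_rows` and the warning text are illustrative only, not what the eventual PR implements.

```python
import warnings

import numpy as np


def drop_missing_rows(X, y=None):
    """
    Hypothetical helper: keeps only the rows of X that contain no NaN
    values and warns about how many rows were dropped.
    """
    X = np.asarray(X, dtype=float)
    keep = ~np.isnan(X).any(axis=1)  # True for fully populated rows
    n_dropped = int((~keep).sum())
    if n_dropped > 0:
        warnings.warn(
            f"{n_dropped} of {X.shape[0]} rows contain missing values "
            "and will not be plotted",
            UserWarning,
        )
    if y is None:
        return X[keep]
    # Filter the targets with the same mask so X and y stay aligned
    return X[keep], np.asarray(y)[keep]


X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([0, 1, 1])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X_clean, y_clean = drop_missing_rows(X, y)
```

Filtering y with the same mask matters for visualizers that color points by class.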
Awesome - thanks! I'll take a look at the PR later this week when I get a chance. I wanted to provide some quick test code in this issue since you're using a data set that I'm unfamiliar with:
Dataset generation:
import numpy as np
from functools import partial
from yellowbrick.features import RadViz
from sklearn.datasets import make_classification

make_classification = partial(
    make_classification,
    n_features=8, n_informative=8, n_redundant=0, n_repeated=0, n_classes=3, n_clusters_per_class=1
)
X, y = make_classification()
This gives us a quick, small, and fairly decent classification dataset. We can then add missing values as follows:
def make_missing(X, prob=0.1):
    """
    Fills X with np.nan values according to the specified probability.
    """
    mask = np.random.rand(*X.shape) < prob
    data = np.ma.array(X, mask=mask)
    return data.filled(np.nan)

Xm = make_missing(X)
Xm looks something like:
array([[-0.55322532, -0.12594172, -0.43628172,  0.2264543 ,  0.71704831,
         0.1440676 ,  2.43757131,  1.86737237],
       [ 0.33596098,  0.82147654,         nan,         nan,  1.38045263,
         1.0826561 ,  1.80021195, -0.8004254 ],
       [ 2.16903466,  1.04991704, -2.26115809,  0.89759197, -0.74898157,
         2.05476412, -2.10742346,         nan],
       [ 2.40884037, -1.05473497,  1.47406077,         nan,  0.24177879,
         4.08996519,  2.36670599, -0.25488042],
       [        nan, -1.64383729, -1.64765048, -0.91609236, -1.49084724,
        -0.15760646, -1.93852782,         nan],
       [ 2.02547757, -0.51120827,  0.83876765, -1.06591735,  2.07140745,
        -0.05072884,  0.37827661,  0.79013925],
       [ 4.27348958, -0.89094496,         nan, -1.55889069,  1.91152567,
         0.46875422,  0.40472012, -1.1402393 ],
       [-1.84336047, -1.30755628, -0.15463315,         nan, -0.31693837,
         2.74729702,  1.41851952,  2.0069317 ],
       [        nan,  2.7322347 , -2.0142525 , -0.57522097,  3.21226618,
         0.06455021,         nan,  1.51251459]])
I'll try to use this to sketch out the MissingValues plot.
For missing values I was thinking of something simple like:
But maybe conditioned by class (e.g. a different color bar for each class's percentage of missing values). And maybe add another bar: the percentage of rows missing at least one value.
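The class-conditioned numbers behind such a plot could be computed roughly as follows. This is only a sketch, assuming NaN-encoded missing values and one label per row; `missing_by_class` is a hypothetical name, not an existing Yellowbrick function.

```python
import numpy as np


def missing_by_class(X, y):
    """
    Hypothetical helper: for each class label, the percentage of NaN
    values in each column, computed over that class's rows only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    nan_mask = np.isnan(X)
    return {
        label: 100.0 * nan_mask[y == label].mean(axis=0)
        for label in np.unique(y)
    }


X = np.array([
    [np.nan, 1.0],
    [2.0, 3.0],
    [4.0, np.nan],
    [np.nan, np.nan],
])
y = np.array([0, 0, 1, 1])
by_class = missing_by_class(X, y)

# The extra bar: percentage of rows missing at least one value
pct_rows_missing = 100.0 * np.isnan(X).any(axis=1).mean()
```

Each entry of `by_class` would give the heights of one color's bars, one bar per column.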
@bbengfort I stumbled across this rad looking module that you should see. https://github.com/ResidentMario/missingno
I'm not sure about your thoughts on using it here, but it is definitely a cool take on missing data!
@Aylr that is an awesome module, thanks for sharing
@Aylr that is a very cool module - perhaps we could contact the author and see if they wanted to include a Yellowbrick interface to their module?
Fixed in #304