DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

ParallelCoordinates couldn't handle missing value when using normalization #360

Open Juan0001 opened 6 years ago

Juan0001 commented 6 years ago

While trying to use ParallelCoordinates with normalization on a dataset with missing values, I got the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I managed to get around it by normalizing my data myself (ignoring the missing values) before feeding it into the visualizer. I hope this can be fixed within the visualizer.
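For anyone hitting the same error, the workaround can be sketched roughly as follows. This is a minimal illustration, assuming standard (z-score) scaling and NumPy arrays; the function name `normalize_ignoring_nan` is made up for this example and is not part of yellowbrick.

```python
import numpy as np


def normalize_ignoring_nan(X):
    """Standardize each column using only its non-NaN entries.

    Hypothetical helper for illustration; NaN entries stay NaN in the output.
    """
    X = np.asarray(X, dtype=float)
    mean = np.nanmean(X, axis=0)  # column means, skipping NaN
    std = np.nanstd(X, axis=0)    # column standard deviations, skipping NaN
    std[std == 0] = 1.0           # avoid dividing by zero for constant columns
    return (X - mean) / std


X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
])
X_scaled = normalize_ignoring_nan(X)
```

The scaled array keeps the missing entries in place, so the visualizer's own normalization has to be left off when passing it in.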

Thank you.

bbengfort commented 6 years ago

Missing data is definitely a problem for the visualizers. In general, we expect something that looks like this:

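# Note: Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement.
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from yellowbrick.features import ParallelCoordinates

# Fill missing values first, then pass the imputed data to the visualizer.
model = Pipeline([
    ('impute', SimpleImputer()),
    ('viz', ParallelCoordinates()),
])

model.fit_transform(X, y)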

In the near term, perhaps this will help? In the medium term, @ndanielsen is working on some missing data visualizers (#366), and RadViz has already been updated to visualize anything that is not NaN (#302), so we could do the same with parallel coordinates and prevent this problem.
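As a rough sketch of what "visualize anything that is not NaN" could mean in practice, one option is to drop every row that contains a missing value before drawing. This assumes row-wise filtering, which is not necessarily how #302 handles it; `drop_nan_rows` is a hypothetical helper, not a yellowbrick function.

```python
import numpy as np


def drop_nan_rows(X, y=None):
    """Keep only the rows of X that contain no NaN (hypothetical helper).

    If y is given, the matching targets are filtered the same way.
    """
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X).any(axis=1)  # True for rows with no missing values
    if y is None:
        return X[mask]
    return X[mask], np.asarray(y)[mask]


X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [4.0, 5.0],
])
y = np.array([0, 1, 0])
X_clean, y_clean = drop_nan_rows(X, y)
```

The trade-off is that instances with even one missing feature disappear from the plot, whereas imputation keeps them at an estimated position.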

@Juan0001 thanks for posting the issue!

Juan0001 commented 6 years ago

@bbengfort That's exactly what I did for my problem. If the missing values are imputed before the data is fed into ParallelCoordinates, there is no problem. But I think it would be good to be able to choose whether to impute the missing values before using ParallelCoordinates. I will take a look at the package to see if I can make any improvement there.
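To make the idea of an opt-in imputation step concrete, here is a small sketch. The `maybe_impute` function and its `impute` argument are purely hypothetical; nothing like this is confirmed to exist in yellowbrick.

```python
import numpy as np


def maybe_impute(X, impute=None):
    """Optionally fill NaN per column (hypothetical helper for illustration).

    impute=None returns the data unchanged; "mean" or "median" fills each
    missing entry with that statistic of the column's observed values.
    """
    X = np.asarray(X, dtype=float)
    if impute is None:
        return X
    if impute == "mean":
        fill = np.nanmean(X, axis=0)
    elif impute == "median":
        fill = np.nanmedian(X, axis=0)
    else:
        raise ValueError(f"unknown impute strategy: {impute!r}")
    # np.where broadcasts the per-column fill values across all rows
    return np.where(np.isnan(X), fill, X)


X = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 8.0],
])
X_filled = maybe_impute(X, impute="mean")
```

Leaving the default at `None` keeps the current behavior, so the user has to ask for imputation explicitly.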

Thank you very much!