DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

AlphaSelection not being scored properly #157

Open bbengfort opened 7 years ago

bbengfort commented 7 years ago

The AlphaSelection visualizer, implemented in #103, has a slight bug:

Right now, the visualizer finds the alphas and errors by searching the model for known attributes (rather than dispatching on specific model names). However, some models return differently shaped values for these attributes in different scenarios, which raises an error during plotting because the x and y values must have the same shape.
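To make the failure mode concrete, here is a minimal sketch of an attribute search followed by a shape check. It is not the Yellowbrick implementation: the helper names (`first_attr`, `find_alphas_and_errors`, `check_plottable`) and the candidate attribute lists are assumptions chosen for illustration, and the model is a stand-in object rather than a fitted estimator.

```python
import numpy as np
from types import SimpleNamespace

# Hypothetical candidate names; the real visualizer may search a different list.
ALPHA_ATTRS = ("alphas_", "cv_alphas_", "alphas")
ERROR_ATTRS = ("mse_path_", "cv_values_")


def first_attr(model, names):
    """Return the first attribute in `names` that exists on the model."""
    for name in names:
        value = getattr(model, name, None)
        if value is not None:
            return np.asarray(value)
    raise AttributeError("model has none of: " + ", ".join(names))


def find_alphas_and_errors(model):
    """Look up the raw alphas and raw error array without reducing them."""
    return first_attr(model, ALPHA_ATTRS), first_attr(model, ERROR_ATTRS)


def check_plottable(alphas, errors):
    """Fail early, with a clear message, if x and y cannot be plotted together."""
    if errors.ndim != 1 or errors.shape != alphas.shape:
        raise ValueError(
            "alphas have shape %s but errors have shape %s"
            % (alphas.shape, errors.shape)
        )


# Stand-in for a fitted multi-target model: [n_samples, n_targets, n_alphas]
model = SimpleNamespace(
    alphas=np.array([0.1, 1.0, 10.0]),
    cv_values_=np.zeros((10, 2, 3)),
)
alphas, raw_errors = find_alphas_and_errors(model)

# Averaging only over axis 0 assumes the 2D layout and leaves shape (2, 3),
# which is the kind of mismatch that plotting then rejects.
try:
    check_plottable(alphas, raw_errors.mean(axis=0))
except ValueError as exc:
    print("shape check failed:", exc)
```

A check like this does not fix the averaging, but it turns an opaque plotting error into a message that names both shapes.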

For example, RidgeCV cv_values_ can be:

cv_values_ : array, shape = [n_samples, n_alphas] or shape = [n_samples, n_targets, n_alphas], optional. Cross-validation values for each alpha (if store_cv_values=True and cv=None). After fit() has been called, this attribute will contain the mean squared errors (by default) or the values of the {loss,score}_func function (if provided in the constructor).

But the current implementation only handles the shape [n_samples, n_alphas].
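One possible way to handle both documented layouts is to average over every axis except the last, since the alpha axis is last in both cases. The sketch below works on plain NumPy arrays; `mean_error_per_alpha` is a hypothetical helper, and averaging across targets is an assumption about what the plot should show, not something the documentation prescribes.

```python
import numpy as np


def mean_error_per_alpha(cv_values):
    """Reduce cv_values_ to one mean error per alpha.

    Accepts [n_samples, n_alphas] or [n_samples, n_targets, n_alphas].
    """
    cv_values = np.asarray(cv_values, dtype=float)
    if cv_values.ndim not in (2, 3):
        raise ValueError(
            "expected a 2D or 3D array, got %d dimensions" % cv_values.ndim
        )
    # Average over samples (and targets, if present); keep the alpha axis.
    leading_axes = tuple(range(cv_values.ndim - 1))
    return cv_values.mean(axis=leading_axes)


# Synthetic data: 5 samples, 3 alphas, with and without a targets axis.
single_target = np.random.rand(5, 3)
multi_target = np.random.rand(5, 2, 3)
print(mean_error_per_alpha(single_target).shape)
print(mean_error_per_alpha(multi_target).shape)
```

Both calls return a 1D array whose length equals the number of alphas, so the result can be plotted directly against the alpha values.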

Additionally, ElasticNetCV mse_path_ can be:

mse_path_ : array, shape (n_l1_ratio, n_alpha, n_folds). Mean square error for the test set on each fold, varying l1_ratio and alpha.

Which means we're probably not averaging over the right axis of this array.
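Given the documented layout, the folds are on the last axis, so averaging over that axis yields one error per (l1_ratio, alpha) pair. The sketch below uses synthetic arrays. The helper names are hypothetical, the 2D case (n_alpha, n_folds) is an assumption about what a single l1_ratio produces, and choosing the l1_ratio row with the lowest mean error is just one way to get a single curve to plot.

```python
import numpy as np


def fold_averaged_errors(mse_path):
    """Average the per-fold errors, which sit on the last axis."""
    mse_path = np.asarray(mse_path, dtype=float)
    if mse_path.ndim not in (2, 3):
        raise ValueError(
            "expected a 2D or 3D array, got %d dimensions" % mse_path.ndim
        )
    return mse_path.mean(axis=-1)


def errors_for_best_l1_ratio(mse_path):
    """Return a 1D error curve over alphas.

    For a 3D path, keep the l1_ratio row that contains the lowest mean error
    (an assumption made for illustration).
    """
    errors = fold_averaged_errors(mse_path)
    if errors.ndim == 1:
        return errors
    best_row = errors.min(axis=1).argmin()
    return errors[best_row]


# Synthetic path: 2 l1_ratio values, 4 alphas, 3 folds.
path = np.random.rand(2, 4, 3)
print(fold_averaged_errors(path).shape)
print(errors_for_best_l1_ratio(path).shape)
```

The first result keeps one row per l1_ratio; the second is a single curve with one value per alpha.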

Basically, we need to do a better job of identifying the alphas and MSE attributes on each model and of computing the scores for visualization; I think right now the plots might just be wrong.

bbengfort commented 7 years ago

@NealHumphrey, @balavenkatesan, @ndanielsen -- I could use a second pair of eyes on this if any of you guys have some time to take a look. Check out my alphas notebook in examples, and the yellowbrick/regressor/alphas.py module.