DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0
4.3k stars 559 forks source link

PredictionError plot forces min/max limits on the graph that are unhelpful if you have lots of tiny errors #1193

Closed ianozsvald closed 2 years ago

ianozsvald commented 3 years ago

Describe the bug

The default min/max behaviour in the PredictionError class is not helpful if you have many tiny errors - adding a +/- 1 boundary when you have only tiny errors gives you a plot like:

image

If I remove the limits in regressor/prediction_error.py in draw then I get:

image

I'm working on the Kaggle Optiver Volatility competition. My errors are in the range +/0 0.002 to 0.0000002.

The two relevant lines of code are here https://github.com/DistrictDataLabs/yellowbrick/blob/develop/yellowbrick/regressor/prediction_error.py#L207 and the comment in the code notes that the +/- 1 limits aren't perhaps the best choice.

I'm not sure when limits like this would be a good idea, maybe the idea is to move a few large errors off of the boundary of the plot? Maybe a solution includes a test for the expected range of the errors so a smaller proportional modifier might be used for smaller errors?

To Reproduce

None provided but I can make one if useful.

Dataset Did you use a specific dataset to produce the bug? Where can we access it?

Expected behavior

I'd expect to not have the additional +/- 1 min and max boundary limits on the plot.

Desktop (please complete the following information):

rebeccabilbro commented 3 years ago

Hello @ianozsvald and thanks for reaching out! We should all be so lucky to have such small errors in our regression models 😁 😅

I really like the idea of allowing the error density to influence the plot range! Perhaps we could introduce a user param called something like zoom_in, that focuses in on the main density of the error, even if it means ignoring some error points on the outer boundaries. I think the main challenge will be to keep the plot square, since most folks seem to find these plots easier to reason about when the identity and best fit lines are consistently comparable.

This does seem like an excellent feature to add if anyone is looking for a good candidate for a contribution!

pkaf commented 2 years ago

My 2 cents:

a. Should not plot like this have synchronized x and y limits in a 1:1 aspect ratio axes? b. I reckon code will be neat if we give an ability to set limits to users through kwargs, no? Default can still be [-1, 1].

Idea to set automatic _zoomin is brilliant but bit complex. I can suggest a minimal solution that we set xlim/ylim as [-min(min(y), min(y_pred)), max(max(y), max(y_pred))].

I am happy to make these changes if you agree. I am open to your thoughts too.

bbengfort commented 2 years ago

@pkaf Yes plots, like this must have a 1:1 aspect ratio to make the 45-degree line interpretable. If a user supplies a non 1:1 aspect ratio, I'm fine with that. Technically users can already set the limits on the visualizer by accessing the `ax property of the visualizer, e.g. something like:

oz.ax.set_xlim((0, 0.35))
oz.ax.set_ylim((0, 0.35)) 

So perhaps instead of limits in the kwargs, we could add this to the documentation?

In terms of the zoom_in kwarg, I think that your minimal solution is a very good start - it would be an excellent quick PR to open!

pkaf commented 2 years ago

@bbengfort I will PR the minimal solution. Cheers.

pkaf commented 2 years ago

Pull request #1208 with minimal solution is done. Thank you.

bbengfort commented 2 years ago

Closed by #1208