DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

Random input feature dropping curve, model selection visualization [issue #1024] #1206

Status: Closed (charlesbmi closed this 2 years ago)

charlesbmi commented 2 years ago

Summary

This PR addresses https://github.com/DistrictDataLabs/yellowbrick/issues/1204, which requested a feature dropping curve (known in neural decoding as a neuron dropping curve).

I have made the following changes:

  1. Added a module, dropping_curve, to yellowbrick.model_selection. It is largely based on yellowbrick.model_selection.validation_curve.
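For readers unfamiliar with the technique, the idea behind a feature dropping curve can be sketched roughly as follows: for each subset size, draw several random subsets of the input features, cross-validate the estimator on each, and summarize the scores. This is only an illustration under assumptions, not the code added in this PR. The helper name `feature_dropping_scores`, its parameters, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


def feature_dropping_scores(estimator, X, y, feature_sizes, n_drops=5, cv=3, seed=0):
    """Hypothetical helper: mean/std CV score for random feature subsets.

    Illustrative only; this is not the yellowbrick implementation.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    results = {}
    for frac in feature_sizes:
        # Convert the fraction of features into a count of at least one column.
        k = max(1, int(round(frac * n_features)))
        scores = []
        for _ in range(n_drops):
            # Keep k randomly chosen columns and drop the rest.
            cols = rng.choice(n_features, size=k, replace=False)
            scores.append(cross_val_score(estimator, X[:, cols], y, cv=cv).mean())
        results[k] = (float(np.mean(scores)), float(np.std(scores)))
    return results


# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
curve = feature_dropping_scores(GaussianNB(), X, y, np.linspace(0.1, 1.0, 4))
for k, (mean, std) in curve.items():
    print(f"{k:2d} features: {mean:.3f} +/- {std:.3f}")
```

Plotting the mean score against the number of retained features, with the standard deviation as a band, gives the general shape of curve this PR is about.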

Sample Code and Plot

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

from yellowbrick.datasets import load_game
from yellowbrick.model_selection.dropping_curve import dropping_curve

# Load Connect-4 game data
X, y = load_game()
X_enc = OneHotEncoder().fit_transform(X)
le = LabelEncoder()
y_enc = le.fit_transform(y)

dropping_curve(
    MultinomialNB(),
    X_enc,
    y_enc,
    feature_sizes=np.linspace(0.05, 1, 20),
)

[Figure_1: plot produced by the sample code]

TODOs and questions

Still to do:

Questions for the @DistrictDataLabs/team-oz-maintainers:

CHECKLIST

bbengfort commented 2 years ago

@charlesincharge Happy New Year! And thank you for opening this PR into Yellowbrick! @lwgray @pdamodaran @Kautumn06 would any of you be interested in doing the code review for this?

pdamodaran commented 2 years ago

@bbengfort - apologies for the delay in responding. I can review this PR next weekend.

bbengfort commented 2 years ago

Thanks @pdamodaran!

codecov[bot] commented 2 years ago

Codecov Report

Merging #1206 (67566cc) into develop (6fb2e9b) will increase coverage by 0.02%. The diff coverage is 92.30%.

@@             Coverage Diff             @@
##           develop    #1206      +/-   ##
===========================================
+ Coverage    90.45%   90.48%   +0.02%     
===========================================
  Files           91       92       +1     
  Lines         5135     5200      +65     
===========================================
+ Hits          4645     4705      +60     
- Misses         490      495       +5     

Impacted Files                                   Coverage Δ
yellowbrick/model_selection/dropping_curve.py    92.18% <92.18%> (ø)
yellowbrick/model_selection/__init__.py          100.00% <100.00%> (ø)
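
The percentages in the report follow from the Hits and Lines rows as hits divided by lines. The short check below reproduces them. Truncating to two decimals (rather than rounding) is an assumption inferred from the reported figures, and the helper name `pct` is hypothetical.

```python
import math


def pct(hits, lines):
    # Assumption: percentages are truncated, not rounded, to two decimals.
    return math.floor(hits / lines * 10000) / 100


before = pct(4645, 5135)                   # develop
after = pct(4705, 5200)                    # after merging #1206
diff = pct(4705 - 4645, 5200 - 5135)       # coverage of the added lines only
print(before, after, diff)
```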

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6fb2e9b...67566cc.

lwgray commented 2 years ago

@charlesincharge what version of matplotlib did you use when developing this feature?

charlesbmi commented 2 years ago

@lwgray I used matplotlib 3.5.0

lwgray commented 2 years ago

@charlesincharge Thanks for contributing such a wonderful visualizer. I have spent quite a bit of time working on it because of conflicts with our recent upgrade to scikit-learn 1.0. To help us get this merged, would you mind writing the documentation? You can read more about it here: https://www.scikit-yb.org/en/latest/contributing/developing_visualizers.html#documentation

Here is a recent PR that added docs: https://github.com/DistrictDataLabs/yellowbrick/pull/1189

charlesbmi commented 2 years ago

Awesome, thanks for all the help! Hope this is useful for everyone!

lwgray commented 2 years ago

@charlesincharge This is a wonderful visualizer. Do you think you can do the docs?

charlesbmi commented 2 years ago

Thanks! Yes, I'll write some docs!

lwgray commented 2 years ago

@charlesincharge Can you reference https://github.com/DistrictDataLabs/yellowbrick/issues/1235 in your new PR?