capitalone / rubicon-ml

Capture all information throughout your model's development in a reproducible way and tie results directly to the model code!
https://capitalone.github.io/rubicon-ml/
Apache License 2.0
129 stars, 34 forks

[spike] Investigate Scikit-Learn's decision_function and score_samples #179

Closed shania-m closed 2 years ago

shania-m commented 2 years ago

Is your enhancement request related to a problem? Please describe

Scikit-Learn's Pipeline API provides two additional methods that are not covered by Rubicon's Scikit-Learn integration: decision_function and score_samples.

Further investigate decision_function and score_samples to determine whether they should be integrated into Rubicon.

Additional Sources

decision_function in Scikit-learn examples:

score_samples in Scikit-learn examples:

shania-m commented 2 years ago

Scikit-learn defines decision_function as:

In a fitted classifier or outlier detector, predicts a “soft” score for each sample in relation to each class, rather than the “hard” categorical prediction produced by predict. Its input is usually only some observed data, X.

If the estimator was not already fitted, calling this method should raise a exceptions.NotFittedError.
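To make the definition concrete, here is a minimal sketch comparing the "soft" scores from decision_function with the "hard" labels from predict, and checking the NotFittedError behavior. The toy data and the choice of LogisticRegression are assumptions made for illustration only; they are not taken from this issue.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression

# Tiny, linearly separable toy data (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression()

# Calling decision_function before fit should raise NotFittedError.
try:
    clf.decision_function(X)
    raised = False
except NotFittedError:
    raised = True

clf.fit(X, y)
soft = clf.decision_function(X)  # one signed score per sample (binary case)
hard = clf.predict(X)            # hard class labels

# For a binary linear classifier, a positive score maps to classes_[1].
from_soft = clf.classes_[(soft > 0).astype(int)]
```

In the binary case the hard labels can be recovered from the sign of the soft scores, which is why decision_function is closer to a prediction than to a metric.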

Scikit-learn's decision_function predicts soft scores for samples. Because predictions are out of scope for Rubicon, there is no need to implement logging for decision_function. Note that in some estimators, such as EllipticEnvelope, decision_function is called by predict, which is in turn used in score. Additionally, decision_function uses score_samples.
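The relationship between these methods in EllipticEnvelope can be checked with a short sketch. The random data and the contamination value are assumptions chosen for illustration; offset_ is the fitted attribute that scikit-learn documents for this estimator.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))  # illustrative data only

env = EllipticEnvelope(contamination=0.1, random_state=0).fit(X)

scores = env.score_samples(X)        # negative Mahalanobis distances
decision = env.decision_function(X)  # soft outlier scores
labels = env.predict(X)              # +1 for inliers, -1 for outliers

# decision_function is score_samples shifted by the fitted offset_.
shifted = scores - env.offset_

# predict thresholds decision_function at zero.
thresholded = np.where(decision >= 0, 1, -1)
```

Here `shifted` should match `decision` and `thresholded` should match `labels`, so logging score() already reflects what decision_function computes for this estimator.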

score_samples, on the other hand, computes a score for each individual sample. Many estimators use the sum or the mean of these per-sample scores to calculate score(), such as the FactorAnalysis, PCA, BayesianGaussianMixture, and HalvingGridSearchCV estimators. score_samples can also be used for density estimation. Scikit-learn examples show score_samples being used in Density Estimation, Density Estimation for a Gaussian Mixture, Kernel Density Estimation for Species Distributions, and Simple 1D Kernel Density Estimation.
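A small sketch of the sum and mean relationship, using PCA (where score() is the mean of the per-sample log-likelihoods) and KernelDensity (where score() is their sum). The random data, n_components, and bandwidth are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))  # illustrative data only

# PCA: score() is the average of the per-sample log-likelihoods.
pca = PCA(n_components=2).fit(X)
pca_per_sample = pca.score_samples(X)

# KernelDensity: score() is the total log-likelihood, i.e. the sum
# of the per-sample log-densities.
kde = KernelDensity(bandwidth=0.5).fit(X)
kde_per_sample = kde.score_samples(X)
```

Comparing `pca.score(X)` with `pca_per_sample.mean()` and `kde.score(X)` with `kde_per_sample.sum()` shows that score_samples carries strictly more information than the single value returned by score().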

Since score_samples() can be used explicitly for density estimation and is also used to calculate score(), Rubicon should support logging for score_samples(). Similar to the solution proposed in #176, when a user calls score_samples(), a new experiment should be opened unless the user specifies which experiment to log to.
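The proposed behavior could look roughly like the sketch below. Everything here is hypothetical: `InMemoryExperiment` and `LoggingEstimator` are stand-ins invented for illustration and are not Rubicon's API, and the `experiment` keyword is an assumed name for the way a user would pick an experiment.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


class InMemoryExperiment:
    """Hypothetical stand-in for an experiment; not Rubicon's API."""

    def __init__(self, name):
        self.name = name
        self.metrics = {}

    def log_metric(self, name, value):
        self.metrics[name] = value


class LoggingEstimator:
    """Hypothetical wrapper that logs the output of score_samples."""

    def __init__(self, estimator):
        self.estimator = estimator
        self.created_experiments = []

    def score_samples(self, X, experiment=None):
        # Open a new experiment unless the caller chose one.
        if experiment is None:
            name = f"experiment-{len(self.created_experiments)}"
            experiment = InMemoryExperiment(name)
            self.created_experiments.append(experiment)
        scores = self.estimator.score_samples(X)
        experiment.log_metric("score_samples", scores.tolist())
        return scores


X = np.random.RandomState(0).normal(size=(20, 1))  # illustrative data
wrapped = LoggingEstimator(KernelDensity().fit(X))

wrapped.score_samples(X)                   # no experiment given: opens a new one
chosen = InMemoryExperiment("chosen")
wrapped.score_samples(X, experiment=chosen)  # logs to the caller's experiment
```

Whether per-sample scores should be logged as a single list-valued metric or summarized first is left open here; the sketch stores the full list only to keep the example short.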