DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0
4.29k stars, 559 forks

Call _validate_steps in test_validate_steps #1307

Open danilobellini opened 1 year ago

danilobellini commented 1 year ago

This change is needed because scikit-learn no longer calls Pipeline._validate_steps during construction, while the tests still expect that validation to run. I tried it with scikit-learn v1.2.2 and scikit-learn v1.3.0. Upstream removed the validation from Pipeline.__init__ in commit 0110921, for v1.1.0.

I currently maintain yellowbrick's AUR package, and I'm having trouble updating it to v1.5.0 because of failing tests. This test is one of them; I brought it here in this pull request because it was simple to fix.

I ran black and pytest on the resulting file.

lwgray commented 1 year ago

@bbengfort Will we need to upgrade to scikit-learn v1.3.0? I worry about calling ._validate_steps().

The _validate_steps method in a scikit-learn Pipeline is a private method that checks whether the steps of the pipeline are defined correctly. Every step before the final one should be a transformer (i.e., it should have fit and transform methods), and the final step should be an estimator (i.e., it should have a fit method).
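To make the rule concrete, here is a rough pure-Python sketch of that kind of duck-typing check. It is a simplified stand-in, not scikit-learn's actual code: `validate_steps`, `Scaler`, and `Model` are hypothetical names, and special cases such as the "passthrough" placeholder are ignored.

```python
def validate_steps(steps):
    """Simplified stand-in for the check described above; not scikit-learn's code."""
    if not steps:
        raise ValueError("the pipeline has no steps")

    # Split the (name, object) pairs into the intermediate steps and the final one.
    *intermediate, (final_name, final_step) = steps

    for name, step in intermediate:
        # An intermediate step must be able to fit and to transform.
        if not (hasattr(step, "fit") and hasattr(step, "transform")):
            raise TypeError(f"intermediate step {name!r} is not a transformer")

    # The final step only needs to be able to fit.
    if not hasattr(final_step, "fit"):
        raise TypeError(f"final step {final_name!r} does not implement fit")


class Scaler:  # hypothetical transformer
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class Model:  # hypothetical estimator
    def fit(self, X, y=None):
        return self


# A transformer followed by an estimator passes silently.
validate_steps([("scale", Scaler()), ("model", Model())])

# Swapping the order puts a non-transformer in an intermediate slot.
try:
    validate_steps([("model", Model()), ("scale", Scaler())])
except TypeError as exc:
    print(exc)
```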

Calling _validate_steps() explicitly in your test cases will make sure that this validation is performed at the moment you define the pipeline, rather than later when you try to fit or transform data with the pipeline.

In @danilobellini's code, calling _validate_steps() right after the Pipeline or VisualPipeline instantiation will cause the validation to happen immediately. This means that if there's a problem with the steps (e.g., a non-transformer object in an intermediate step, or a non-estimator object as the final step), a TypeError will be raised immediately, rather than later on when you try to use the pipeline.

This could make the tests clearer and more direct, as he is specifically testing the validation of the pipeline steps, and it's useful to have that validation happen as explicitly and immediately as possible. However, I'm aware that _validate_steps is a private method (indicated by the leading underscore), which means that it's not part of the public API of the Pipeline class and could potentially change in future versions of scikit-learn. Using private methods can sometimes lead to less stable code, as they're not guaranteed to stay the same in the way that public methods are.