automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

`predict_proba` in classifier estimators is doing needless assertions #1529

Open eddiebergman opened 2 years ago

eddiebergman commented 2 years ago

https://github.com/automl/auto-sklearn/blob/5e21e9cbd405eaef47b5e5d68cf092254ccffb51/autosklearn/estimators.py#L1453-L1465

There's a lot of assertion checking going on here, which can really eat into inference time. While the checks are helpful, they seem like something that should really be enforced in testing instead.
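
For context, the checks are roughly along these lines (an illustrative sketch rather than a verbatim copy of the linked code; the exact conditions and messages differ):

import numpy as np

def _check_predict_proba(proba: np.ndarray) -> None:
    # Both checks scan the entire probability array on every call to predict_proba.
    assert ((proba >= 0) & (proba <= 1)).all(), "probabilities must lie in [0, 1]"
    # For binary/multiclass output, each row should sum to (approximately) one.
    assert np.allclose(proba.sum(axis=1), 1.0), "probabilities must sum to 1 per row"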

Mathanraj-Sharma commented 2 years ago

@eddiebergman I recently started using auto-sklearn and would like to contribute to it.

Could you please assign this issue to me and guide me on it?

eddiebergman commented 2 years ago

Hi @Mathanraj-Sharma,

For sure, happy to have a contribution! Essentially, while doing some refactoring I noticed these assert statements and found them very "test-like", i.e. this should really be a test rather than something done every time someone calls predict_proba. Normally assertions are not too costly, but seeing as this does two checks over the entire probability array, I imagine the cost is not negligible.

It is good to have these checks, but I can imagine a user getting these messages and not knowing what to do with them. To be honest, if I got this message, I wouldn't know what to do with it either. Seeing as it has never been reported, I assume we are safe and this isn't a problem in practice.

So, seeing as this has never been raised and the recovery mechanism isn't clear, I would want to essentially drop these statements if they take a meaningful amount of time. This means we need a quick benchmark of how long they actually take with various sizes of probability arrays to confirm they are costly, and if so, drop them. Secondly, we need to make sure there is a test somewhere that captures this assertion, both on individual components and on the pipeline as a whole. I did a quick check and we don't do any such check on classifier components, and we have no such check for AutoML instances either.

Testing the second part is not the most straightforward. The first thing to note is that we have Estimators, which are meant to be sklearn-compliant wrappers around AutoML, which does the grunt work of everything. An estimator mostly just calls its underlying AutoML's methods. Hence, we can test the probabilities on an AutoML instance.

We use cases, which are trained AutoML instances that are cached and have their trained state copied, so we can run different tests without retraining each time. These cases are passed as a parameter to the tests, but extracting information about how a case was trained is a bit more difficult. Testing the multi-label part would require something like:

from autosklearn.automl import AutoMLClassifier
from autosklearn.constants import MULTILABEL_CLASSIFICATION

from pytest_cases import parametrize_with_cases

import test.test_automl.cases as cases

@parametrize_with_cases("automl", cases=cases, has_tag=["classifier", "fitted"])
def test_predict_proba_consistent_output(automl: AutoMLClassifier) -> None:
    """
    Expects
    -------
    * Whatever we expect of the output
    """
    task_type = automl._task
    if task_type == MULTILABEL_CLASSIFICATION:
        # expectations specific to multi-label probabilities
        assert ...

    # expectations common to all classification tasks
    assert ...

Recap:

  1. Benchmark to get an idea of how expensive these checks actually are.
  2. Add a test to individual classifier components (it could reuse the check sketched below).
  3. Add a test to the trained and cached AutoML instances.
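
As a rough idea of what both of these tests could share, here is a minimal sketch of an invariant check; the helper name and tolerance are placeholders rather than existing utilities in the codebase:

import numpy as np

def check_probabilities(proba: np.ndarray, multilabel: bool = False) -> None:
    # Placeholder helper capturing the invariants the runtime assertions enforce.
    assert proba.ndim == 2, "expected an array of shape (n_samples, n_outputs)"
    assert ((proba >= 0) & (proba <= 1)).all(), "entries must lie in [0, 1]"
    if not multilabel:
        # Binary/multiclass: each row is a distribution over the classes.
        assert np.allclose(proba.sum(axis=1), 1.0), "rows must sum to 1"

Both the component-level test and the AutoML-level test above could call this on their predict_proba output, which would let us drop the per-call assertions without losing coverage.
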
eddiebergman commented 2 years ago

Please let me know if there's any more info or help you need!

Mathanraj-Sharma commented 2 years ago

@eddiebergman thanks for assigning it to me and for the explanation. Let me go through it and get back to you.

Mathanraj-Sharma commented 2 years ago

@eddiebergman

By benchmarking, do you mean comparing the run time of the predict_proba method with and without the assert statements over a set of datasets? If so, do you have any particular datasets in mind?

eddiebergman commented 2 years ago

Hi @Mathanraj-Sharma,

Datasets are too much; this could just be done with toy data of a size comparable to a "small" dataset and a "large" one. The case that would take the longest is probably valid multi-label probabilities, but it would be good to see the other two as well, multiclass and binary.

Edit: I guess we do need some real example to get an idea of how long the models actually take to generate the probabilities, so we can see how much time the assertions take relative to the actual predictions.

The reason we need the benchmark is that there's disagreement over how much it might impact an end user.
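
A minimal sketch of such a micro-benchmark, timing only the two checks on toy probability arrays (the shapes and repeat count are arbitrary assumptions), might look like:

import time

import numpy as np

def time_checks(proba: np.ndarray, repeats: int = 10) -> float:
    # Time the two assertion-style checks over the whole array.
    start = time.perf_counter()
    for _ in range(repeats):
        assert ((proba >= 0) & (proba <= 1)).all()
        assert np.allclose(proba.sum(axis=1), 1.0)
    return (time.perf_counter() - start) / repeats

rng = np.random.default_rng(0)
for n_samples, n_classes in [(1_000, 2), (100_000, 10), (1_000_000, 10)]:
    # Build valid multiclass-style probabilities of the given shape.
    proba = rng.random((n_samples, n_classes))
    proba /= proba.sum(axis=1, keepdims=True)
    print(f"({n_samples}, {n_classes}): {time_checks(proba):.4f}s per call")

The same numbers could then be compared against how long a fitted model's predict_proba takes on data of a similar size to get the relative cost.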

Best, Eddie

Jingyu112 commented 10 months ago

Hello @eddiebergman,

I was wondering if this issue is still open, or if it has been solved already?

Thanks in advance