automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

Transformer should accept `y` argument in the `transform` method #1494

Open 00sapo opened 2 years ago

00sapo commented 2 years ago

According to the sklearn API, a data or feature pre-processor should accept the y argument in transform(). For instance, I'm trying to add balancing algorithms, which need y because they add/remove samples and therefore have to change the target vector as well.

I think it requires just a simple edit in this line:

https://github.com/automl/auto-sklearn/blob/b2ac331c500ebef7becf372802493a7b235f7cec/autosklearn/pipeline/components/base.py#L253
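
To make the request concrete, here is a minimal sketch of the kind of component in question (the class and names are illustrative, not auto-sklearn's actual API): resampling changes the number of rows, so transform needs y and must return both arrays.

```python
import numpy as np

class RandomUndersampler:
    """Toy undersampler: drops majority-class rows until classes balance."""

    def fit(self, X, y):
        # Nothing to learn for random undersampling; kept for API symmetry.
        return self

    def transform(self, X, y=None):
        if y is None:
            # Test time: no resampling wanted, pass the data through.
            return X
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        keep = np.hstack([
            np.random.choice(np.where(y == c)[0], n_min, replace=False)
            for c in classes
        ])
        # Both X and y shrink, so both must be returned.
        return X[keep], y[keep]
```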

eddiebergman commented 2 years ago

Yup, that would be a good thing to fix! Apologies: we can fix it quickly, but we won't be making a new release until perhaps the end of June due to some other features coming in. You can use the development branch once that fix is in. Otherwise, if you need it desperately right now, you can make a fork and use that.

00sapo commented 2 years ago

Well, actually, thinking about it, the main problem is somewhere else as well, because auto-sklearn doesn't pass the y data to my custom pre-processor. However, today is the first time I've looked into the auto-sklearn code, and I can't find that line. Still, I managed to hack around it by storing X and y in the fit method and using transform to change them, ignoring the input during training and returning it unchanged during testing.

Are you interested in some state-of-the-art balancing methods? If so, I can open a pull request when it's ready.
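
A sketch of the workaround described above, with hypothetical names (my_resampler stands in for whatever balancing routine is used): the resampling happens in fit, where y is available, and transform hands back the stored result during training while passing test data through untouched.

```python
class BalancingHack:
    def fit(self, X, y=None):
        # Resample here, where y is available, and stash the result.
        self._X_res, self._y_res = my_resampler(X, y)  # hypothetical helper
        self._training = True
        return self

    def transform(self, X):
        if self._training:
            # First call after fit (training): ignore the input and return
            # the resampled X. y still cannot be replaced, which is the
            # remaining problem.
            self._training = False
            return self._X_res
        return X  # test time: pass through unchanged
```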

eddiebergman commented 2 years ago

I guess the line you're looking for is this one? https://github.com/automl/auto-sklearn/blob/b2ac331c500ebef7becf372802493a7b235f7cec/autosklearn/pipeline/components/feature_preprocessing/__init__.py#L129-L130

At some point we definitely want to update the whole pipeline to 1) be more flexible, i.e. let you define your own pipelines, 2) be fully sklearn compliant, and 3) be more accessible from the outside. As you can imagine this is a large task, so smaller baby steps like your issue are a step in the right direction :)

So we do try to stay fully in the sklearn realm, i.e. we don't add XGBoost; however, we do have some custom implementations of things, so it's not infeasible as long as it doesn't add requirements. Do you have any literature to link to? Adding choices to the configuration space is a large decision and we would have to benchmark this, of course. I imagine this is doing some balancing with respect to the distributions of the X data?

@mfeurer when you're back, you might be interested in following up here

Best, Eddie

00sapo commented 2 years ago

Sure, I am trying to use oversampling and undersampling methods, letting BOHB learn which of the two works better (or whether to use both).

For oversampling, ProWRAS [1] looks like the state of the art, with extensive benchmarks against the previous state of the art. Another method (gamus [2]) is implemented in Python, but no fair benchmarks are available (they only tested against old methods; see [3] for extensive benchmarks up to 2019). All the implementations are provided by pyloras, which is consistent with imbalanced-learn, which itself implements the sklearn API (more or less... samplers expose fit_resample rather than fit_transform). ProWRAS has many parameters, but the authors tuned almost all of them except two; I'm using their tuned values.

For undersampling, there is a recent paper proposing a boosting-like method [4], but since I don't know how to introduce it into the auto-sklearn pipeline, and since they also show that classic clustering-based undersampling reaches comparable results [5], I'm using the cluster-based method from imbalanced-learn, which is almost free of hyper-parameters.

[1] Bej, Saptarshi, Kristian Schulz, Prashant Srivastava, Markus Wolfien, and Olaf Wolkenhauer. “A Multi-Schematic Classifier-Independent Oversampling Approach for Imbalanced Datasets.” IEEE Access 9 (2021): 123358–74. https://doi.org/10.1109/ACCESS.2021.3108450.

[2] Tripathi, Ayush, Rupayan Chakraborty, and Sunil Kumar Kopparapu. “A Novel Adaptive Minority Oversampling Technique for Improved Classification in Data Imbalanced Scenarios.” In 2020 25th International Conference on Pattern Recognition (ICPR), 10650–57, 2021. https://doi.org/10.1109/ICPR48806.2021.9413002.

[3] Kovács, György. “An Empirical Comparison and Evaluation of Minority Oversampling Techniques on a Large Number of Imbalanced Datasets.” Applied Soft Computing 83 (October 1, 2019): 105662. https://doi.org/10.1016/j.asoc.2019.105662.

[4] Koziarski, Michał. “Radial-Based Undersampling for Imbalanced Data Classification.” Pattern Recognition 102 (June 1, 2020): 107262. https://doi.org/10.1016/j.patcog.2020.107262.

[5] Yen, Show-Jane, and Yue-Shi Lee. “Cluster-Based Under-Sampling Approaches for Imbalanced Data Distributions.” Expert Systems with Applications 36, no. 3 (April 1, 2009): 5718–27. https://doi.org/10.1016/j.eswa.2008.06.108.

Implementations: pyloras (https://pypi.org/project/pyloras/) and imbalanced-learn (https://imbalanced-learn.org/).
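
A minimal usage sketch of the cluster-based undersampler via imbalanced-learn; note the fit_resample interface, which returns both arrays:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Imbalanced toy problem: roughly 90% of samples in one class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_res, y_res = ClusterCentroids(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))  # classes are balanced after resampling
```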

mfeurer commented 2 years ago

According to the sklearn API, a data or feature pre-processor should accept the y argument in transform().

There is indeed an issue here in that this interface is defined wrongly. However, the implementation does pass y-values correctly, and components such as SelectPercentile make use of this. TBH I'm somewhat confused about why this one neither accepts a y nor passes it to the underlying algorithm. I'd be very happy about a fix for this issue.
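
For illustration, this is the sklearn convention being referred to: y is consumed by fit (e.g. to score features), while transform itself only takes X.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
selector = SelectPercentile(f_classif, percentile=25).fit(X, y)  # y used here
X_sel = selector.transform(X)  # no y argument in sklearn's transform
print(X_sel.shape)  # (200, 5): the 5 best-scoring features are kept
```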

Are you interested in some state-of-the-art balancing methods? If so, I can open a pull request when it's ready.

Yes and no; there is a discussion of why we don't have any balancing methods in issue https://github.com/automl/auto-sklearn/issues/1164

00sapo commented 2 years ago

Actually, the problem is also in the returned value: if we pass X and y, we should return X and y. The existing code instead accepts X and y but only returns X. I could work around this line by storing what I need as an argument at fit time, but my transformer also changes the y array, and I still haven't found a way to change it in place. Anyway, it would be a weird hack, and I think the auto-sklearn code should accept X and y and return both of them.

eddiebergman commented 2 years ago

Hi @00sapo,

Do you have any test code that illustrates this failing? It would make going through and fixing this a lot easier.

Best, Eddie

mfeurer commented 2 years ago

Hi, unfortunately, we cannot accommodate a transformer changing y because, as mentioned above, scikit-learn doesn't define such an interface yet. Because we re-use their pipeline functionality, we are limited to what they provide. Hacking such a feature into auto-sklearn would break, at the latest, when one changed the number of data points: there would be no way of determining whether we are in training or test mode, and transformers would change the number of data points for test data (when predicting), too.
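
For reference (not something auto-sklearn adopts, for the reasons above), imbalanced-learn sidesteps exactly this problem with its own Pipeline subclass: samplers expose fit_resample, which the pipeline calls only during fit, so test data is never resampled at predict time.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
pipe = Pipeline([
    ("balance", RandomUnderSampler(random_state=0)),  # resamples in fit only
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)       # X and y are resampled before reaching the classifier
pipe.predict(X[:5])  # the prediction path skips the sampler entirely
```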

00sapo commented 2 years ago

Well, other objects throw an error if their methods are called in the wrong order. In the code I have in mind, calling transform before fit raises an error, and the transformer uses an internal flag to check whether fit was previously called. I've seen a similar pattern in other objects in auto-sklearn. In sklearn itself, the fit method is meant to be called only during training, while transform is meant to be called only during inference.

As an example, take any class from imbalanced-learn...
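
A sketch of the guard pattern being described, using sklearn's own helper for the internal fitted-or-not check:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

class GuardedTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.fitted_ = True  # trailing underscore marks a fitted attribute
        return self

    def transform(self, X):
        check_is_fitted(self)  # raises NotFittedError if fit was never called
        return X
```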

mfeurer commented 1 year ago

In sklearn itself, the fit method is meant to be called only during training, while transform is meant to be called only during inference.

How do you then transform data at training time?

As an example, take any class from imbalanced-learn

I am not familiar with imbalanced-learn; could you please give some further details?

00sapo commented 1 year ago

Sorry, transform can also be called during training (think of sklearn's PCA object, for instance: fit learns the components at training time, and transform then projects both training and test data).

In our balancing case, training calls fit_transform, while testing doesn't do anything (unless the balancing algorithm needs to transform the data dimensions, but this usually doesn't happen).
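
A minimal sketch of that pattern (illustrative code, not auto-sklearn's): resampling happens only via fit_transform at training time, while plain transform, used at test time, is the identity.

```python
import numpy as np

class FitTimeBalancer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X  # test time: dimensions unchanged, nothing to do

    def fit_transform(self, X, y=None):
        # Training time: oversample the minority class. Toy duplication here;
        # a real method such as ProWRAS would synthesize new samples instead.
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[np.argmin(counts)]
        extra = np.where(y == minority)[0]
        idx = np.hstack([np.arange(len(y)), extra])
        return X[idx], y[idx]
```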

As for the test code, I will try to put an example here as soon as I have time.

BradKML commented 1 year ago

@00sapo any considerations for these?

00sapo commented 1 year ago

I'm no longer interested in this issue for now, so I won't be working on it soon. By the way, as I explained in my previous comment, most of those methods are actually old. See [1] for extensive benchmarking.

[1] Kovács, György. “An Empirical Comparison and Evaluation of Minority Oversampling Techniques on a Large Number of Imbalanced Datasets.” Applied Soft Computing 83 (October 1, 2019): 105662. https://doi.org/10.1016/j.asoc.2019.105662.