automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

How to get an explicit expression for an ensemble model, such as the coefficients of a regression ensemble model #1544

Closed belzheng closed 2 years ago

belzheng commented 2 years ago

Short Question Description

Hi, I would like to know: if I use auto-sklearn for regression and specify, say, a random forest regressor, I know I can get the weights and models of the ensemble via get_models_with_weights() and show_models(). My question is whether I can get the regression coefficients of the final ensemble model by calling some API, and if not, whether users can write their own code to obtain the coefficients or an explicit expression for the ensemble model.

eddiebergman commented 2 years ago

Hi @belzheng,

So get_models_with_weights() will give you the ensemble weight for each model included in the ensemble. From there, you can query the models it returns to get further information about each specific model. I'm not really sure what you mean by "regression coefficients of the final ensemble model" in this case.

The way the ensemble works for regression is that we take a simple weighted sum over the regression outputs of each model to get the final output. In the case of classification, the weighted sum is taken over the predicted probabilities before they are turned into classes.
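As a minimal sketch of that weighted-sum combination (the `MeanModel` class and `ensemble_predict` helper here are hypothetical stand-ins, not auto-sklearn internals):

```python
class MeanModel:
    """Toy regressor: always predicts a fixed value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


def ensemble_predict(models_with_weights, X):
    """Weighted sum over each member model's predictions."""
    out = [0.0] * len(X)
    for weight, model in models_with_weights:
        preds = model.predict(X)
        for i, p in enumerate(preds):
            out[i] += weight * p
    return out


# Two ensemble members with weights 0.75 and 0.25:
ensemble = [(0.75, MeanModel(2.0)), (0.25, MeanModel(6.0))]
print(ensemble_predict(ensemble, [None, None]))  # [3.0, 3.0]
```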

If you have some sample code showing what you would like to be able to do, that would help me understand the question better and could help influence further design :) Thanks for reaching out!

Best, Eddie

belzheng commented 2 years ago

Ok, thanks. Let me explain the problem in another way; here is some sample code to illustrate the question:

So, my question is: how can I get the specific coefficients of the ensemble model?

eddiebergman commented 2 years ago

Hi @belzheng,

Sorry for the lack of response, conferences have been taking up our time :)

So auto-sklearn doesn't have a coef_ attribute. A simple model may have some simple learned parameters, which are often exposed as coef_, but not every model has them; take, for example, a KNeighborsRegressor.

Auto-sklearn will train a large number of different scikit-learn models and then produce an ensemble of them, as seen in get_models_with_weights(). Most of these will probably not have any coef_.

A good way to illustrate the problem: if we were to have a coef_ attribute like a linear regressor does, what would we put in it? There's no meaningful answer we can come up with that makes sense. The closest reasonable answer is the weights of the models in the final produced ensemble, but these models will differ from dataset to dataset, so comparing those weights from run to run has no real use.

If you were to share your use case, I could perhaps point you in the right direction but I feel there may be some misunderstanding of what autosklearn does.

Best, Eddie

belzheng commented 2 years ago

Thank you very much! So my new question is: since there are no coefficients here, how do we make predictions for the regression?

eddiebergman commented 2 years ago

You do not need coefficients for the task of regression, which is to estimate continuous numerical values from the data. You could create a model that always returns the mean of the training data; this has no coef_ but is technically a perfectly valid regression model.
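Such a mean-predicting model can be sketched in a few lines (this `MeanRegressor` is a hypothetical illustration, similar in spirit to scikit-learn's `DummyRegressor(strategy="mean")`):

```python
class MeanRegressor:
    """Always predicts the mean of the training targets; learns no coef_."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)  # the only learned "parameter"
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


reg = MeanRegressor().fit([[0], [1], [2]], [1.0, 2.0, 3.0])
print(reg.predict([[10], [20]]))  # [2.0, 2.0]
print(hasattr(reg, "coef_"))      # False
```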

The coefficients you mention are part of the LinearRegression model, as well as some others, generally "linear" models. There is a whole variety of other models in sklearn that don't have a coef_, for example a DecisionTreeRegressor. It can do regression without coef_, but it has other parameters, like tree_, that are learned from the data.
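A quick way to see this contrast, assuming scikit-learn is installed:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

print(hasattr(linear, "coef_"))  # True  - linear models expose coefficients
print(hasattr(tree, "coef_"))    # False - trees learn no coefficients
print(hasattr(tree, "tree_"))    # True  - the learned state is the tree itself
print(tree.predict([[2.0]]))     # regression output produced without any coef_
```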

This question is no longer really concerned with auto-sklearn, so I will close the issue. I advise exploring the scikit-learn documentation some more, finding some "non-linear" models, and investigating how they perform regression.

Best, Eddie

mfeurer commented 2 years ago

Hey, I think there was some talking past each other. Weights in a linear model are sometimes called "coefficients", and scikit-learn has the convention of exposing them under the name coef_. EnsembleSelection is a linear model, so one could assume there's a variable called coef_. However, we don't follow this convention here, and instead assign the weights to a variable called weights_. I'm wondering whether we should change the variable name or simply add a second variable, coef_, for compatibility?

eddiebergman commented 2 years ago

That's a good point I hadn't considered, but I think this would raise quite a few questions as to what coef_ means coming from auto-sklearn. I could see someone finding automl.coef_ and then wondering what those values are and what they mean. It would also leave both terms, weights_ and coef_, floating around.

I had a brief look at all the ensembles in sklearn, as well as their general user-guide section on ensembles, and found no reference to coef_; my guess is that it is reserved for individual models that use the coefficients directly, not as weights over other models.

mfeurer commented 2 years ago

Good points regarding the ensembles - we're somewhat like a voting classifier, and that doesn't have coef_ either. Therefore, I think we can leave things as they are.

belzheng commented 2 years ago

Thanks for your reply, and I'm sorry for taking so long to respond. What I want to ask is whether the lasso regression from sklearn can be added as a regressor in auto-sklearn, so that auto-sklearn's feature preprocessing can be used. However, I don't know how to extract the coefficients of this linear model the way sklearn displays model coefficients. When I tried to extract the coefficients after the feature transformation myself, I found that the coef_ attribute of the lasso had disappeared. Below is sample code to help you understand my problem better. Thanks!

eddiebergman commented 2 years ago

Hi @belzheng,

Thanks for the more descriptive answer. You should access the underlying estimator that you wrap in your LassoRegression class:

stepauto = regpip1.steps          # the pipeline's (name, step) pairs
regressor = stepauto[-1][1]       # the final step: your LassoRegression wrapper
regressor.coef_                   # AttributeError: the wrapper has no coef_
regressor.estimator.coef_         # the underlying sklearn Lasso's coefficients

You don't need to refit this pipeline; you could also get it directly from show_models(). The main issue is that you were trying to access coef_ on the wrapped estimator, and you need to access the underlying estimator to get all of the sklearn parameters.
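The wrapping issue can be reproduced outside auto-sklearn in a few lines. Here `LassoComponent` is a hypothetical stand-in for a custom component class that keeps the real sklearn estimator in an `estimator` attribute, as auto-sklearn components do; scikit-learn is assumed installed:

```python
from sklearn.linear_model import Lasso


class LassoComponent:
    """Wrapper holding the real sklearn estimator in `self.estimator`."""

    def fit(self, X, y):
        self.estimator = Lasso(alpha=0.1).fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)


comp = LassoComponent().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
print(hasattr(comp, "coef_"))   # False - the wrapper itself has no coef_
print(comp.estimator.coef_)     # the Lasso coefficients live one level down
```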

I hope this solves your problem?

Best, Eddie