automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

[Question] How to get values of categorical variables from a fit model? #1634

Open eliwoods opened 1 year ago

eliwoods commented 1 year ago

How can one get back the values of categorical variables from a fitted model? Let's say that I have a model where one of the features is a categorical variable. Is there a way to get back the values of that variable that were observed during training?

My use case is a multi-model system that I am building (multiple auto-sklearn models, not an ensemble), and I want to implement some logic for deciding which model to use depending on whether a certain categorical value was observed during training. In scikit-learn this could be easily accessed from the categories_ attribute of a OneHotEncoder, but given the complex nature of the auto-sklearn classes and their use of ensembles, I'm not sure where to begin looking.
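For reference, this is the plain scikit-learn behaviour mentioned above, as a minimal sketch. The toy data and the `was_seen` helper are made up for illustration and are not part of auto-sklearn:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training data with a single categorical column (illustrative only).
X_train = np.array([["red"], ["green"], ["red"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore").fit(X_train)

# categories_ holds one array per input column, listing the values seen in fit().
seen = set(encoder.categories_[0])


def was_seen(value):
    """Hypothetical helper: was this value observed during training?"""
    return value in seen


print(was_seen("red"))
print(was_seen("blue"))
```

A check like `was_seen` is the kind of lookup I would like to run against each auto-sklearn model before deciding which one to use.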

Alternatively, one could set an encoder to error on unknown values, and build logic around catching these errors. This doesn't work for auto-sklearn either because the "missing" category is always created, so models will always successfully predict on missing values without any sign that it was on an unknown categorical value.
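In plain scikit-learn, the error-based approach could look like the sketch below. The data and the `has_unknown_category` helper are illustrative assumptions; this is the behaviour that does not carry over to auto-sklearn:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training data (illustrative only).
X_train = np.array([["red"], ["green"]], dtype=object)

# With handle_unknown="error", transform() raises ValueError on unseen values.
strict = OneHotEncoder(handle_unknown="error").fit(X_train)


def has_unknown_category(encoder, X):
    """Hypothetical helper: True if X contains a value not seen in fit()."""
    try:
        encoder.transform(X)
    except ValueError:
        return True
    return False


print(has_unknown_category(strict, np.array([["blue"]], dtype=object)))
print(has_unknown_category(strict, np.array([["red"]], dtype=object)))
```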

Any help here would be appreciated.

Running version 0.14.7

eddiebergman commented 1 year ago

Hi @eliwoods,

Sorry for the long delay. Have you tried automl.show_models()? Each of the components is there, and you can introspect them accordingly. You can find the components here.

However, I'm fairly certain this will only show you models that are preserved on disk. If you are running for a long time, a lot of models will be trained, so we delete them from disk according to performance and memory cost. In that case, to find out whether something was "observed during training", you can use configs = auto_estimator.automl_.runhistory_.get_all_configs(), where auto_estimator is your AutoSklearnClassifier or AutoSklearnRegressor. These configs contain all the evaluated configurations, which you can then look up :)
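Put together, a rough sketch of the two lookups might look like this. The `summarize_search` helper is a made-up name; it only collects the two objects mentioned in this comment and assumes `auto_estimator` has already been fitted:

```python
def summarize_search(auto_estimator):
    """Hypothetical helper: gather what a fitted auto-sklearn estimator exposes.

    Assumes auto_estimator is a fitted AutoSklearnClassifier or
    AutoSklearnRegressor.
    """
    # Models that are still preserved on disk, with their components.
    kept_models = auto_estimator.show_models()
    # Every configuration evaluated during the search, kept on disk or not.
    all_configs = auto_estimator.automl_.runhistory_.get_all_configs()
    return kept_models, all_configs
```

Calling `kept_models, all_configs = summarize_search(automl)` after `automl.fit(...)` gives you both to inspect. Note that the configurations describe the evaluated pipelines, so you would still need to dig into them for the information you are after.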