alteryx / evalml

EvalML is an AutoML library written in Python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

allowed_model_families doesn't work in AutoMLSearch initialisation #4437

Closed: enfeizhan closed this issue 2 weeks ago

enfeizhan commented 1 month ago

This argument is supposed to confine the model family search scope. However, all model families are searched regardless of the value passed to this argument.

Version: 0.83.0

Code Sample, a copy-pastable example to reproduce your bug.

import evalml
X, y = evalml.demos.load_breast_cancer()
automl_with_ensembling = evalml.AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    allowed_model_families=['linear_model'],
    max_batches=4,
    ensembling=True,
)
print(automl_with_ensembling.allowed_model_families)
automl_with_ensembling.search()

automl_with_ensembling.allowed_model_families returns an empty list instead of the list of linear models. automl_with_ensembling.search() returns results for 5 batches, which are not limited to linear models:

{
    1: {
        'Random Forest Classifier w/ Label Encoder + Imputer + RF Classifier Select From Model': 2.085165023803711,
        'Total time of batch': 2.204948902130127
    },
    2: {
        'Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler + Select Columns Transformer': 1.0889050960540771,
        'Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler + Select Columns Transformer': 3.366680145263672,
        'Total time of batch': 4.698328256607056
    },
    3: {
        'Stacked Ensemble Classification Pipeline': 2.254487991333008,
        'Total time of batch': 2.36545991897583
    },
    4: {
        'Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler + Select Columns Transformer': 1.7583808898925781,
        'Random Forest Classifier w/ Label Encoder + Imputer + Select Columns Transformer': 4.039682149887085,
        'Total time of batch': 248.39868783950806
    },
    5: {
        'Stacked Ensemble Classification Pipeline': 2.4489309787750244,
        'Total time of batch': 2.5626749992370605
    }
}

enfeizhan commented 4 weeks ago

Is there anyone still active here?

eccabay commented 2 weeks ago

Hi @enfeizhan, this is the correct behavior. From the documentation:

        allowed_model_families (list(str, ModelFamily)): The model families to search. ... For default algorithm, this only applies to estimators in the non-naive batches.

The example you provided uses the default algorithm, so the naive batch containing the Random Forest Classifier still runs; that is the first batch in your output. The second batch is the first non-naive batch, and it includes only estimators from the linear model family. You can see that the allowed model families are retained in automl_with_ensembling.automl_algorithm.allowed_model_families.
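To compare the two attributes side by side, something like the following sketch might help. It reuses only the calls from the original report plus the automl_algorithm attribute mentioned above. The helper name inspect_allowed_families is hypothetical, the import is deferred so that evalml is needed only when the helper is called, and the values returned will depend on the installed version.

```python
def inspect_allowed_families(families=("linear_model",), max_batches=4):
    """Hypothetical helper: build the search from the report and return
    both views of the allowed_model_families setting."""
    import evalml  # deferred: required only when the helper is called

    X, y = evalml.demos.load_breast_cancer()
    automl = evalml.AutoMLSearch(
        X_train=X,
        y_train=y,
        problem_type="binary",
        allowed_model_families=list(families),
        max_batches=max_batches,
        ensembling=True,
    )
    return {
        # Attribute on the search object (reported as empty in this issue).
        "search": automl.allowed_model_families,
        # Attribute on the underlying algorithm, where the setting is kept.
        "algorithm": automl.automl_algorithm.allowed_model_families,
    }
```

Calling inspect_allowed_families() in an environment with evalml installed should return both values without running the search itself.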

Note that both Elastic Net and Logistic Regression are linear models.
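As a rough illustration of how the first two batches in the reported output break down, the sketch below groups the pipeline names by model family. It does not call evalml. The FAMILY_BY_ESTIMATOR lookup and the families_per_batch helper are assumptions written by hand for this example, and the family labels are illustrative strings rather than values read from the library.

```python
# Pipeline names copied from batches 1 and 2 of the reported output.
BATCHES = {
    1: [
        "Random Forest Classifier w/ Label Encoder + Imputer + RF Classifier Select From Model",
    ],
    2: [
        "Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler + Select Columns Transformer",
        "Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler + Select Columns Transformer",
    ],
}

# Assumption: hand-written lookup for illustration, not read from evalml.
FAMILY_BY_ESTIMATOR = {
    "Random Forest Classifier": "random_forest",
    "Elastic Net Classifier": "linear_model",
    "Logistic Regression Classifier": "linear_model",
}


def families_per_batch(batches):
    """Map each batch number to the sorted model families it contains."""
    result = {}
    for batch, names in batches.items():
        # The estimator name is the part of the pipeline name before " w/ ".
        estimators = [name.split(" w/ ")[0] for name in names]
        result[batch] = sorted({FAMILY_BY_ESTIMATOR[e] for e in estimators})
    return result


print(families_per_batch(BATCHES))
# {1: ['random_forest'], 2: ['linear_model']}
```

Under these assumptions, batch 1 is the naive batch and batch 2 contains only linear models, which matches the explanation above.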