alteryx / evalml

EvalML is an AutoML library written in Python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

AutoMLSearch: build API to access IDs of ensemble pipeline's input pipelines #3008

Closed dsherry closed 2 years ago

dsherry commented 2 years ago

Background

Ensemble models compute predictions from a group of input models, then apply a learning algorithm to combine those predictions into a prediction which is more accurate overall.

Each stacked ensembler pipeline built by our automl is constructed by taking a few pipelines from the automl leaderboard and building a graph in which each input pipeline's predictions are provided as inputs to the stacked ensembler component. The resulting graph contains a full copy of each input pipeline.

Proposal

Add an API to our automl search which allows users to look up the IDs of the pipelines used by a particular ensembler pipeline.

This would allow users to dig further into the details of each input pipeline if they want to understand the dynamics of their stacked ensembler better.

An initial sketch of the API design is below:

automl = AutoMLSearch(...)
automl.search(X, y)
# let's say we look at the rankings and see that an ensembler has ID 42
ensembler = automl.get_pipeline(42)
# we can now use this pipeline to compute predictions, scores, stats, model understanding, etc.
ensembler.fit(X, y)
ensembler.predict(X)
...

# and in addition to grabbing the ensembler pipeline in full, we can grab the list of IDs of the input pipelines
input_pipeline_ids = automl.get_ensembler_input_pipelines(42)
for pipeline_id in input_pipeline_ids:
    pipeline = automl.get_pipeline(pipeline_id)
    pipeline.fit(X, y)
    print(pipeline.feature_importance)
    ...

with pytest.raises(Exception):
    automl.get_ensembler_input_pipelines(7)  # raises an exception if the pipeline isn't an ensembler

@christopherbunn FYI

angela97lin commented 2 years ago

Just a heads up that this is already implemented privately and exposed via the automl results, but I agree that we should expose it through a cleaner, easier-to-use API.

In evalml/automl/automl_search.py:

        if pipeline.model_family == ModelFamily.ENSEMBLE:
            input_pipeline_ids = [
                self._automl_algorithm._best_pipeline_info[model_family]["id"]
                for model_family in self._automl_algorithm._best_pipeline_info
            ]
            self._results["pipeline_results"][pipeline_id][
                "input_pipeline_ids"
            ] = input_pipeline_ids
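
For illustration, here is a minimal sketch of how a public lookup could be layered on top of that results entry. The function name `get_ensembler_input_pipeline_ids` and the `fake_results` dict are hypothetical, made up for this example. The sketch assumes only the `"pipeline_results"` / `"input_pipeline_ids"` layout shown in the snippet above, and it treats a missing `"input_pipeline_ids"` key as "not an ensembler", which is also an assumption.

```python
def get_ensembler_input_pipeline_ids(results, pipeline_id):
    """Return the input pipeline IDs recorded for an ensembler pipeline.

    `results` is assumed to be a dict shaped like the automl results,
    i.e. results["pipeline_results"][pipeline_id]["input_pipeline_ids"].
    """
    pipeline_results = results["pipeline_results"]
    if pipeline_id not in pipeline_results:
        raise KeyError(f"No pipeline with ID {pipeline_id} in results")

    entry = pipeline_results[pipeline_id]
    # Assumption: only ensembler pipelines carry this key.
    if "input_pipeline_ids" not in entry:
        raise ValueError(f"Pipeline {pipeline_id} is not an ensembler")

    # Return a copy so callers cannot mutate the stored results.
    return list(entry["input_pipeline_ids"])


# Hypothetical stand-in for the automl results, for demonstration only.
fake_results = {
    "pipeline_results": {
        3: {"pipeline_name": "input pipeline A"},
        5: {"pipeline_name": "input pipeline B"},
        7: {"pipeline_name": "non-ensembler pipeline"},
        42: {"pipeline_name": "stacked ensembler", "input_pipeline_ids": [3, 5]},
    }
}

input_ids = get_ensembler_input_pipeline_ids(fake_results, 42)
```

With a real search object, the same function would be called on whatever dict the automl results expose, and each returned ID could then be passed to `automl.get_pipeline` as in the proposal above.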