automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

[Question] How are the results of initial_configurations_via_meta_learning used in Auto-Sklearn? #1553

Closed jmren168 closed 2 years ago

jmren168 commented 2 years ago

Hi,

After reading "Initializing Bayesian Hyperparameter Optimization via Meta-Learning", I have a question about how initial_configurations_via_metalearning works in auto-sklearn, and I hope someone could give me some hints. Many thanks.

When I enable initial_configurations_via_meta_learning to train a dataset D_n+1, and auto-sklearn finds that the most similar dataset is D_J with a specific configuration theta_D_J, how is this result used to drive auto-sklearn's choice of an initial configuration for dataset D_n+1?

  case 1. theta_D_J is directly used as an initial configuration for dataset D_n+1.
  case 2. Dataset D_J is used to search for another configuration (say theta_D_J_new) without limiting models, preprocessors, etc., and then theta_D_J_new is used as the initial configuration for dataset D_n+1.

If the answer is case 1, something does not add up: the above paper mentions that only 3 classifiers (a linear SVM, an RBF SVM, and an RF) are used in meta-learning, yet I found that the model type chosen for dataset D_n+1 is mlp (see the two runs below). Any comments are highly appreciated.

| model_id | rank | ensemble_weight | type | cost | duration | config_id | train_loss | seed | start_time | end_time | budget | status | data_preprocessors | feature_preprocessors | balancing_strategy | config_origin |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 10 | 48 | 0.00 | mlp | 0.236979 | 7.935055 | 9 | 0.218424 | 0 | 1.659342e+09 | 1.659342e+09 | 0.0 | StatusType.SUCCESS | [] | [feature_agglomeration] | none | Initial design |
| 6 | 79 | 0.00 | mlp | 0.244792 | 53.374027 | 5 | 0.175456 | 0 | 1.659342e+09 | 1.659342e+09 | 0.0 | StatusType.SUCCESS | [] | [feature_agglomeration] | weighting | Initial design |
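A table like this can be printed with the detailed leaderboard, which is presumably where it came from (an assumption; the post does not say how it was produced):

print(automl.leaderboard(detailed=True, ensemble_only=False))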
Louquinze commented 2 years ago

Hi, to begin with, I'm just a student assistant working on this project, so don't take this answer as final.

But as I see it, it works like this:

initial_configurations_via_metalearning: int = 25 is an integer value and sets the prior for the hyperparameter optimization. Later in the code we then choose that many values out of all possible meta-learning configurations.
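For context, a minimal usage sketch of where that value is set (the dataset, split, and time budget are just illustrative):

import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    # Seed the search with the best stored configurations of the 25
    # nearest datasets in meta-feature space; 0 disables meta-learning.
    initial_configurations_via_metalearning=25,
)
automl.fit(X_train, y_train)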

def suggest_via_metalearning(
    meta_base, dataset_name, metric, task, sparse, num_initial_configurations, logger
):
    # Multilabel tasks fall back to the multiclass meta-data.
    if task == MULTILABEL_CLASSIFICATION:
        task = MULTICLASS_CLASSIFICATION

    task = TASK_TYPES_TO_STRING[task]

    logger.info(task)

    start = time.time()
    ml = MetaLearningOptimizer(
        dataset_name=dataset_name,
        configuration_space=meta_base.configuration_space,
        meta_base=meta_base,
        distance="l1",  # L1 distance in meta-feature space
        seed=1,
        logger=logger,
    )
    logger.info("Reading meta-data took %5.2f seconds", time.time() - start)
    # One suggestion per stored dataset, ordered by distance to the target;
    # keep only the first num_initial_configurations of them.
    runs = ml.metalearning_suggest_all(exclude_double_configurations=True)
    return runs[:num_initial_configurations]

These configurations are used as metalearning_configurations in SMAC (https://automl.github.io/SMAC3/master/). As far as I know, SMAC does not limit the given search space, so I think it is like your case 2.

autosklearn/smbo.py:526

smac_args = {
    "scenario_dict": scenario_dict,
    "seed": seed,
    "ta": ta,
    "ta_kwargs": ta_kwargs,
    "metalearning_configurations": metalearning_configurations,
    "n_jobs": self.n_jobs,
    "dask_client": self.dask_client,
}
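Conceptually, those metalearning_configurations are simply the first points SMAC evaluates before its own model-based proposals take over. A minimal sketch of that idea, not the actual SMAC internals (evaluate and propose_next are hypothetical placeholders):

def run_smbo(metalearning_configurations, propose_next, evaluate, n_iterations):
    """Evaluate the meta-learned configurations first, then continue with BO."""
    history = []  # (configuration, observed loss) pairs

    # Warm start: the meta-learned configurations form the initial design.
    for config in metalearning_configurations:
        history.append((config, evaluate(config)))

    # Bayesian optimization proper, seeded with the warm-start observations.
    for _ in range(n_iterations):
        config = propose_next(history)  # surrogate model fit on history
        history.append((config, evaluate(config)))

    return min(history, key=lambda pair: pair[1])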
jmren168 commented 2 years ago

Hi @Louquinze ,

Thank you for the comments.

After tracing the code of metalearning_suggest_all, _learn, and kBestSuggestions, it looks like meta-learning selects the K datasets nearest to the target dataset and then returns the K best suggestions. If so, meta-learning does not work as described in the paper I mentioned in my previous post (which uses two SVMs and an RF to select initial configs).

Also, meta-learning returns runs[:num_initial_configurations], and each element in runs is composed of (dataset_name, distance, best_configuration). Here best_configuration is picked from runs according to the distance. But:

  1. How is the best configuration in runs created? Which config space is used?
  2. How is the best configuration used to drive Auto-sklearn to design a configuration for the target dataset? Case 1: directly use the best configuration. Case 2: use the dataset_name to find another best configuration.

Any comments are appreciated.

metalearning/kNearestDatasets/kND.py 51-62

        # for each dataset, sort the runs according to their result
        best_configuration_per_dataset = {}
        for dataset_name in runs:
            if not np.isfinite(runs[dataset_name]).any():
                best_configuration_per_dataset[dataset_name] = None
            else:
                configuration_idx = runs[dataset_name].index[
                    np.nanargmin(runs[dataset_name].values)
                ]
                best_configuration_per_dataset[dataset_name] = configuration_idx

        self.best_configuration_per_dataset = best_configuration_per_dataset
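To make the snippet above concrete, here is the same pick-the-argmin-per-dataset logic on a toy runs table (all values invented):

import numpy as np
import pandas as pd

# Rows are configuration indices, columns are datasets, values are losses.
runs = pd.DataFrame(
    {"dataset_a": [0.30, 0.25, np.nan], "dataset_b": [np.nan, np.nan, np.nan]},
    index=["config_0", "config_1", "config_2"],
)

best = {}
for name in runs:
    if not np.isfinite(runs[name]).any():
        best[name] = None  # no finite result recorded for this dataset
    else:
        best[name] = runs[name].index[np.nanargmin(runs[name].values)]

print(best)  # {'dataset_a': 'config_1', 'dataset_b': None}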

metalearn_optimizer/metalearner.py 88

def _learn(self, exclude_double_configurations=True):
        dataset_metafeatures, all_other_metafeatures = self._split_metafeature_array()

        # Remove metafeatures which could not be calculated for the target
        # dataset
        keep = []
        for idx in dataset_metafeatures.index:
            if np.isfinite(dataset_metafeatures.loc[idx]):
                keep.append(idx)

        dataset_metafeatures = dataset_metafeatures.loc[keep]
        all_other_metafeatures = all_other_metafeatures.loc[:, keep]

        # Do mean imputation of all other metafeatures
        all_other_metafeatures = all_other_metafeatures.fillna(
            all_other_metafeatures.mean()
        )

        if self.kND is None:
            # In case that we learn our distance function, get the parameters
            # for the random forest
            if self.distance_kwargs:
                rf_params = ast.literal_eval(self.distance_kwargs)
            else:
                rf_params = None

            # To keep the distance the same in every iteration, we create a new
            # random state
            random_state = sklearn.utils.check_random_state(self.seed)
            kND = KNearestDatasets(
                metric=self.distance,
                random_state=random_state,
                logger=self.logger,
                metric_params=rf_params,
            )

            runs = dict()
            # TODO move this code to the metabase
            for task_id in all_other_metafeatures.index:
                try:
                    runs[task_id] = self.meta_base.get_runs(task_id)
                except KeyError:
                    # TODO should I really except this?
                    self.logger.info("Could not find runs for instance %s" % task_id)
                    runs[task_id] = pd.Series([], name=task_id, dtype=np.float64)

            runs = pd.DataFrame(runs)

            kND.fit(all_other_metafeatures, runs)
            self.kND = kND
        return self.kND.kBestSuggestions(
            dataset_metafeatures,
            k=-1,
            exclude_double_configurations=exclude_double_configurations,
        )
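One detail from _learn worth highlighting: meta-features that are missing for a stored dataset are mean-imputed before distances are computed. A toy illustration (values invented):

import numpy as np
import pandas as pd

metafeatures = pd.DataFrame(
    {"n_features": [10.0, np.nan, 12.0], "class_entropy": [0.9, 2.1, np.nan]},
    index=["dataset_a", "dataset_b", "dataset_c"],
)

# fillna with the per-column means, exactly as _learn does above.
imputed = metafeatures.fillna(metafeatures.mean())
print(imputed.loc["dataset_b", "n_features"])  # 11.0, the mean of 10 and 12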

metalearning/kNearestDatasets/kND.py 137

def kBestSuggestions(self, x, k=1, exclude_double_configurations=True):
        assert type(x) == pd.Series
        if k < -1 or k == 0:
            raise ValueError("Number of neighbors k cannot be zero or negative.")
        nearest_datasets, distances = self.kNearestDatasets(x, -1, return_distance=True)

        kbest = []

        added_configurations = set()
        for dataset_name, distance in zip(nearest_datasets, distances):
            best_configuration = self.best_configuration_per_dataset[dataset_name]

            if best_configuration is None:
                self.logger.info(
                    "Found no best configuration for instance %s" % dataset_name
                )
                continue

            if exclude_double_configurations:
                if best_configuration not in added_configurations:
                    added_configurations.add(best_configuration)
                    kbest.append((dataset_name, distance, best_configuration))
            else:
                kbest.append((dataset_name, distance, best_configuration))

            if k != -1 and len(kbest) >= k:
                break

        if k == -1:
            k = len(kbest)
        return kbest[:k]
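Putting _learn and kBestSuggestions together, here is a toy illustration of the ranking step (meta-feature values invented; the real code goes through KNearestDatasets, which may additionally normalize the meta-features):

import pandas as pd

# Meta-features of previously evaluated datasets (values invented).
stored = pd.DataFrame(
    {"n_features": [10.0, 300.0, 14.0], "class_entropy": [0.9, 2.1, 1.0]},
    index=["dataset_a", "dataset_b", "dataset_c"],
)
# Meta-features of the new target dataset.
target = pd.Series({"n_features": 11.0, "class_entropy": 0.95})

# L1 distance in meta-feature space, ranked ascending.
distances = (stored - target).abs().sum(axis=1).sort_values()
print(distances.index.tolist())  # ['dataset_a', 'dataset_c', 'dataset_b']

# kBestSuggestions walks this ranking and emits, per dataset, the tuple
# (dataset_name, distance, best_configuration_per_dataset[dataset_name]).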
jmren168 commented 2 years ago

Hi @Louquinze

I think I made a mistake: I should have read the original paper of Auto-sklearn: "Efficient and Robust Automated Machine Learning", NeurIPS, 2015.

In this paper, the authors mention: "We exploit this complementarity by selecting k configurations based on meta-learning and use their result to seed Bayesian optimization...

  1. for each machine learning dataset in a dataset repository (in our case 140 datasets from the OpenML [18] repository), we evaluated a set of meta-features (described below)
  2. used Bayesian optimization to determine and store an instantiation of the given ML framework with strong empirical performance for that dataset.
  3. given a new dataset D, we compute its meta-features, rank all datasets by their L1 distance to D in meta-feature space and select the stored ML framework instantiations for the k = 25 nearest datasets for evaluation before starting Bayesian optimization with their results."

This procedure matches what we found: find the datasets most similar to the target dataset D, then directly use their best configurations to seed BO.

Any comments are appreciated.

Louquinze commented 2 years ago

Hi @jmren168,

I think that is basically what the meta-learning is doing. I will ask someone else to confirm this.

eddiebergman commented 2 years ago

Hi @jmren168,

The meta-learning you originally referred to is specific to Auto-sklearn 2, from what I know, and you're correct that the original Auto-sklearn paper describes the meta-learning that is used in general.

These seeded runs are essentially the first configurations to be tried for a given new dataset so that we start searching from somewhere "reasonable".

Best, Eddie

jmren168 commented 2 years ago

Hi @eddiebergman and @Louquinze ,

Thank you for the reply; I have no more questions now, so I'll close this issue. Thanks again :)