autogluon / tabrepo

Apache License 2.0
27 stars 7 forks source link

Refactor ZeroshotSimulatorContext #6

Open Innixma opened 1 year ago

Innixma commented 1 year ago

ZeroshotSimulatorContext was written early on in the codebase and is hacky. It would be good to revisit and clean up the class, especially in terms of variable naming consistency with the rest of the package and more elegantly incorporating zpp/gt objects.

Most critically, the following code in init is very confusing, and we should make the most of property functions to minimize confusion and code complexity:

        self.df_results_by_dataset_vs_automl, \
        self.df_raw, \
        self.dataset_name_to_tid_dict, \
        self.dataset_to_tid_dict, \
        self.dataset_name_to_fold_dict, \
        self.unique_dataset_folds, \
        self.unique_datasets, \
        self.rank_scorer_vs_automl = self.align_valid_folds(
            df_results_by_dataset=df_results_by_dataset,
            df_results_by_dataset_automl=df_results_by_dataset_automl,
            df_raw=df_raw,
            folds=folds,
            score_against_only_automl=self.score_against_only_automl,
            pct=self.pct,
        )
        self.dataset_parent_to_fold_map = self._compute_dataset_parent_to_fold_map()

        tmp = self.df_results_by_dataset_vs_automl[['dataset', 'tid', 'problem_type']]
        self.dataset_to_problem_type_dict = tmp[['dataset', 'problem_type']].drop_duplicates().set_index(
            'dataset').squeeze().to_dict()
        self.tid_to_problem_type_dict = tmp[['tid', 'problem_type']].drop_duplicates().set_index(
            'tid').squeeze().to_dict()

For example:

  1. self.unique_dataset_folds can simply be self.tasks as a property
  2. self.unique_datasets can simply be self.datasets as a property
  3. self.dataset_to_tid_dict can be simplified to a method call
  4. ditto for self.dataset_name_to_fold_dict, which should be named self.task_to_fold_dict