Innixma opened this issue 3 months ago
Great that you are pushing for this!
> A major benefit of having this logic is that we can incorporate any strong and trusted result
True, another big use-case (at least for me) is being able to quickly see how a method performs on a wide range of datasets, even if the predictions are not included.
> Basic mode/Simulator mode
I agree it makes sense to have the option to report only metrics for ease of use. The names feel a bit disconnected from what the modes actually do; why not just call the first mode "metric-only" and make clear that ensemble simulations are only supported with model predictions?
> Users will need to define their model running code similar to how it is done in AutoMLBenchmark in the `exec.py` files for frameworks
This could be quite complicated for users. TabZilla and FTTransformer both provide an example of how to run a simple scikit-learn-like class; would it be possible to support something like this? I think it would make things much easier for users.
For instance, something like this (just to give the high-level idea):
```python
repo = ...
X_train, y_train, X_test = repo.get_Xy(dataset="Airlines", fold=0)
y_pred = CatBoost().fit(X_train, y_train).predict(X_test)
# output metrics that are comparable with repo.metrics(datasets=["Airlines"], configs=["CatBoost_r22_BAG_L1"], fold=0)
print(repo.evaluate(y_pred))
```
Can we have a simpler API like AutoGluon's, so it is easier to understand this new library?
@GDGauravDutta We are actively working on this, and a simpler API should be available within the next month.
Related: #55
We should add an interface for users to run a specific model on a specific dataset locally. This will help drive adoption of TabRepo for method papers that are introducing a new model and want to compare against other baselines, similar to how TabZilla is currently being used. The hope is that this feature will do a great deal to resolve the reproducibility / baseline consistency crisis for tabular method papers.
A major benefit of having this logic is that we can incorporate any strong and trusted result of a method into TabRepo's main EvaluationRepository. If someone runs a stronger configuration of a known method, we can either add it alongside the existing weaker results or replace the weaker results with the stronger ones, depending on what makes more sense. This way we can work to ensure each method in TabRepo is represented by its strongest configuration/search space/preprocessing/etc., greatly reducing the chance that methods are misrepresented in terms of their peak capabilities.
## Proposal
The fit logic should feature two modes: Basic mode and Simulator mode.
Basic mode doesn't require the user to generate out-of-fold predictions. The model will therefore not be compatible with TabRepo simulation, but it can still be compared to TabRepo results via test scores. It is important to have a basic mode so that users can avoid k-fold bagging if they don't want to do it. Basic mode should be very similar to what is done in AutoMLBenchmark.
Simulator mode will require the user to additionally produce out-of-fold predictions & probabilities for every row of the training data. We can provide templates to make this easy to do, such as relying on AutoGluon's k-fold bagging implementation or generic sklearn k-fold split. Simulator mode results will be fully compatible with TabRepo, and will allow for simulating ensembles of the user's method with prior TabRepo artifacts.
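To make concrete what Simulator mode asks of the user, here is a minimal sketch of producing out-of-fold prediction probabilities with a generic sklearn k-fold split; the dataset and model are placeholders, and none of this is TabRepo API:

```python
# Minimal sketch: produce out-of-fold (OOF) prediction probabilities with a
# plain sklearn k-fold split. Dataset and model are placeholders only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
n_classes = len(np.unique(y))

oof_proba = np.zeros((len(X), n_classes))
for train_idx, val_idx in StratifiedKFold(n_splits=8, shuffle=True, random_state=0).split(X, y):
    model = RandomForestClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # Every training row receives exactly one OOF prediction,
    # which is the artifact Simulator mode would require.
    oof_proba[val_idx] = model.predict_proba(X[val_idx])
```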
## Requirements
### Model Code

- Users will need to define their model running code similar to how it is done in AutoMLBenchmark in the `exec.py` files for frameworks. They should ensure that their model is lazy imported to avoid increasing the dependency requirements in TabRepo.
- `sklearn.utils.estimator_checks.check_estimator`: We should also check how TabZilla does this and whether we want to re-use any design patterns.
- Users could `pip install TabRepo` followed by `pip install MyTabRepoExtension` and use their model extension directly in TabRepo. This will help minimize TabRepo's maintenance burden by avoiding all method contributions being part of TabRepo's source code. We can move proven high-performing / important methods into main TabRepo when we deem it worthwhile. The code required for the extension library would be the model source code that would be run on a given task (essentially the AutoMLBenchmark `exec.py` and `setup.sh` files).
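As an illustration of the lazy-import requirement above, a user's model wrapper might defer importing its heavy dependency until fit is called; the wrapper class and its interface here are hypothetical, not TabRepo's actual extension interface:

```python
# Hypothetical model wrapper illustrating the lazy-import pattern:
# the heavy dependency (here, catboost) is only imported when fit()
# is actually called, so TabRepo itself does not need it installed.
class CatBoostModelWrapper:
    def __init__(self, **hyperparameters):
        self.hyperparameters = hyperparameters
        self.model = None

    def fit(self, X, y):
        from catboost import CatBoostClassifier  # lazy import
        self.model = CatBoostClassifier(**self.hyperparameters)
        self.model.fit(X, y)
        return self

    def predict_proba(self, X):
        return self.model.predict_proba(X)
```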
### Inputs

### Run Artifacts
The resulting artifact should be either an instance of `EvaluationRepository` or very similar to `EvaluationRepository`.
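Purely as an illustration of what such an artifact might need to carry per task (the field names below are assumptions, not TabRepo's schema), drawing on the Simulator mode and reproducibility requirements in this proposal:

```python
# Purely illustrative: these field names are assumptions about what a
# per-task run artifact might carry, not TabRepo's actual schema.
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class TaskRunArtifact:
    dataset: str                # e.g. "Airlines"
    fold: int
    config_name: str            # name under which the method appears in results
    test_metric_error: float
    time_train_s: float
    time_infer_s: float
    # Simulator mode only: out-of-fold and test prediction probabilities.
    oof_proba: Optional[np.ndarray] = None
    test_proba: Optional[np.ndarray] = None
    # Environment metadata (see "Ensuring reproducibility" below).
    metadata: dict = field(default_factory=dict)
```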
### General

### Simulator Mode

### Result Aggregation

### Parallelization / Distribution (Stretch)

### Ensuring reproducibility (Stretch)
- Store the `pip freeze` output as part of the run artifacts, along with various other information such as num_cpus, num_gpus, OS, date, Python version, memory size, disk size, etc. This would help improve reproducibility.
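A minimal, standard-library-only sketch of capturing such an environment snapshot (which exact fields to record is an open design choice; memory size and GPU count would likely need an extra dependency such as psutil or a call to nvidia-smi):

```python
# Minimal sketch of capturing an environment snapshot for reproducibility.
# Standard library only; the exact set of fields is an open design choice.
import json
import os
import platform
import shutil
import subprocess
import sys
from datetime import datetime, timezone


def collect_environment_snapshot() -> dict:
    pip_freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=False,
    ).stdout.splitlines()
    return {
        "pip_freeze": pip_freeze,
        "python_version": sys.version,
        "os": platform.platform(),
        "num_cpus": os.cpu_count(),
        "disk_free_bytes": shutil.disk_usage("/").free,
        "date_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    print(json.dumps(collect_environment_snapshot(), indent=2))
```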
## Evaluation

Users should be able to use the resulting `EvaluationRepository` object to generate a bunch of tables/plots/statistics on how their method performs vs. various baselines/simulated results/etc. For example, `repo.compare(my_model_benchmark_results_object)`.
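A rough sketch of how that proposed flow might look end to end; `repo.compare` and the results object are placeholders from this proposal, not existing API:

```python
# Rough sketch of the proposed evaluation flow. `repo.compare` and the
# results object are placeholders from this proposal, not existing API.
repo = ...  # an existing TabRepo EvaluationRepository
my_model_benchmark_results_object = ...  # artifacts produced by the new fit logic

# Proposed one-liner: tables/plots/statistics comparing the user's method
# against TabRepo baselines and simulated ensembles.
repo.compare(my_model_benchmark_results_object)
```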
## Open Questions