Innixma opened this issue 3 months ago
Great that you are pushing for this!
> A major benefit of having this logic is that we can incorporate any strong and trusted result
True, another big use-case (at least for me) is being able to quickly see how a method performs on a wide range of datasets, even if the predictions are not included.
> Basic mode/Simulator mode
I agree it makes sense to have the option to report only metrics for ease of use. The names feel a bit disconnected from what the modes actually do; why not just call the first mode "metric-only" and make clear that ensemble simulations are only supported with model predictions?
> Users will need to define their model running code similar to how it is done in AutoMLBenchmark in the `exec.py` files for frameworks
This could be quite complicated for users. TabZilla and FTTransformer both provide an example of how to run a simple scikit-learn-like class; would it be possible to support something like this? I think it would make things much easier for users.
For instance, something like this (just to give the high-level idea):
```python
repo = ...
X_train, y_train, X_test = repo.get_Xy(dataset="Airlines", fold=0)
y_pred = CatBoost().fit(X_train, y_train).predict(X_test)
# output metrics that are comparable with repo.metrics(datasets=["Airlines"], configs=["CatBoost_r22_BAG_L1"], fold=0)
print(repo.evaluate(y_pred))
```
Can we have a simpler API like AutoGluon's, so it is easier to understand this new library?
@GDGauravDutta We are actively working on this, and a simpler API should be available within the next month.
Related: #55
We should add an interface for users to run a specific model on a specific dataset locally. This will help drive adoption of TabRepo for method papers that are introducing a new model and want to compare against other baselines, similar to how TabZilla is currently being used. The hope is that this feature will do a great deal to resolve the reproducibility / baseline consistency crisis for tabular method papers.
A major benefit of having this logic is that we can incorporate any strong and trusted result of a method into TabRepo's main EvaluationRepository. If someone runs a stronger configuration of a known method, we can either add it alongside the existing weaker results or replace the weaker results with the stronger ones, depending on what makes more sense. This way we can work to ensure each method in TabRepo is represented by its strongest configuration/search space/preprocessing/etc., greatly reducing the chance that methods are misrepresented in terms of their peak capabilities.
## Proposal
The fit logic should feature two modes: Basic mode and Simulator mode.
Basic mode doesn't require the user to generate out-of-fold predictions. The model will therefore not be compatible with TabRepo simulation, but it can still be compared to TabRepo results via test scores. It is important to have a basic mode so that users can avoid k-fold bagging if they don't want to do it. Basic mode should be very similar to what is done in AutoMLBenchmark.
Simulator mode will require the user to additionally produce out-of-fold predictions & probabilities for every row of the training data. We can provide templates to make this easy to do, such as relying on AutoGluon's k-fold bagging implementation or generic sklearn k-fold split. Simulator mode results will be fully compatible with TabRepo, and will allow for simulating ensembles of the user's method with prior TabRepo artifacts.
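To make concrete what Simulator mode asks of the user, here is a minimal sketch of producing out-of-fold prediction probabilities with a generic sklearn k-fold split; the dataset and model are placeholders, and none of this is TabRepo API:

```python
# Minimal sketch: produce out-of-fold (OOF) prediction probabilities with a
# plain sklearn k-fold split. Dataset and model are placeholders only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
n_classes = len(np.unique(y))

oof_proba = np.zeros((len(X), n_classes))
for train_idx, val_idx in StratifiedKFold(n_splits=8, shuffle=True, random_state=0).split(X, y):
    model = RandomForestClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # Every training row receives exactly one OOF prediction,
    # which is the artifact Simulator mode would require.
    oof_proba[val_idx] = model.predict_proba(X[val_idx])
```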
## Requirements
### Model Code

- Users will need to define their model running code similar to how it is done in AutoMLBenchmark in the `exec.py` files for frameworks. They should ensure that their model is lazy imported to avoid increasing the dependency requirements in TabRepo.
- `sklearn.utils.estimator_checks.check_estimator`: We should also check how TabZilla does this and whether we want to re-use any design patterns.
- Users could `pip install TabRepo` followed by `pip install MyTabRepoExtension` and use their model extension directly in TabRepo. This will help minimize TabRepo's maintenance burden by avoiding all method contributions being part of TabRepo's source code. We can move proven high-performing / important methods into main TabRepo when we deem it worthwhile. The code required for the extension library would be the model source code that would be run on a given task (essentially the AutoMLBenchmark `exec.py` and `setup.sh` files).
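As an illustration of the lazy-import requirement above, a user's model wrapper might defer importing its heavy dependency until fit is called; the wrapper class and its interface here are hypothetical, not TabRepo's actual extension interface:

```python
# Hypothetical model wrapper illustrating the lazy-import pattern:
# the heavy dependency (here, catboost) is only imported when fit()
# is actually called, so TabRepo itself does not need it installed.
class CatBoostModelWrapper:
    def __init__(self, **hyperparameters):
        self.hyperparameters = hyperparameters
        self.model = None

    def fit(self, X, y):
        from catboost import CatBoostClassifier  # lazy import
        self.model = CatBoostClassifier(**self.hyperparameters)
        self.model.fit(X, y)
        return self

    def predict_proba(self, X):
        return self.model.predict_proba(X)
```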
### Inputs

### Run Artifacts
The resulting artifact should be either an instance of `EvaluationRepository` or very similar to `EvaluationRepository`.
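Purely as an illustration of what such an artifact might need to carry per task (the field names below are assumptions, not TabRepo's schema), drawing on the Simulator mode and reproducibility requirements in this proposal:

```python
# Purely illustrative: these field names are assumptions about what a
# per-task run artifact might carry, not TabRepo's actual schema.
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class TaskRunArtifact:
    dataset: str                # e.g. "Airlines"
    fold: int
    config_name: str            # name under which the method appears in results
    test_metric_error: float
    time_train_s: float
    time_infer_s: float
    # Simulator mode only: out-of-fold and test prediction probabilities.
    oof_proba: Optional[np.ndarray] = None
    test_proba: Optional[np.ndarray] = None
    # Environment metadata (see "Ensuring reproducibility" below).
    metadata: dict = field(default_factory=dict)
```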
### General

### Simulator Mode

### Result Aggregation

### Parallelization / Distribution (Stretch)

### Ensuring reproducibility (Stretch)
- Store the `pip freeze` output as part of the run artifacts, along with various other information such as num_cpus, num_gpus, OS, date, Python version, memory size, disk size, etc. This would help improve reproducibility.
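A minimal, standard-library-only sketch of capturing such an environment snapshot (which exact fields to record is an open design choice; memory size and GPU count would likely need an extra dependency such as psutil or a call to nvidia-smi):

```python
# Minimal sketch of capturing an environment snapshot for reproducibility.
# Standard library only; the exact set of fields is an open design choice.
import json
import os
import platform
import shutil
import subprocess
import sys
from datetime import datetime, timezone


def collect_environment_snapshot() -> dict:
    pip_freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=False,
    ).stdout.splitlines()
    return {
        "pip_freeze": pip_freeze,
        "python_version": sys.version,
        "os": platform.platform(),
        "num_cpus": os.cpu_count(),
        "disk_free_bytes": shutil.disk_usage("/").free,
        "date_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    print(json.dumps(collect_environment_snapshot(), indent=2))
```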
## Evaluation

Users should be able to use the resulting `EvaluationRepository` object to generate a bunch of tables/plots/statistics on how their method performs vs. various baselines/simulated results/etc. For example, `repo.compare(my_model_benchmark_results_object)`.
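A rough sketch of how that proposed flow might look end to end; `repo.compare` and the results object are placeholders from this proposal, not existing API:

```python
# Rough sketch of the proposed evaluation flow. `repo.compare` and the
# results object are placeholders from this proposal, not existing API.
repo = ...  # an existing TabRepo EvaluationRepository
my_model_benchmark_results_object = ...  # artifacts produced by the new fit logic

# Proposed one-liner: tables/plots/statistics comparing the user's method
# against TabRepo baselines and simulated ensembles.
repo.compare(my_model_benchmark_results_object)
```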
## Open Questions