autogluon / tabrepo

Apache License 2.0

TabRepo 2.0 Feature Tracker #63

Open Innixma opened 1 month ago

Innixma commented 1 month ago

For TabRepo 2.0, several quality-of-life changes should be made for ease of use. This list will evolve over time.

P0 (Critical)

P1

P2 (Nice-to-have)

P3

geoalgo commented 1 month ago

Another thing that has been on my radar for some time is to have TabRepo on Hugging Face. It would speed up downloads by ~8x (downloading is very slow from outside) and would make the dataset more visible.

geoalgo commented 1 month ago

Having an example or an API that allows one to "join" two repositories would also be quite useful. One could do:


from tabrepo import load_repository
from tabrepo.utils import merge_repositories
repo = load_repository("D244_F3_C1530_30")
repo_with_new_method = load_repository("D244_F3_C1530_30")
# Proposed API: builds a repository from the two, keeping only models that appear in all tasks.
# Underneath, each lookup just calls the repo that contains the given model.
repo_union = merge_repositories([repo, repo_with_new_method], force_dense=True)
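
To make the intended `force_dense` behaviour concrete, here is a minimal sketch on toy data. It assumes a repository can be modelled as a plain dict mapping `(dataset, fold)` to per-model errors; `ToyRepo` and `merge_toy_repositories` are hypothetical names used only for illustration and are not part of TabRepo, and the numbers are made up.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a repository: maps (dataset, fold) -> {model name: test error}.
# This is NOT the TabRepo data structure, only an illustration of the merge semantics.
Task = Tuple[str, int]
ToyRepo = Dict[Task, Dict[str, float]]


def merge_toy_repositories(repos: List[ToyRepo], force_dense: bool = True) -> ToyRepo:
    """Union the results of several toy repositories.

    With force_dense=True, only models that have a result on every task of the
    merged repository are kept, so the result has no missing entries.
    """
    merged: ToyRepo = {}
    for repo in repos:
        for task, results in repo.items():
            # On a model-name clash, later repositories overwrite earlier ones.
            merged.setdefault(task, {}).update(results)

    if force_dense and merged:
        # Models present on all tasks = intersection of the per-task model sets.
        dense_models = set.intersection(*(set(r) for r in merged.values()))
        merged = {
            task: {m: v for m, v in results.items() if m in dense_models}
            for task, results in merged.items()
        }
    return merged


repo = {
    ("airplane", 0): {"LightGBM": 0.10},
    ("airplane", 1): {"LightGBM": 0.12},
}
repo_with_new_method = {
    ("airplane", 0): {"NewMethod": 0.08, "PartialMethod": 0.20},
    ("airplane", 1): {"NewMethod": 0.10},
}

repo_union = merge_toy_repositories([repo, repo_with_new_method], force_dense=True)
print(repo_union)
```

In this sketch `PartialMethod` is dropped because it has no result on `("airplane", 1)`, while `LightGBM` and `NewMethod` survive on both tasks. Whether a dense merge should drop incomplete models or incomplete tasks is a design choice; the sketch follows the comment above and drops models.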

Innixma commented 1 month ago

@geoalgo Yes, repo joining is something I plan to implement. I added a tracking GitHub issue: #65

geoalgo commented 1 month ago

Sorry for the delay again :-) The list you made sounds great!

One thing I want to mention that I think could be quite useful is adding a way to recover the original and transformed features from OpenML.

Something like this:

df, y = repo.openml_dataframe(dataset="airplane", fold=2) # gets the raw columns from the dataset
X, y = repo.openml_transformed_features(dataset="airplane", fold=2)  # gets the features as provided to the model

This would allow using TabRepo to train TabPFN models (probably at larger scales than what they currently use). It would also make it easier to train new models and add them to TabRepo.
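
As a rough illustration of the difference between the two proposed accessors, the sketch below uses a tiny in-memory dataset. Everything in it is an assumption: the rows, the round-robin fold assignment, the ordinal encoding, and the function names `raw_dataframe` and `transformed_features`, which only mirror the proposed `repo.openml_dataframe` and `repo.openml_transformed_features` and do not exist in TabRepo.

```python
from typing import Dict, List, Tuple

# Hypothetical raw dataset: rows with a categorical column, a numeric column and a label.
RAW_ROWS: List[Dict[str, object]] = [
    {"airline": "A", "delay_min": 5.0, "label": 0},
    {"airline": "B", "delay_min": 42.0, "label": 1},
    {"airline": "C", "delay_min": 17.0, "label": 1},
    {"airline": "A", "delay_min": 0.0, "label": 0},
    {"airline": "B", "delay_min": 30.0, "label": 1},
    {"airline": "C", "delay_min": 3.0, "label": 0},
]
N_FOLDS = 3  # assumed: row i is held out on fold i % N_FOLDS


def raw_dataframe(fold: int) -> Tuple[List[Dict[str, object]], List[object]]:
    """Raw training columns and labels for a fold (analogue of openml_dataframe)."""
    rows = [row for i, row in enumerate(RAW_ROWS) if i % N_FOLDS != fold]
    df = [{k: v for k, v in row.items() if k != "label"} for row in rows]
    y = [row["label"] for row in rows]
    return df, y


def transformed_features(fold: int) -> Tuple[List[List[float]], List[object]]:
    """Numeric training features for a fold (analogue of openml_transformed_features).

    Ordinal encoding of the categorical column is an assumed preprocessing step,
    fitted on the training rows of the fold only.
    """
    df, y = raw_dataframe(fold)
    categories = sorted({str(row["airline"]) for row in df})
    code = {c: float(i) for i, c in enumerate(categories)}
    X = [[code[str(row["airline"])], float(row["delay_min"])] for row in df]
    return X, y


df, y = raw_dataframe(fold=2)        # raw columns, still containing strings
X, y = transformed_features(fold=2)  # purely numeric matrix, as a model would see it
print(df[0], X[0], y)
```

The raw view keeps the original column names and types, which is what a model with its own preprocessing would want, while the transformed view is what a new model would need in order to be trained on exactly the same inputs as the existing ones.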