autogluon / tabrepo

Add TabRepo artifacts to HuggingFace #66

Open Innixma opened 2 months ago

Innixma commented 2 months ago

Add TabRepo artifacts to HuggingFace for faster downloads and improved visibility.

geoalgo commented 1 month ago

I took a look, and it seems it would be best to call snapshot_download from HF directly, which downloads the files in parallel and should be quite efficient.

One thing, though: the files listed in the context currently live on S3, and that listing would have to be removed. The way I was thinking to get similar behavior (being able to download only a subset of tasks) is to just whitelist the file names, basically calling snapshot_download with allow_patterns set to the list of desired datasets.
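
As a rough sketch, the call could look like this (the repo_id, pattern names, and local_dir here are placeholders, not the final layout):

from huggingface_hub import snapshot_download

# Hypothetical sketch: download only the whitelisted tasks.
# Files are fetched in parallel; max_workers tunes the concurrency.
snapshot_download(
    repo_id="autogluon/tabrepo",               # assumed HF dataset repo
    repo_type="dataset",
    allow_patterns=["adult/*", "airlines/*"],  # hypothetical task names
    local_dir="tabrepo_artifacts",
    max_workers=8,                             # huggingface_hub default
)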

Would this option work for you?

Innixma commented 1 month ago

@geoalgo Sounds reasonable. If you want, you can start with a toy subset of the data as a proof of concept (such as 3 of the smallest datasets). We can then iterate from there, and we can probably host the full artifact via AutoGluon's HuggingFace account once we confirm it works on the toy example.

Regarding partial downloads, whitelisting sounds good, but we will have to see how it works in practice.

Innixma commented 1 month ago

Based on the wording: "If provided, only files matching at least one pattern are downloaded."

Could we pass a list of patterns that are full file paths, so the behavior is identical to the current logic?
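
For instance, with hypothetical file paths, the whitelist could just be the exact relative paths:

# Hypothetical: exact relative paths used directly as patterns.
allow_patterns = [
    "adult/predictions.parquet",
    "airlines/predictions.parquet",
]

Since the patterns are glob-style, an exact path should match only itself, which would reproduce the current explicit file list.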

geoalgo commented 1 month ago

Thanks for your answers! I would be keen to be sure that this solution works for you before it's implemented, otherwise it's wasted effort :-)

> @geoalgo Sounds reasonable. If you want, you can start with a toy subset of the data as a proof of concept (such as 3 of the smallest datasets). We can then iterate from there, and we can probably host the full artifact via AutoGluon's HuggingFace account once we confirm it works on the toy example.

I do not think we need a proof of concept, as the HF hub is quite robust and it is as easy to host 3 datasets as all of them. The main thing I need is for you to add me to the AG organization so that I can write files there; alternatively, I can create a space just for this dataset.

> Based on the wording: "If provided, only files matching at least one pattern are downloaded." Could we pass a list of patterns that are full file paths, so the behavior is identical to the current logic?

I can give it a try to have exactly the same logic, but it seems to me that what we want is to download everything except the dataset predictions, which are heavy and have to be filtered. A download call could look like this, for instance:

from typing import List

from huggingface_hub import snapshot_download

def download_datasets(datasets: List[str]):
    # Each entry in `datasets` is used directly as an allow pattern
    # (an exact relative path or a glob), alongside the shared metadata files.
    allow_patterns = list(datasets) + ["baselines.parquet", "configs.parquet"]
    snapshot_download(
        repo_id="autogluon/tabrepo",
        repo_type="dataset",
        allow_patterns=allow_patterns,
        local_dir="local_path",
    )

This would download only the predictions whose datasets are in the desired context. As far as I can see, the behavior would be identical to the current one.
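
A hypothetical call (the dataset names are made up) would then be:

# Hypothetical usage: fetch two tasks' prediction files plus the
# shared baselines.parquet / configs.parquet metadata.
download_datasets(["adult/*", "airlines/*"])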

Innixma commented 1 month ago

Thanks for the response! All of this looks good.

Do you foresee any downsides to creating a new space such as tabrepo vs. using the autogluon space? I am unsure what the limitations are. If you want to go forward with the autogluon space, I can look into granting you write permissions.

geoalgo commented 1 month ago

Using AutoGluon would perhaps be cleaner given that the repository is located in the AG GitHub space, but I do not mind either way. I would also have to check whether I can create a space for tabrepo (I already created one for synetune; I'm not sure if I can easily create many). I will try and let you know!