Use reference dataset on the Hugging Face Hub

GEM-benchmark / GEM-metrics

Automatic metrics for GEM tasks

https://gem-benchmark.com

MIT License


Use reference dataset on the Hugging Face Hub #73

Closed lewtun closed 2 years ago

lewtun commented 2 years ago

This PR updates the URL for the reference datasets to point to the GEM/references repository on the Hugging Face Hub.

As discussed with @sebastianGehrmann, this new URL will act as the central source of truth for the GEM references going forward.

Note that the GEM/references repository is public (i.e. visible to anyone). If we want to make it private, we'll need to include an API token in the HTTP call to access each of the datasets.
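To make the token point concrete, here is a minimal sketch of how such an HTTP request could be built. The helper name `build_reference_request`, the `main` default revision, and the use of the Hub's `resolve` URL pattern with a `Bearer` header are assumptions for illustration, not the code in this PR.

```python
import urllib.request

# Base URL for dataset repositories on the Hugging Face Hub.
HUB_BASE = "https://huggingface.co/datasets"


def build_reference_request(filename, repo_id="GEM/references",
                            revision="main", token=None):
    """Build (but do not send) a request for one reference file.

    `build_reference_request` is a hypothetical helper; `token` is only
    needed if the repository is made private.
    """
    url = f"{HUB_BASE}/{repo_id}/resolve/{revision}/{filename}"
    headers = {}
    if token is not None:
        # Assumption: the Hub accepts the token as a Bearer credential.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)
```

While the repository stays public, `token` can be left as `None` and the request carries no credentials.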

Before merging we should check:

[x] All datasets have been successfully replicated from GitHub to the Hugging Face Hub.
[x] All Hugging Face Hub datasets have been converted to JSON and validated against the original GitHub files. See here for a script that was used to convert and validate the MLSUM references.
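For readers without access to the linked script, the validation step could look roughly like the sketch below. It assumes both copies are plain JSON files on disk and that "validated" means the parsed contents are equal; the helper name `references_match` is hypothetical and this is not the script referenced above.

```python
import json


def references_match(original_path, converted_path):
    """Return True if two JSON files parse to equal Python objects.

    Hypothetical helper: compares parsed content, so differences in
    whitespace or key order between the two files are ignored.
    """
    with open(original_path, encoding="utf-8") as f:
        original = json.load(f)
    with open(converted_path, encoding="utf-8") as f:
        converted = json.load(f)
    return original == converted
```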

tuetschek commented 2 years ago

Sounds good! Does this mean there will be specific GEM versions of the datasets on the HF hub, or will they be connected to the default versions on there?

lewtun commented 2 years ago

> Sounds good! Does this mean there will be specific GEM versions of the datasets on the HF hub, or will they be connected to the default versions on there?

Yes, my understanding is that we'll eventually deprecate all the datasets that are currently hosted as a release on this repository in favour of having everything under the GEM organisation on the Hugging Face Hub.

For example, the MLSUM datasets are now replicated on the GEM/references repository and the plan would be to do the same for the remaining datasets.

By "versions" are you just referring to the dataset contents or do we need to keep track of specific revisions (e.g. a Git commit SHA or tag)? If the latter, one option would be to use the load_dataset() function of the datasets library (docs), which allows one to load a file at a given SHA / tag.
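As a rough sketch of that option, pinning a revision could look like the following. The wrapper `load_references` is hypothetical, and whether the reference JSON files parse cleanly through the generic JSON loader is an assumption that has not been checked here.

```python
def load_references(filename, revision="main", repo_id="GEM/references"):
    """Load one reference file from the Hub at a pinned revision.

    Hypothetical wrapper around `datasets.load_dataset`; `revision` may
    be a branch name, a tag, or a commit SHA.
    """
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, data_files=filename, revision=revision)
```

Passing a tag or commit SHA as `revision` keeps the loaded references fixed even if the repository is updated later.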

tuetschek commented 2 years ago

> Yes, my understanding is that we'll eventually deprecate all the datasets that are currently hosted as a release on this repository in favour of having everything under the GEM organisation on the Hugging Face Hub.

Sounds good to me!

> By "versions" are you just referring to the dataset contents or do we need to keep track of specific revisions (e.g. a Git commit SHA or tag)? If the latter, one option would be to use the load_dataset() function of the datasets library (docs), which allows one to load a file at a given SHA / tag.

I meant whether e.g. our version hosted as GEM/references/e2e_nlg_test.json would be somehow connected to the HF version of the same dataset. It now looks like these will stay separate?

lewtun commented 2 years ago

> It now looks like these will stay separate?

Yes, that's my understanding. Since most of the reference datasets are public, we can derive files like GEM/references/e2e_nlg_test.json directly from their corresponding dataset GEM/e2e_nlg (this particular dataset still needs to be converted from the old gem dataset).

sebastianGehrmann commented 2 years ago

Hey @lewtun, do we have all the necessary datasets on the Hub now so we can merge?

lewtun commented 2 years ago

> Hey @lewtun, do we have all the necessary datasets on the Hub now so we can merge?

Yep, this should be good to merge!