bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Create a dataset loader for GAD corpus #608

Closed ruisi-su closed 2 years ago

ruisi-su commented 2 years ago

Adding a Dataset

tmabraham commented 2 years ago

FYI - the issue you linked clearly highlights that this is a weakly labeled dataset with questionable labels.

The author of the paper also said:

To conclude, I agree that the GAD and EUADR datasets are weakly supervised (distant supervision) datasets. And since we now have multiple high-quality BioRE datasets, I personally suggest that we need to refrain from using weakly labeled datasets and move to use other datasets such as ChemProt, DrugProt, or other human-labeled datasets for evaluating BioLMs.

ruisi-su commented 2 years ago

Thanks @tmabraham for looking into this. I remember GAD's labels causing confusion when I was tracking down this dataset. This issue was created to stay consistent with the BLURB benchmark. However, I think your point (along with others' concerns about this dataset) is very valid. We will discuss and get back to you on this!

jason-fries commented 2 years ago

Actually @ruisi-su @tmabraham, can we keep this as high priority for implementation? The only valid reason to deprioritize a dataset used in a standard benchmark is if that dataset isn’t public. More generally, as a research question, we’re interested in models trained on labels of different provenance (e.g., weakly supervised) to measure performance tradeoffs. From this perspective, datasets like these are quite valuable.

SamuelCahyawijaya commented 2 years ago

self-assign