bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Drugprot dataset misses `test_background` split #927

Closed: kai-car closed this issue 1 month ago

kai-car commented 1 month ago

The drugprot dataset loading script `drugprot.py` currently includes only the train and validation splits. However, the original dataset also provides a `test_background` split (see the link below). The `test_background` split consists of the 750 abstracts of the test set plus the 10,000 abstracts of the background set.

https://biocreative.bioinformatics.udel.edu/media/store/files/2021/Track1_pos_1_BC7_overview.pdf

For this reason, I adjusted the data loading script `drugprot.py` to include the `test_background` split. The related pull request can be found here: https://github.com/bigscience-workshop/biomedical/pull/928
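
To illustrate the shape of the change, here is a minimal, dependency-free sketch of a loader that maps split names to data files, where exposing the extra split amounts to adding one entry. It is not the code from the pull request: `SPLIT_PREFIXES`, `split_files`, and the file-name layout are hypothetical and chosen only for illustration, and the actual archive layout may differ.

```python
from pathlib import Path

# Hypothetical file-name prefixes per split; the real archive layout may differ.
SPLIT_PREFIXES = {
    "train": "training/drugprot_training",
    "validation": "development/drugprot_development",
    # The entry this issue asks for: the combined test + background split.
    "test_background": "test-background/test_background",
}


def split_files(data_dir, split):
    """Return the abstracts file path for a split (illustrative layout only)."""
    if split not in SPLIT_PREFIXES:
        raise KeyError(f"unknown split: {split!r}")
    return Path(data_dir) / f"{SPLIT_PREFIXES[split]}_abstracts.tsv"


# Sizes stated in the issue: 750 test abstracts plus 10,000 background abstracts.
TEST_BACKGROUND_ABSTRACTS = 750 + 10_000
```

With a mapping like this, a request for an undeclared split fails loudly instead of silently returning nothing, which is how the missing split would be noticed.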