bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

GNormPlus: Add NLMIAT sub-part to the data set #871

Closed mariosaenger closed 1 year ago

mariosaenger commented 1 year ago

The GNormPlus corpus consists of three sub-parts, BC2GN-Train, BC2GN-Test and NLMIAT, which are all shipped within a zip archive. However, the current implementation does not read the last sub-part (NLMIAT), because that part is already implemented as a distinct data set, the Citation GIA Test Collection:

https://github.com/bigscience-workshop/biomedical/tree/main/bigbio/hub/hub_repos/citation_gia_test_collection

However, this might be contrary to what users of the dataset expect, and it does not reflect the data discussed and used in the referenced paper. The implementation should be adapted to include all parts of the downloaded zip archive.
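
As a rough illustration of the check I have in mind, here is a minimal sketch that lists which sub-parts are actually present in a downloaded archive. It is not the bigbio loader code. The sub-part names in `SUBPARTS` and the helper names `find_subpart_files` and `missing_subparts` are hypothetical, and the real file names inside the GNormPlus zip archive may differ.

```python
import io
import zipfile

# Hypothetical sub-part identifiers; the real file names in the archive may differ.
SUBPARTS = ("BC2GNtrain", "BC2GNtest", "NLMIAT")


def find_subpart_files(archive, subparts=SUBPARTS):
    """Map each sub-part name to the archive members whose file name contains it."""
    found = {name: [] for name in subparts}
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            basename = member.rsplit("/", 1)[-1].lower()
            for name in subparts:
                if name.lower() in basename:
                    found[name].append(member)
    return found


def missing_subparts(archive, subparts=SUBPARTS):
    """Return the sub-parts for which no file was found in the archive."""
    files = find_subpart_files(archive, subparts)
    return [name for name in subparts if not files[name]]
```

A loader could run a check like `missing_subparts(path_to_zip)` before generating examples, so that a sub-part such as NLMIAT is never skipped silently. `archive` can be a path or a binary file-like object.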