bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Closes #871 #872

Closed mariosaenger closed 1 year ago

mariosaenger commented 1 year ago

The GNormPlus corpus consists of three sub-parts, BC2GN-Train, BC2GN-Test and NLMIAT, all shipped in a single zip archive. However, the current implementation does not read the last sub-part (NLMIAT), because that part is also implemented as a distinct dataset, the Citation GIA Test Collection:

https://github.com/bigscience-workshop/biomedical/tree/main/bigbio/hub/hub_repos/citation_gia_test_collection

However, this may be contrary to what users of the dataset expect, and it does not reflect the data discussed and used in the referenced paper. This PR adapts the implementation to use all parts of the downloaded zip archive. It also fixes incorrect entity offsets.
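
For illustration only, here is a minimal sketch of the two ideas above. It is not the loader code from this PR. The function names (`iter_subpart_files`, `offsets_match`) and the assumption that each file in the archive carries its sub-part name in its path are hypothetical; the real archive layout may differ.

```python
import io
import zipfile

# Sub-part names as listed in the description above. Matching them against
# member paths is an assumption about the archive layout, not a verified fact.
SUBPARTS = ("BC2GN-Train", "BC2GN-Test", "NLMIAT")


def iter_subpart_files(archive, subparts=SUBPARTS):
    """Yield (subpart, member_name) for every archive member of a listed sub-part.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if name.endswith("/"):
                continue  # skip directory entries
            for part in subparts:
                if part in name:
                    yield part, name
                    break


def offsets_match(text, start, end, mention):
    """Return True if the [start, end) character span of `text` equals `mention`."""
    return text[start:end] == mention
```

Iterating over all three names, rather than only the first two, is what makes the NLMIAT part visible to the loader. A check like `offsets_match` is one simple way to detect entity offsets that do not line up with the passage text.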

sg-wbi commented 1 year ago

Thanks for catching this! This is indeed how the authors intended the corpus to be used (see here)