The GNormPlus corpus consists of three sub-parts: BC2GN-Train, BC2GN-Test and NLMIAT, all of which are shipped within a single zip archive. However, the current implementation does not read the last sub-part (NLMIAT), because that part is also implemented as a distinct dataset, the Citation GIA Test Collection:
https://github.com/bigscience-workshop/biomedical/tree/main/bigbio/hub/hub_repos/citation_gia_test_collection
However, this might be contrary to the expectations of users of the dataset, and it does not reflect the data discussed and used in the referenced paper. This PR adapts the implementation to use all parts of the downloaded zip archive. Moreover, incorrect entity offsets are fixed.
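For reviewers who want to sanity-check entity offsets, a minimal sketch of such a check follows. It is not the code from this PR: the function name `find_offset_mismatches` and the `(start, end, mention)` tuple layout are hypothetical and chosen only for illustration. It assumes that an offset is correct when slicing the document text with it reproduces the annotated mention.

```python
def find_offset_mismatches(text, entities):
    """Return the entities whose (start, end) span does not reproduce the mention.

    `entities` is a list of (start, end, mention) tuples; this layout is an
    assumption made for this sketch, not the dataset's actual schema.
    """
    mismatches = []
    for start, end, mention in entities:
        if text[start:end] != mention:
            mismatches.append((start, end, mention))
    return mismatches


text = "BRCA1 interacts with TP53."
# The second span is deliberately shifted by one character.
entities = [(0, 5, "BRCA1"), (22, 26, "TP53")]
print(find_offset_mismatches(text, entities))  # [(22, 26, 'TP53')]
```

An empty result means every span reproduces its mention under this assumption.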