bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Mantra GSC new location (closes #891) #916

Closed. phlobo closed this 1 month ago

phlobo commented 3 months ago

Closes #891

Mantra GSC was moved from the original website to GitHub: https://github.com/mi-erasmusmc/Mantra-Gold-Standard-Corpus/tree/main

This PR points the loader at the new URL and creates an HF Hub version of the existing loader script for mantra_gsc.


leonweber commented 1 month ago

@phlobo I cannot push mantra_gsc to the hub because of the following error:

- "language[0]" with value "en, fr, de, nl, es" is not valid. It must be an ISO 639-1, 639-2 or 639-3 code (two/three letters), or a special value like "code", "multilingual". If you want to use BCP-47 identifiers, you can specify them in language_bcp47.

Could you please open a PR to fix this?
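
The error message indicates that the `language` field held one comma-separated string ("en, fr, de, nl, es") where the Hub expects each entry to be a separate two- or three-letter code. A minimal sketch of splitting such a string into individual codes follows. The helper name `split_language_field` is hypothetical and is not part of this repository or of any Hub tooling, and the check only tests the shape of each code, not whether it is a registered ISO 639 code. This is not necessarily how the fix was implemented.

```python
import re


# Hypothetical helper for illustration; not part of the biomedical repo or the Hub tooling.
def split_language_field(raw):
    """Split a comma-separated language string into individual lowercase codes."""
    codes = [part.strip().lower() for part in raw.split(",") if part.strip()]
    # Shape check only: two or three ASCII letters, as the error message describes.
    # It does not confirm that a code is actually registered in ISO 639.
    malformed = [code for code in codes if not re.fullmatch(r"[a-z]{2,3}", code)]
    if malformed:
        raise ValueError(f"not two/three-letter codes: {malformed}")
    return codes


languages = split_language_field("en, fr, de, nl, es")
print(languages)  # ['en', 'fr', 'de', 'nl', 'es']
```

Each resulting code would then be listed as its own entry under `language` instead of as a single string.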

phlobo commented 1 month ago

@leonweber Thank you for taking care of this. Please see #923.