bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Proposal to add EMEA Parallel datasets #338

Open FremyCompany opened 2 years ago

FremyCompany commented 2 years ago

I wouldn't mind contributing translation pairs for the EMEA drug notices. They are already available here:

https://opus.nlpl.eu/EMEA.php

The corpus is described there as follows:

This is a parallel corpus made out of PDF documents from the European Medicines Agency. All files are automatically converted from PDF to plain text using pdftotext.

I have also found manually aligned data for the same dataset. It is not very well known, but I could contribute it too:

https://link.springer.com/chapter/10.1007/978-3-642-40802-1_32

If that dataset is not already being considered for span tagging, I could work on that too.

What are your thoughts?

hakunanatasha commented 2 years ago

Hi @FremyCompany, let us know if you'd still like to work on this - we're happy to confirm it.

qanastek commented 2 years ago

I have already implemented it: https://huggingface.co/datasets/qanastek/EMEA-V3

@hakunanatasha Should I convert it to the BigScience project schema too?

hakunanatasha commented 1 year ago

@qanastek you're welcome to PR this in the bb schema; we're going through a backlog of issues now.