bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Create a dataset loader for MoNERo #67

Open hakunanatasha opened 2 years ago

hakunanatasha commented 2 years ago

From https://www.racai.ro/en/tools/text/

napsternxg commented 2 years ago

self-assign

hakunanatasha commented 2 years ago

Hi @napsternxg, can you let us know if you are still working on this so we can update our project board? Please let us know the status by Friday, April 8. No worries if you are not finished but still intend to work on this. Either ping me here at @hakunanatasha or ping the Discord admins (with @admins).

napsternxg commented 2 years ago

Hi @hakunanatasha, yes, I plan to work on this over the weekend.

jason-fries commented 2 years ago

Hi @napsternxg Just a ping on the status of this dataset. Please let us know if you are still working on it and when you plan to submit a PR. Thanks!!

napsternxg commented 2 years ago

Hi @jason-fries thanks for the reminder. I have started work on this in my local branch. Will send a PR early next week.

napsternxg commented 2 years ago

Details on the paper:

@inproceedings{mitrofan-etal-2019-monero,
    title = "{M}o{NER}o: a Biomedical Gold Standard Corpus for the {R}omanian Language",
    author = "Mitrofan, Maria  and
      Barbu Mititelu, Verginica  and
      Mitrofan, Grigorina",
    booktitle = "Proceedings of the 18th BioNLP Workshop and Shared Task",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-5008",
    doi = "10.18653/v1/W19-5008",
    pages = "71--79",
}

The corpus is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license, so I have downloaded it and uploaded it here in tar.gz format for use in the data loader.

MoNERo.tar.gz

The dataset doesn't include any offset information, so I am going to build the text by joining the tokens with single spaces and compute the character offsets on the resulting text.
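A minimal sketch of that offset computation, assuming each sentence is already available as a list of token strings. The function name `join_tokens_with_offsets` and the sample sentence are hypothetical and are not taken from the corpus or from the PR.

```python
from typing import List, Tuple


def join_tokens_with_offsets(tokens: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join tokens with single spaces and return the text plus one
    (start, end) character offset per token, with `end` exclusive."""
    offsets: List[Tuple[int, int]] = []
    cursor = 0
    for token in tokens:
        start = cursor
        end = start + len(token)
        offsets.append((start, end))
        # Advance past the token and the single joining space.
        cursor = end + 1
    return " ".join(tokens), offsets


# Hypothetical example sentence, used only for illustration.
tokens = ["Pacientul", "are", "diabet", "zaharat", "."]
text, offsets = join_tokens_with_offsets(tokens)

# Slicing the joined text with each offset recovers the original token.
recovered = [text[start:end] for start, end in offsets]
```

An entity that covers several consecutive tokens can then take the start offset of its first token and the end offset of its last token. Note that the offsets refer to the space-joined text, not to the original document's spacing.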

napsternxg commented 2 years ago

Added PR: #516