bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Proposal to add the "Snomed Translation Dictionaries" datasets #337

Closed FremyCompany closed 1 year ago

FremyCompany commented 2 years ago

Hi,

I am the author of the following translation corpus:

Parallel corpus built with SNOMED CT (by using LaBSE for candidate matching): This repository provides parallel corpora for medical concept translation between English (en) and the following languages: Spanish (es), French (fr), Dutch (nl), Danish (da), and Swedish (sv).

https://github.com/FremyCompany/snomed-translate-dictionaries

I see that you are already working on a translation corpus (ParaMed). Would it be interesting if I contributed these dictionaries?

I hope the paper for this dataset will be published at EAMT this year. I already talked about this dataset at CLIN last year, as well as at the 2nd Dutch meeting on Dutch Clinical NLP, but that doesn't do justice to the dataset, which is multilingual.

hakunanatasha commented 2 years ago

Hi @FremyCompany, has anyone followed up on this or the other issue/dataset you suggested? I think our recommendation is that you are welcome to implement these. @leonweber @jason-fries @galtay @sunnnymskang tagged in case I am misspeaking.

hakunanatasha commented 2 years ago

@FremyCompany I'm not sure if someone followed up, but feel free to propose a dataloader for this.

hakunanatasha commented 1 year ago

@FremyCompany feel free to propose this dataloader in a new issue and PR; I will close this for now.