EDP test set is misaligned?

biomedical-translation-corpora / corpora

Parallel corpora for the biomedical domain

EDP test set is misaligned? #3

Open erip opened 3 years ago

erip commented 3 years ago

Hi all, thanks for the resource. I'm interested in using this dataset, but I am finding the EDP test data to be quite misaligned. My script, which extracts segments from the BioC files, shows:

Extracted 818 English and 794 French sents from aleatoire2_EN_FR_bioC.xml
Extracted 851 English and 874 French sents from aleatoire3_EN_FR_bioC.xml

Am I doing something wrong? Do you have a script which can extract the bitext directly?

erip commented 3 years ago

Reading a bit more about the dataset:

Manual evaluation conducted on a sample set suggests that 94% of the sentences are correctly aligned, with about 20% of the sentence pairs exhibiting additional content in one of the languages.

If the data is misaligned, how are we able to use it for evaluation? I'm not sure how to tie a source segment to its reference translation... Perhaps a better question would be whether there are any evaluation scripts available. 😄

aneveol commented 3 years ago

Dear erip, thank you for your comments. The distributed version includes the dataset segmented into sentences. It was then aligned with a tool called YASA; the alignment is not always one-to-one at the sentence level. The evaluation of the alignment is based on the YASA output, which was used in the task.

aneveol commented 3 years ago

Please check the sentence alignment archive at https://cabernet.limsi.fr/EDP_EN.html.

erip commented 3 years ago

Thanks, @aneveol! I somehow completely missed that part of the page -- apologies! So in general it's multi-sentence source, multi-sentence reference. This makes sense!
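
For anyone else landing here, a minimal sketch of how I would turn such an alignment into evaluation pairs. The format of the alignment archive is not shown in this thread, so the link representation below (a pair of sentence-id lists) is purely an assumption, as are the names `build_bitext` and `Link` and the toy sentences. Parsing the real files into these structures is left out.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical alignment link: (source sentence ids, target sentence ids).
# A link may be 1-1, 2-1, 1-2, etc.; the real archive layout may differ.
Link = Tuple[Sequence[str], Sequence[str]]


def build_bitext(
    src: Dict[str, str],
    tgt: Dict[str, str],
    links: List[Link],
) -> List[Tuple[str, str]]:
    """Join the sentences of each link into one source/reference pair."""
    pairs = []
    for src_ids, tgt_ids in links:
        # Skip links that are empty on one side (content with no counterpart).
        if not src_ids or not tgt_ids:
            continue
        src_seg = " ".join(src[i] for i in src_ids)
        tgt_seg = " ".join(tgt[i] for i in tgt_ids)
        pairs.append((src_seg, tgt_seg))
    return pairs


# Toy data, not taken from the EDP files.
src = {
    "e1": "The product is indicated for adults.",
    "e2": "It must be taken with food.",
    "e3": "See section 4.",
}
tgt = {
    "f1": "Le produit est indiqué chez l'adulte et doit être pris avec de la nourriture.",
}
links = [(["e1", "e2"], ["f1"]), (["e3"], [])]

pairs = build_bitext(src, tgt, links)
for source_segment, reference_segment in pairs:
    print(source_segment, "|||", reference_segment)
```

With segments built this way, each (possibly multi-sentence) source has exactly one reference, so the differing sentence counts per file no longer matter for scoring.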