erip opened 3 years ago
Reading a bit more about the dataset:
Manual evaluation conducted on a sample set suggests that 94% of the sentences are correctly aligned, with about 20% of the sentence pairs exhibiting additional content in one of the languages.
If the data is misaligned, how are we able to use it for evaluation? I'm not sure how to tie a source segment to its reference translation... Perhaps a better question would be whether there are any evaluation scripts available. 😄
Dear erip, thank you for your comments. The distributed version includes the dataset segmented into sentences. It was then aligned with a tool called YASA; the alignment is not always one-to-one at the sentence level. The evaluation of the alignment is based on the YASA output, which was used in the task.
Please check the sentence alignment archive at https://cabernet.limsi.fr/EDP_EN.html.
Thanks, @aneveol! I somehow completely missed that part of the page -- apologies! So in general it's multi-sentence source, multi-sentence reference. This makes sense!
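For anyone else landing here, a minimal sketch of how a multi-sentence alignment could be turned into source/reference pairs. The record format is an assumption made for illustration only: each alignment is a hypothetical pair of index lists (source sentence indices, target sentence indices), which is not necessarily how the YASA output in the archive is laid out, and `build_bitext` is a made-up helper name.

```python
from typing import List, Sequence, Tuple

# Hypothetical alignment record: (source sentence indices, target sentence indices).
# The real alignment archive may use a different layout; adapt the parsing accordingly.
Alignment = Tuple[Sequence[int], Sequence[int]]


def build_bitext(
    src_sents: List[str],
    tgt_sents: List[str],
    alignments: List[Alignment],
) -> List[Tuple[str, str]]:
    """Join the sentences of each alignment group into one source/reference pair."""
    pairs = []
    for src_ids, tgt_ids in alignments:
        src = " ".join(src_sents[i] for i in src_ids)
        tgt = " ".join(tgt_sents[j] for j in tgt_ids)
        # Skip groups where one side is empty (content present in one language only).
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Toy data, invented for the example.
src = ["A one.", "A two.", "A three."]
tgt = ["B one and two.", "B three.", "B extra."]
alignments = [([0, 1], [0]), ([2], [1]), ([], [2])]

for s, t in build_bitext(src, tgt, alignments):
    print(s, "|||", t)
```

Groups with an empty side are dropped here so that every remaining source segment has a reference; whether that is the right choice for scoring depends on how the task's evaluation treated unaligned sentences.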
Hi all, thanks for the resource. I'm interested in using this dataset, but I am finding the EDP test data to be quite misaligned. My script, which extracts segments from the bioC files, shows:
Am I doing something wrong? Do you have a script which can extract the bitext directly?