Closed: HuihuiChyan closed this issue 4 years ago.
Hello,
Thanks for the interest. We will release the test set labels after the deadline for system submission. We will look into the discrepancies.
Best, Lucia
I finetuned BERT on the en-de data and achieved a Pearson correlation of 0.31 on the dev set. Since the gold labels are not available for the test set, I tested my model on the en-de training set, 7,000 sentence pairs in total (I did not use this training set, so it is okay to use it for testing), but I got a Pearson correlation of only 0.12.
So for en-de it is 0.31 on the dev set and 0.12 on the training set. Why is the gap so big?
I tried the same procedure on the en-zh data and got 0.41 on the dev set and 0.38 on the training set, so there seems to be no problem with en-zh.
Besides, when will the test labels be available? I would really like to compare my results with those in your paper.
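For anyone reproducing this comparison: the Pearson correlation between predicted and gold quality scores can be computed with `scipy.stats.pearsonr`, as in this minimal sketch. The score arrays below are illustrative placeholders, not the actual MLQE en-de data.

```python
# Minimal sketch: Pearson correlation between model predictions and
# gold quality-estimation scores. The arrays are illustrative
# placeholders, not the actual MLQE en-de scores.
from scipy.stats import pearsonr

predicted = [0.20, 0.55, 0.90, 0.40, 0.70, 0.15]
gold = [0.10, 0.60, 0.80, 0.50, 0.65, 0.30]

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(predicted, gold)
print(f"Pearson r = {r:.2f}")
```

A large gap between dev-set and training-set correlation, as reported above, is usually worth checking against label distributions or file alignment before comparing against published numbers.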