AIPHES / ACL20-Reference-Free-MT-Evaluation

Reference-free MT Evaluation Metrics

Remapping matrices for missing languages covered by WMT16 and MLQE-PE #2

Closed · potamides closed this pull request 3 years ago

potamides commented 3 years ago

Hello there, I wanted to evaluate XMoverScore on the WMT16 and MLQE-PE datasets. Since remapping matrices did not exist for all language pairs, I had to compute some myself. With this pull request I want to contribute these remapping matrices to the XMoverScore project. I provide new CLP and UMD projection tensors, extracted from both the 8th and the 12th mBERT layer, for the following language directions: en-de, ro-en, en-zh, en-ru, ne-en, and si-en.
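For context, the two remapping flavors are applied differently: CLP re-maps the source-language embeddings with a learned linear projection matrix, while UMD subtracts the embeddings' component along a learned language-bias direction. Here is a minimal sketch of how such tensors might be applied, assuming they are stored as plain PyTorch tensors; the stand-in data and the function name are illustrative, not the exact code from this repository:

```python
import torch

def remap_embeddings(embeddings, tensor, method):
    """Move mBERT source-language embeddings towards the target space.

    embeddings: (num_tokens, hidden_size) activations from, e.g., layer 8
    tensor:     (hidden_size, hidden_size) matrix for CLP,
                or a (hidden_size,) unit vector for UMD
    """
    if method == "CLP":
        # cross-lingual linear projection: a single matrix multiplication
        return embeddings @ tensor
    if method == "UMD":
        # remove the component along the language-bias direction v: e - (e.v)v
        return embeddings - (embeddings @ tensor).unsqueeze(-1) * tensor
    raise ValueError(f"unknown remapping method: {method}")

# Stand-in data for illustration; in practice the tensors would come from
# the mapping files contributed in this pull request (via torch.load).
embeddings = torch.randn(10, 768)                             # 10 token embeddings
clp = torch.linalg.qr(torch.randn(768, 768)).Q                # stand-in CLP matrix
umd = torch.nn.functional.normalize(torch.randn(768), dim=0)  # stand-in UMD vector

print(remap_embeddings(embeddings, clp, "CLP").shape)  # torch.Size([10, 768])
print(remap_embeddings(embeddings, umd, "UMD").shape)  # torch.Size([10, 768])
```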

For English-German and Romanian-English I used the Europarl v7 corpus, for the non-European language pairs English-Chinese and English-Russian I used the UN v1 corpus, and for the low-resource language pairs Nepali-English and Sinhala-English I used the FLoRes v1 corpus. This is also reflected in the file names. The following two tables summarize the Pearson correlations achieved on both datasets. New language pairs are highlighted in bold:

| Pearson (WMT16)   | de-en | **en-ru** | ru-en | **ro-en** | cs-en | fi-en | tr-en | Average |
|-------------------|-------|-----------|-------|-----------|-------|-------|-------|---------|
| XMoverScore       | 27.23 | 42.59     | 50.22 | 20.96     | 36.06 | 6.46  | 5.68  | 28.90   |
| XMoverScore (UMD) | 28.71 | 45.74     | 48.87 | 23.14     | 38.63 | 10.97 | 9.75  | 30.85   |
| XMoverScore (CLP) | 35.19 | 49.19     | 45.09 | 29.21     | 45.85 | 34.13 | 24.31 | 36.75   |

| Pearson (MLQE-PE) | **en-de** | **en-zh** | ru-en | **ro-en** | et-en | **ne-en** | **si-en** | Average |
|-------------------|-----------|-----------|-------|-----------|-------|-----------|-----------|---------|
| XMoverScore       | 3.01      | 5.99      | 39.29 | 63.76     | 31.55 | 49.36     | 5.48      | 28.35   |
| XMoverScore (UMD) | 4.46      | 6.99      | 39.64 | 65.12     | 37.21 | 48.65     | 12.38     | 30.64   |
| XMoverScore (CLP) | 14.61     | 10.89     | 34.08 | 66.81     | 51.00 | 33.24     | 10.15     | 31.54   |
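For reference, the numbers above appear to be Pearson correlation coefficients between metric scores and human assessments, scaled by 100. A minimal sketch of how such a number is computed, with toy values rather than the actual evaluation data:

```python
from scipy.stats import pearsonr

# Toy values, purely illustrative (not data from this pull request).
metric_scores = [0.41, 0.77, 0.12, 0.58, 0.33]  # e.g., XMoverScore outputs
human_scores = [52.0, 81.5, 23.0, 60.0, 47.5]   # e.g., DA judgments

r, p = pearsonr(metric_scores, human_scores)
print(f"Pearson r = {r:.4f}, i.e. {100 * r:.2f} on the scale used above")
```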
andyweizhao commented 3 years ago

Merged. Thanks very much, @potamides! The results look nice overall.

btw, do you have any idea why the original XMoverScore performs so badly on MLQE-PE (en-de, en-zh and si-en)?

potamides commented 3 years ago

I think the bad performance on the high-resource language pairs en-de and en-zh is caused by a lack of variability in the assigned scores. This is already discussed in the MLQE-PE and the WMT 2020 Shared Task on Quality Estimation papers. Relevant excerpt:

> MT quality for the high-resource language pairs, in particular English-German, was the most challenging to predict. As discussed in Fomicheva et al. (2020a), the MT outputs for this language pair have little variability in terms of perceived MT quality. The vast majority of translations were assigned high scores during DA evaluation, which makes it difficult to capture any meaningful variation between the DA scores.

However, the cause of the bad performance on si-en is less obvious to me. It could be due to mBERT's generally poor performance on low-resource languages.
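To make the variability argument concrete, here is a small simulation (synthetic numbers, not MLQE-PE data): a metric that tracks quality up to a fixed per-segment error correlates strongly when the true scores are spread out, but much more weakly when they cluster in a narrow band:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 2000

for spread in (25.0, 5.0):  # wide vs. narrow spread of the true DA scores
    human = rng.normal(70.0, spread, size=n)       # true quality judgments
    metric = human + rng.normal(0.0, 5.0, size=n)  # metric = quality + noise
    r, _ = pearsonr(human, metric)
    print(f"DA std = {spread:4.1f} -> Pearson r = {r:.2f}")
```

With the same metric noise, the correlation drops from roughly 0.98 to roughly 0.71 as the score distribution narrows.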

andyweizhao commented 3 years ago

Thanks! That makes sense to me. The biased datasets (en-de and en-zh) could be too hard for metrics, and even for bilingual experts, to score properly.

btw, I found that Sinhala is not covered in mBERT, which could explain the bad results on si-en. But it's nice to see that the re-mapping techniques help a bit.
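One quick way to sanity-check this is to tokenize a Sinhala sentence with the mBERT tokenizer: if the script is absent from the vocabulary, most characters should map to [UNK]. A sketch using the Hugging Face transformers tokenizer (the sample sentences are arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

for lang, text in [("de", "Das ist ein Test."), ("si", "මෙය පරීක්ෂණයකි.")]:
    tokens = tokenizer.tokenize(text)
    unk = tokens.count(tokenizer.unk_token)
    print(f"{lang}: {tokens} ({unk}/{len(tokens)} tokens are UNK)")
```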

potamides commented 3 years ago

> btw, I found that Sinhala is not covered in mBERT, which could explain the bad results on si-en. But it's nice to see that the re-mapping techniques help a bit.

Ah, that makes sense; I wasn't aware of that. Thanks for looking into it!