If a metric A (for example WMD or WMDo) is negatively correlated with our target, we use exp(1-A). Because A is normalized to the range 0 to 1, 1-A stays in that range as well. Otherwise we use exp(A), as for the other positively correlated metrics.
Most popular metrics for evaluating machine translation:
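The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `transform_score` and the `negatively_correlated` flag are hypothetical, and the sketch assumes the score has already been normalized to the range 0 to 1. How the transformed scores are then combined is not specified here.

```python
import math


def transform_score(score: float, negatively_correlated: bool) -> float:
    """Return exp(1 - score) for negatively correlated metrics, else exp(score).

    Assumes `score` is already normalized to [0, 1] (hypothetical helper).
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score is expected to be normalized to [0, 1]")
    # Flip negatively correlated metrics so that larger always means better.
    oriented = 1.0 - score if negatively_correlated else score
    return math.exp(oriented)


# A distance-like metric such as WMD: a small distance gives a large value.
low_distance = transform_score(0.1, negatively_correlated=True)
high_distance = transform_score(0.9, negatively_correlated=True)

# A similarity-like metric is used as is.
similarity = transform_score(0.9, negatively_correlated=False)
```

With this orientation, `low_distance` is larger than `high_distance`, and a distance of 0.1 is transformed to the same value as a similarity of 0.9.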
WMD
Semantic Sentence Cosine Similarity
WMD: XLM-Roberta-base
Cosine Similarity: XLM-Roberta-base-nli-stsb-mean-tokens
BERTScore: XLM-Roberta-base
Metric | ende (pearson) | enfr (pearson) |
---|---|---|
wmd | 36.046 | 31.851 |
Similarity | 48.304 | 44.708 |
Bert score | 33.491 | 29.024 |
Similarity + wmd | 49.870 | 44.981 |
Similarity + Bert score | 48.562 | 43.325 |
Bert score + wmd | 36.668 | 32.020 |
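
For reference, a Pearson correlation between metric scores and target scores could be computed as in the sketch below. The helper name `pearson` and the example numbers are hypothetical, and it is an assumption that most values in these tables are Pearson coefficients scaled by 100. This is not necessarily the script that produced the results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for illustration only (not taken from any dataset).
metric_scores = [0.2, 0.5, 0.7, 0.9]
target_scores = [0.1, 0.4, 0.8, 0.7]
r_scaled = 100 * pearson(metric_scores, target_scores)
```
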
Metric | deen | zhen | fien | lven | ruen | csen | enru | enzh | tren | Avg |
---|---|---|---|---|---|---|---|---|---|---|
wmd | 36.616 | 50.062 | 37.291 | 37.292 | 30.779 | 26.677 | 40.449 | 40.786 | 35.042 | 37.222 |
Similarity | 45.648 | 51.398 | 54.030 | 55.511 | 54.121 | 46.416 | 50.534 | 45.779 | 54.044 | 50.831 |
Bert score | 40.903 | 50.991 | 41.351 | 40.159 | 33.656 | 31.914 | 43.390 | 44.568 | 38.155 | 40.565 |
Similarity + wmd | 50.422 | 59.424 | 56.590 | 56.910 | 53.420 | 47.619 | 53.759 | 51.297 | 56.237 | 53.964 |
Similarity + Bert score | 52.267 | 60.030 | 57.847 | 57.430 | 55.101 | 49.909 | 55.272 | 53.140 | 56.841 | 55.315 |
Bert score + wmd | 38.451 | 49.292 | 37.584 | 36.711 | 32.158 | 29.124 | 41.028 | 42.155 | 33.647 | 37.794 |
WMT-20 quality estimation task
Metric | neen | ende | eten | enzh | roen | sien | ruen | Avg |
---|---|---|---|---|---|---|---|---|
wmd | 36.107 | 45.643 | 46.322 | 25.103 | 64.656 | 30.831 | 31.538 | 40.029 |
Similarity | 31.294 | 33.042 | 48.099 | 40.063 | 69.402 | 40.417 | 44.134 | 43.779 |
Bert score | 35.700 | 45.928 | 45.950 | 25.980 | 67.289 | 30.906 | 31.965 | 40.531 |
Similarity + wmd | 38.967 | 47.229 | 55.256 | 42.683 | 72.431 | 42.588 | 47.582 | 49.534 |
Similarity + Bert score | 39.237 | 48.364 | 55.319 | 42.664 | 72.664 | 42.604 | 47.508 | 49.766 |
Bert score + wmd | 36.431 | 45.568 | 45.831 | 25.419 | 64.543 | 33.126 | 32.032 | 40.421 |
Multi30K
Metric | dede | frfr | Avg |
---|---|---|---|
wmd | 49.240 | 42.491 | 45.866 |
Cos Similarity | 48.672 | 44.636 | 46.654 |
Bert_Score | 43.389 | 35.205 | 39.297 |
Cos Similarity + wmd | 54.584 | 50.132 | 52.358 |
Cos Similarity + Bert_score | 52.653 | 46.153 | 49.403 |
Bert_Score + wmd | 49.379 | 42.433 | 45.906 |
with XLM-Roberta-Base embedding
WMT-17
Metric | deen | zhen | fien | lven | ruen | csen | tren | Avg |
---|---|---|---|---|---|---|---|---|
wmd | 73.005 | 76.905 | 82.741 | 73.610 | 73.259 | 69.845 | 76.974 | 75.191 |
Cos Similarity | 61.229 | 65.334 | 72.984 | 70.286 | 69.987 | 62.232 | 65.355 | 66.772 |
Bert_Score | 74.479 | 77.477 | 83.324 | 75.636 | 74.555 | 70.971 | 75.083 | 75.932 |
Cos Similarity + wmd | 75.526 | 77.924 | 84.688 | 78.068 | 78.645 | 73.144 | 78.093 | 78.013 |
Cos Similarity + Bert_score | 76.988 | 78.503 | 86.010 | 79.210 | 79.638 | 74.616 | 78.216 | 79.026 |
Bert_Score + wmd | 73.889 | 77.055 | 84.113 | 74.975 | 74.189 | 70.599 | 77.728 | 76.078 |
WMT17 with Roberta-Base
Metric | deen | zhen | fien | lven | ruen | csen | tren | Avg |
---|---|---|---|---|---|---|---|---|
wmd | 0.667 | 0.743 | 0.818 | 0.693 | 0.705 | 0.663 | 0.744 | 0.719 |
Cos Similarity | 0.612 | 0.655 | 0.705 | 0.680 | 0.642 | 0.599 | 0.644 | 0.648 |
Bert_Score | 0.683 | 0.740 | 0.818 | 0.693 | 0.707 | 0.675 | 0.718 | 0.719 |
Cos Similarity + wmd | 0.718 | 0.767 | 0.832 | 0.755 | 0.736 | 0.703 | 0.764 | 0.754 |
Cos Similarity + Bert_score | 0.728 | 0.767 | 0.843 | 0.755 | 0.744 | 0.717 | 0.758 | 0.759 |
Bert_Score + wmd | 0.678 | 0.740 | 0.824 | 0.693 | 0.703 | 0.670 | 0.745 | 0.722 |