get_wmt18_seg_result question

Hou-jing commented 1 year ago

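Hello, when I run get_wmt18_seg_result.py I get the results below, which differ from the absolute Pearson correlation reported in your paper. Could you explain what the results from get_wmt18_seg_result.py represent? How can I obtain the absolute Pearson correlation results reported in the paper?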

| model_type | cs-en | de-en | et-en | fi-en | ru-en | tr-en | zh-en | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| roberta-large P | 0.3831702544031311 | 0.54582257007364 | 0.39514465541862803 | 0.2939672801635992 | 0.35140330642060746 | 0.29337243401759533 | 0.2487933567167311 | 0.35881055103056175 |
| roberta-large R | 0.40626223091976515 | 0.5499350991505058 | 0.39761287706493187 | 0.3182515337423313 | 0.35851595540176856 | 0.29665689149560115 | 0.25802680097131037 | 0.3693230555351735 |
| roberta-large F | 0.4140900195694716 | 0.5548187274292837 | 0.4033603074698965 | 0.30610940695296524 | 0.354479046520569 | 0.3020527859237537 | 0.26480199058668347 | 0.3713874692075177 |
| allenai/scibert_scivocab_uncased P | 0.3232876712328767 | 0.506162367788616 | 0.3475784982634298 | 0.21945296523517382 | 0.3081507112648981 | 0.22205278592375366 | 0.2200737476391762 | 0.3066798210497034 |
| allenai/scibert_scivocab_uncased R | 0.3557729941291585 | 0.5114572489750806 | 0.35551206784083494 | 0.25140593047034765 | 0.31103421760861205 | 0.23096774193548386 | 0.23560272206733218 | 0.32167898900383574 |
| allenai/scibert_scivocab_uncased F | 0.34794520547945207 | 0.5180887021115267 | 0.3570282611378502 | 0.24821063394683027 | 0.31391772395232603 | 0.23026392961876832 | 0.23302455256767696 | 0.3212112869734901 |
| bert-base-chinese P | 0.225440313111546 | 0.4306460526146689 | 0.30604185398705946 | 0.17356850715746422 | 0.23317954632833526 | 0.19390029325513197 | 0.18409928950445184 | 0.24955369370837963 |
| bert-base-chinese R | 0.2802348336594912 | 0.4495379830615209 | 0.32434195447894076 | 0.21434049079754602 | 0.2808535178777393 | 0.20351906158357772 | 0.1951314566657673 | 0.27827989973208334 |
| bert-base-chinese F | 0.25909980430528373 | 0.45072033517111976 | 0.3256465859205585 | 0.20718302658486706 | 0.26855055747789314 | 0.21008797653958944 | 0.19722996672362622 | 0.27407403610327685 |
| dbmdz/bert-base-turkish-cased P | 0.2939334637964775 | 0.4726966624256211 | 0.3287847534422877 | 0.18852249488752557 | 0.2750865051903114 | 0.2133724340175953 | 0.20964115478010611 | 0.2831482097914178 |
| dbmdz/bert-base-turkish-cased R | 0.32289628180039137 | 0.4839290074668106 | 0.3352726503411435 | 0.2312116564417178 | 0.2877739331026528 | 0.2129032258064516 | 0.21731570584884732 | 0.298757494401145 |
| dbmdz/bert-base-turkish-cased F | 0.31859099804305285 | 0.4864993381398517 | 0.3396801889952575 | 0.22124233128834356 | 0.29257977700884275 | 0.22041055718475072 | 0.21977396048805348 | 0.2998253073068789 |
| bert-base-multilingual-cased P | 0.338160469667319 | 0.5064708074693809 | 0.3588265369087287 | 0.22955010224948874 | 0.3075740099961553 | 0.23049853372434018 | 0.22888748988218366 | 0.31428113569965666 |
| bert-base-multilingual-cased R | 0.3573385518590998 | 0.5134107002865919 | 0.36771213483542253 | 0.25639059304703476 | 0.32410611303344866 | 0.2300293255131965 | 0.2407590610666427 | 0.3271066399487767 |
| bert-base-multilingual-cased F | 0.3585127201565558 | 0.5162380640269371 | 0.36690114772306553 | 0.25140593047034765 | 0.31718569780853517 | 0.24269794721407625 | 0.24279761369427708 | 0.3279627315848278 |

Hou-jing commented 1 year ago

wmt18_log.csv

twadada commented 1 year ago

Hi, I get the same result as @Hou-jing using RoBERTa-large, and I'm also wondering why the numbers differ from the scores reported in the paper. I also tried to reproduce the results on WMT17-seg and got different (slightly better) results (0.7614 on tr-en), so I'm assuming the implementation has changed a bit. Could @felixgwu, @Tiiiger, or someone else clarify this?

felixgwu commented 1 year ago

Hi @Hou-jing and @twadada,

The script get_wmt18_seg_result.py computes WMT18 segment-level Kendall correlation scores (Table 4 in the paper), while a separate script, get_wmt17_sys_results.py, computes WMT17 system-level Pearson correlation scores (Tables 15 and 17 in the paper). To get WMT18 system-level scores, you'll need to modify get_wmt17_sys_results.py to use WMT18 data.
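To make the difference between the two kinds of numbers concrete, here is a minimal sketch on toy data. It is not the code from either script. The function names, the `(better, worse)` pair format for human judgments, and the choice to count metric ties as discordant are all assumptions made for illustration.

```python
from statistics import mean


def kendall_tau_like(metric_scores, better_worse_pairs):
    """Segment-level: (concordant - discordant) / (concordant + discordant).

    `better_worse_pairs` is a hypothetical format: each pair holds the index
    of the translation humans preferred, then the index of the one they
    ranked lower. Metric ties are counted as discordant (an assumption).
    """
    concordant = discordant = 0
    for better, worse in better_worse_pairs:
        if metric_scores[better] > metric_scores[worse]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def pearson(xs, ys):
    """System-level: plain Pearson correlation between two score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy segment-level data: four translations, four human preference pairs.
segment_scores = [0.9, 0.4, 0.7, 0.2]
human_pairs = [(0, 1), (2, 1), (3, 2), (0, 3)]
tau = kendall_tau_like(segment_scores, human_pairs)

# Toy system-level data: one averaged metric score and one human score per system.
system_metric = [0.1, 0.2, 0.3]
system_human = [1.0, 2.0, 3.0]
r = pearson(system_metric, system_human)

print(tau, r)
```

Because the segment-level statistic only counts agreement on pairwise rankings, its values are not comparable with system-level Pearson correlations, which are computed on per-system scores.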

Admittedly, we observe that these scripts give slightly different scores from those in the paper. We found that BERTScore values can change slightly with newer versions of PyTorch and Hugging Face's transformers. We will look into this.
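Since the scores can depend on library versions, it may help to record the installed versions next to any reproduced numbers. The snippet below is a small sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the package names assume the usual PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("torch", "transformers", "bert-score")):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is absent from this environment.
            found[name] = None
    return found


print(installed_versions())
```

Posting this output together with the scores makes it easier to tell whether a mismatch comes from the environment or from the evaluation script.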

Tiiiger / bert_score

get_wmt18_seg_result question #143