WMT-QE-Task / wmt-qe-2023-data

Evaluation script for the tasks and definition of the scores #2

Open alvations opened 1 year ago

alvations commented 1 year ago

Thank you for sharing the gold labels. Would you be able to share the .py file or evaluation scripts used to compute the Spearman / Kendall tau for Task 1?

And, if possible, the eval scripts for the rest of the tasks too.

Q: How do we interpret the scores from the gold labels?

E.g. we see negative scores, and it looks like the range is [-???, 1]. Does that mean that the scores produced by the systems we build should fall within that range?

lp      gold    sid     score
en-de   gold    0       0.5226848731820275
en-de   gold    1       0.5226848731820275
en-de   gold    2       0.5226848731820275
en-de   gold    3       0.5226848731820275
en-de   gold    4       -0.8813787858879307
en-de   gold    5       0.5226848731820275
en-de   gold    6       -2.207438908342892
en-de   gold    7       0.5226848731820275
en-de   gold    8       0.2769737328447848
en-de   gold    9       -0.5303628711204413
en-de   gold    10      -1.5834106154229102
en-de   gold    11      -2.987474274492868
en-de   gold    12      -1.2973976478345852
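(For reference, a minimal sketch of how a gold file like the one above could be loaded to inspect the per-LP score range; the filename `goldlabels.tsv` is only a placeholder, and the file is assumed to be tab-separated with a header row.)

```python
# Minimal sketch, not an official script: load a gold-label TSV like the one
# shown above and summarise the score range per language pair.
# "goldlabels.tsv" is a placeholder name, assumed tab-separated with the
# header columns lp, gold, sid, score.
import pandas as pd

gold = pd.read_csv("goldlabels.tsv", sep="\t")
print(gold.groupby("lp")["score"].agg(["min", "max", "mean", "count"]))
```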

Q: How were the gold labels created? By human rating through direct assessment?


Thank you in advance for the clarification!

chryssa-zrv commented 1 year ago

Hello,

The script is available here: https://github.com/WMT-QE-Task/qe-eval-scripts/blob/main/wmt23/task1_sentence_evaluate.py

The scores are normalised per annotator (the original scores were on different scales, a mix of direct assessments and MQM), so the range changes for each LP. Since we only compute correlations, what matters is the ordering of the scores within each language pair. If your model's scores are on a different scale, that is still fine, especially for Spearman, which is the primary metric.
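(Not part of the official script linked above, just a minimal sketch of why the scale does not matter for Spearman: the correlation depends only on the ordering, so any strictly monotonic rescaling of the system scores leaves it unchanged. The numbers below are made up.)

```python
# Illustration only (made-up system scores): Spearman depends only on ranks,
# so system scores on a different scale than the gold labels are fine.
import numpy as np
from scipy.stats import spearmanr

gold_scores = np.array([0.52, -0.88, 0.28, -2.21, -1.58, 0.52])
sys_scores = np.array([71.0, 12.0, 55.0, 3.0, 8.0, 69.5])  # e.g. a 0-100 scale

rho, _ = spearmanr(gold_scores, sys_scores)

# Any strictly monotonic transformation of the system scores gives the same rho.
rho_rescaled, _ = spearmanr(gold_scores, (sys_scores - 50.0) / 10.0)
assert np.isclose(rho, rho_rescaled)
print(f"Spearman rho = {rho:.3f}")
```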

Let me know if this helps!

alvations commented 1 year ago

Thank you for the prompt reply! Yes, the script you've pointed to helps.

We're trying to understand the difference between Spearman and Pearson. We'll look into SciPy's implementation, since that's where the function comes from.
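(For anyone else comparing the two, a small self-contained example with SciPy; the data is made up. Pearson measures linear correlation on the raw values, while Spearman is Pearson on the ranks, so it only measures monotonic agreement.)

```python
# Toy example (made-up data): Pearson is sensitive to the shape of the
# relationship, Spearman only to the ordering.
import numpy as np
from scipy.stats import pearsonr, spearmanr

gold = np.array([-3.0, -1.5, -0.5, 0.3, 0.5])
system = np.exp(gold)  # strictly monotonic but non-linear transformation

print(f"Pearson  r   = {pearsonr(gold, system)[0]:.3f}")   # < 1: not linear
print(f"Spearman rho = {spearmanr(gold, system)[0]:.3f}")  # 1.000: same ordering
```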

One quick follow-up question: is there a link to the baseline predictions needed to run the eval script?