Closed abhi-agg closed 2 years ago
If these are the correct thresholds, we can also lose the "these thresholds are just examples" comments.
I just want to confirm with @mfomicheva @abarbosa94 @felipesantosk once more that the threshold of -0.5
is a good one as a starting point for all the language pairs irrespective of whether the quality scores are returned using translation models or supervised QE models under the hood.
I can remove the comment after their confirmation. Thanks for pointing out 👍🏾
I responded on slack
Just documenting what @mfomicheva shared:
For the supervised models that were fitted on annotated data (En-Es, En-Cs and En-Et language pairs), you should use the threshold that corresponds to the log of 0.5, which is around -0.6931 (here log means ln).
For the unsupervised case where the returned value is just the average log-prob coming directly from the MT model, I think you should still start with the same threshold and experiment further with it
Scores in the range [-0.6931, 0] indicate good quality.
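To make the threshold logic above concrete, here is a minimal sketch. The constant and helper names (`QE_THRESHOLD`, `is_good_quality`) are hypothetical, not part of the actual PR; the sketch only illustrates the ln(0.5) cutoff and the "higher score is better" convention being discussed.

```python
import math

# Threshold from the discussion: ln(0.5), roughly -0.6931 (natural log).
QE_THRESHOLD = math.log(0.5)

def is_good_quality(qe_score: float) -> bool:
    """Hypothetical helper: higher QE scores mean better quality,
    so scores in [ln(0.5), 0] are treated as good."""
    return qe_score >= QE_THRESHOLD

print(round(QE_THRESHOLD, 4))  # -0.6931
print(is_good_quality(-0.3))   # True: inside [-0.6931, 0]
print(is_good_quality(-1.0))   # False: below the threshold
```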
I will modify PR to reflect these changes. Updating the description of the PR as well.
Higher QE scores mean better quality. Changed the threshold from -0.5 to ln(0.5) ≈ -0.6931, as per discussions in QE meetings. @mfomicheva @abarbosa94 @felipesantosk Please let me know if any of the above is wrong 👍🏾