Lightning-AI / torchmetrics

Torchmetrics - Machine learning metrics for distributed, scalable PyTorch applications.
https://lightning.ai/docs/torchmetrics/
Apache License 2.0

Add `GEMBA` (GPT-based metric for assessment of translation quality) #1579

Open stancld opened 1 year ago

stancld commented 1 year ago

🚀 Feature

Add GEMBA, a GPT-based metric for assessment of translation quality, introduced in the paper "Large Language Models Are State-of-the-Art Evaluators of Translation Quality".

Repo: https://github.com/MicrosoftTranslator/GEMBA

Motivation

Cover another cool SOTA translation metric.

Pitch

Add this metric by leveraging the OpenAI Python API. This would likely require caching results so that our tests do not run out of quota (I am not really familiar with the quotas on their services).
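To make the caching idea concrete, here is a rough sketch, not a proposed final design. Everything in it is hypothetical: `build_prompt`, `CachedScorer`, and the prompt wording are made up for illustration and are not taken from the GEMBA repo or from torchmetrics. The model call is passed in as a plain callable (`complete_fn`), so the real OpenAI client would be plugged in there, and tests can substitute a fake.

```python
import hashlib
import json
import re
from pathlib import Path
from typing import Callable, Dict, Optional


def build_prompt(source: str, hypothesis: str, src_lang: str, tgt_lang: str,
                 reference: Optional[str] = None) -> str:
    """Build a scoring prompt; the wording is an assumption, not GEMBA's template."""
    lines = [
        f"Score the following translation from {src_lang} to {tgt_lang} "
        "on a continuous scale from 0 to 100.",
        f"{src_lang} source: {source}",
    ]
    if reference is not None:  # reference-based mode
        lines.append(f"{tgt_lang} human reference: {reference}")
    lines.append(f"{tgt_lang} translation: {hypothesis}")
    lines.append("Score:")
    return "\n".join(lines)


class CachedScorer:
    """Hypothetical helper: caches raw model replies keyed by a hash of the prompt."""

    def __init__(self, complete_fn: Callable[[str], str],
                 cache_path: Optional[Path] = None) -> None:
        self.complete_fn = complete_fn  # e.g. a thin wrapper around an API client
        self.cache_path = cache_path
        self.cache: Dict[str, str] = {}
        self.calls = 0  # number of real (uncached) completions requested
        if cache_path is not None and cache_path.exists():
            self.cache = json.loads(cache_path.read_text(encoding="utf-8"))

    def _complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.complete_fn(prompt)
            if self.cache_path is not None:  # persist so test reruns stay offline
                self.cache_path.write_text(json.dumps(self.cache), encoding="utf-8")
        return self.cache[key]

    def score(self, source: str, hypothesis: str, src_lang: str, tgt_lang: str,
              reference: Optional[str] = None) -> Optional[float]:
        reply = self._complete(
            build_prompt(source, hypothesis, src_lang, tgt_lang, reference)
        )
        match = re.search(r"\d+(?:\.\d+)?", reply)  # first number in the reply
        if match is None:
            return None  # the model did not return a usable score
        return min(max(float(match.group()), 0.0), 100.0)  # clamp to [0, 100]


# Usage with a fake completion function standing in for the real API call.
scorer = CachedScorer(lambda prompt: " 85")
first = scorer.score("Guten Morgen", "Good morning", "German", "English")
second = scorer.score("Guten Morgen", "Good morning", "German", "English")
```

With this shape, the second call is served from the cache, so a committed cache file would let CI run without touching the API at all. Whether replies should be cached per prompt or per (model, prompt) pair is left open here.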

Alternatives

We can also add standard support for models available on 🤗 Transformers, as we do for BERTScore or InfoLM.

Additional context

We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate seven versions of GPT models, including ChatGPT. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Comparing to results from WMT22’s Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.

HanzhiZhang-Ulrica commented 2 months ago

Hi, are you still working on this metric? May I know when we will be able to use it?

Borda commented 1 month ago

> Hi, are you still working on this metric? May I know when we will be able to use it?

Or would you be interested in contributing it to TM? :rabbit: