Shared-task-on-NLG-Evaluation
Repository to organize a shared task on NLG evaluation
Description
We propose a shared task on the evaluation of data-to-text NLG systems, focusing on a common task, e.g. E2E or WebNLG. The shared task asks participants to evaluate system outputs and to produce (1) a score for each output and (2) a ranking of the different systems. There will be a separate track for qualitative studies. At the workshop, we will discuss the results of the shared task; afterwards, we will collaborate with interested parties on a journal paper based on the shared task data.
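As a rough illustration of the two submission components, the sketch below shows one way per-output scores could be turned into a system ranking. This is a minimal sketch, not a prescribed submission format: the dictionary layout and the mean-based aggregation are assumptions, and participants are free to aggregate differently.

```python
# Minimal sketch: rank systems by the mean of their per-output scores.
# The data layout ({system_name: [score, ...]}) is hypothetical.
from statistics import mean

def rank_systems(per_output_scores):
    """Return system names ordered from best to worst by mean per-output score."""
    means = {system: mean(scores) for system, scores in per_output_scores.items()}
    return sorted(means, key=means.get, reverse=True)

if __name__ == "__main__":
    # Hypothetical example data for three systems.
    scores = {
        "system_A": [0.71, 0.65, 0.80],
        "system_B": [0.55, 0.60, 0.58],
        "system_C": [0.90, 0.42, 0.77],
    }
    print(rank_systems(scores))  # ['system_A', 'system_C', 'system_B']
```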
We encourage original contributions, for example:
- comparing evaluation metrics across different languages.
- studying how linguistic properties of a language (e.g. morphological complexity, flexibility in word order) impact evaluation.
- focusing on evaluation for low-resource languages.
- looking at individual differences in human evaluation (e.g. studying differences in ratings between L1 and L2 speakers).
- carrying out a qualitative study (error analysis) of the differences between the systems.
- comparing evaluation scores for different constructs: fluency, naturalness, grammaticality, readability, correctness, …
- discussing the feasibility of 'overall quality' metrics.
- studying task effects on human evaluation scores. Does the response type (e.g. a continuous scale versus a 5- or 7-point Likert scale) or the formulation of the task goal influence the scores?
- comparing different implementations of the ‘same’ automatic evaluation metrics (see the sketch below this list).
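To illustrate the last point, the sketch below compares two widely used BLEU implementations (NLTK and SacreBLEU) on the same toy sentence pair. They differ, among other things, in tokenization and in scale (0–1 versus 0–100), which is exactly the kind of discrepancy such a contribution could analyse. The example sentences are illustrative only.

```python
# Minimal sketch comparing two BLEU implementations on the same (toy) data.
# Requires: pip install nltk sacrebleu
from nltk.translate.bleu_score import corpus_bleu as nltk_corpus_bleu
import sacrebleu

hypotheses = ["the restaurant serves cheap Italian food in the city centre"]
references = ["The restaurant offers cheap Italian food in the centre of the city"]

# NLTK: expects pre-tokenized input, returns a score in [0, 1].
nltk_score = nltk_corpus_bleu(
    [[ref.split()] for ref in references],   # one list of references per hypothesis
    [hyp.split() for hyp in hypotheses],
)

# SacreBLEU: tokenizes internally, returns a score in [0, 100].
sacre_score = sacrebleu.corpus_bleu(hypotheses, [references]).score

print(f"NLTK BLEU:      {nltk_score:.4f}")
print(f"SacreBLEU BLEU: {sacre_score:.2f}")
```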
The shared task data consists of three kinds of ‘output’ texts:
- Real system output.
- Human-generated text.
- Synthetic text, manipulated by the organizers.
We will run all standard automatic evaluation metrics on the shared task data. These scores will be made public, along with the shared task data. The category labels (system, human, synthetic) will be released at a later date, but before the workshop.
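Once the per-output metric scores are released, participants can relate metrics to one another or to their own human judgments. The sketch below is a minimal example that assumes a hypothetical CSV release with one row per output and columns named `BLEU` and `METEOR`; the actual release format is not fixed yet and may differ.

```python
# Minimal sketch, assuming a hypothetical per-output score file
# "shared_task_scores.csv" with one column per metric.
import csv
from scipy.stats import spearmanr

def load_metric_columns(path, metric_a, metric_b):
    """Read two metric columns from the (hypothetical) score file."""
    a, b = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            a.append(float(row[metric_a]))
            b.append(float(row[metric_b]))
    return a, b

if __name__ == "__main__":
    bleu, meteor = load_metric_columns("shared_task_scores.csv", "BLEU", "METEOR")
    rho, p = spearmanr(bleu, meteor)
    print(f"Spearman correlation between BLEU and METEOR: {rho:.3f} (p={p:.3g})")
```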
Goals
- To establish differences and similarities between different evaluation strategies.
- To study variability in human judgments.
- To study task effects in the presentation of human evaluation tasks.
- To challenge researchers to develop new evaluation metrics for NLG.
Timeline
- Release of the shared task data and automatic evaluation scores.
- Submission deadline for the evaluation scores.
- Release of the category labels (system, human, synthetic).
- Paper submission.
- Notification of acceptance.
- Camera-ready deadline.
- Workshop.
Organizing committee
We're still looking for volunteers. Use issue #1 to express your interest in joining the effort to make this event happen.