This repository contains a python implementation of the GLEU metric (General Language Evaluation Understanding), which can be used for any monolingual "translation" task. It also contains human rankings of the CoNLL-14 Shared Task system output as well as scripts to evaluate the rankings to extract an absolute system ranking.
These results were described in the ACL 2015 paper:
Ground Truth for Grammatical Error Correction Metrics by Courtney Napoles, Keisuke Sakaguchi, Joel Tetreault, and Matt Post
Please cite this work when using this data or the GLEU metric.
@InProceedings{napoles-EtAl:2015:ACL-IJCNLP,
author = {Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel},
title = {Ground Truth for Grammatical Error Correction Metrics},
booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
month = {July},
year = {2015},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {588--593},
url = {http://www.aclweb.org/anthology/P15-2097}
}
As of May 2, 2016, we have identified a problem with the GLEU metric as the number of references increases.
To resolve this issue, we made a minor adjustment to the metric so that it no longer has a tunable weight and is reliable using any number of reference sets.
This update to GLEU is reflected in scripts/compute_gleu
and scripts/gleu.py
.
The original GLEU scripts can be found in scripts/original_gleu/
.
We do not recommend using the original GLEU code. The new GLEU should be used instead.
The changes to GLEU and updated results to our ACL 2015 paper are described in the eprint, GLEU Without Tuning. The citation for the updated metric is
@Article{napoles2016gleu,
author = {Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel},
title = {{GLEU} Without Tuning},
journal = {eprint arXiv:1605.02592 [cs.CL]},
year = {2016},
url = {http://arxiv.org/abs/1605.02592}
}
The rankings found in the gec-ranking-data correspond to the 12 system outputs from the CoNLL-14 Shared Task on Grammatical Error Correction, which can be downloaded from http://www.comp.nus.edu.sg/~nlp/conll14st.html.
Human judgments are located in gec-ranking/data
.
To get the human rankings, run TrueSkill (which can be downloaded from
https://github.com/keisks/wmt-trueskill) on all_judgments.csv
, following
the instructions in the TrueSkill readme.
GLEU is included in gec-ranking/scripts
. To obtain the GLEU scores for
system output, run the following command:
./compute_gleu -s source_sentences -r reference [reference ...] \
-o system_output [system_output ...] -n 4 -l 0.0
where each file contains one sentence per line. GLEU can be run with multiple references. To get the GLEU scores of multiple outputs, include the path to each system output file. GLEU was developed using Python 2.7.
I-measure scores were taken from Felice and Briscoe's 2015 NAACL paper, Towards a standard evaluation method for grammatical error detection and correction. The I-measure scorer can be downloaded from https://github.com/mfelice/imeasure.
M2 scores were calculated using the official scorer (3.2) of the CoNLL-2014 Shared Task (http://www.comp.nus.edu.sg/~nlp/sw/).
There was an error in the calculation of the GLEU denominator, which was corrected in the 10 March 2016 commit.
Please contact Courtney Napoles (courtneyn[at]jhu[dot]edu) or Keisuke Sakaguchi (keisuke[at]cs[dot]jhu[dot]edu) with any questions.
Last updated 10 May 2016