This repository contains the code for the EACL 2023 Findings paper PriMeSRL-Eval: A Practical Quality Metric for Semantic Role Labeling Systems Evaluation
We have our proposed evaluation of SRL quality and the official evaluation from the CoNLL2005 (span evaluation) and CoNLL2009 (head evaluation). We use an SRL format that extends the CoNLL-U format (see SRL format section below).
We have the following pipelines:
/data
repositoryconda create -n eval python=3.9
conda activate eval
pip install -r requirements.txt
pytest tests
python run_evaluations.py --gold-conllu <file> --pred-conllu <file> --format <str>--output-folder <folder>
--format
can take following values: [conll09, conll05, conllu]
data
and tests/data
folder for examples.python run_evaluations.py -g data/conll09/gold_file -p data/conll09/pred_file -f conll09 -o tmp
The evaluation script will show the results from the official CoNLL scripts and our proposed evaluation method. Please see the paper on how to interpret and compare these numbers.
We have encoded all examples in our paper as unit tests. See tests/README.md
for how to match up numbers in the tests with those presented in the paper.
In short, the data for the tables are in the tests/data/<evaluation>/input
with similar naming scheme to the examples in the table. The evaluation results
presented in the paper are in these folders with this structure:
tests/data/<evaluation>/expected/compare-*/comparison-results-[official-conll|proposed].csv
.
We provide a script to format these comparison results, example usages:
python format_results.py -c tests/data/sense/expected/compare-sense_test-sense_pred_p1/comparison-results-official-conll.csv
python format_results.py -c tests/data/sense/expected/compare-sense_test-sense_pred_p1/comparison-results-proposed.csv
We use an extended CoNLL-U format
that replaces the MISC
column with the additional columns below:
ISPRED
- Flag Y
when the token is a predicate, _
otherwise.PREDSENSE
- Predicate sense from PropBank.ARGS
- One column for each predicate, in order of appearance, i.e. first
argument column contains the arguments for the first predicate.To contribute to this repository, particularly new unit tests of other interesting error combinations, please open an issue for discussion and any subsequent PR.
@misc{jindal2022primesrleval,
title={PriMeSRL-Eval: A Practical Quality Metric for Semantic Role Labeling Systems Evaluation},
author={Ishan Jindal and Alexandre Rademaker and Khoi-Nguyen Tran and Huaiyu Zhu and Hiroshi Kanayama and Marina Danilevsky and Yunyao Li},
year={2022},
eprint={2210.06408},
archivePrefix={arXiv},
primaryClass={cs.CL}
}