Welcome to the repository for the standardization of text annotation tasks. This document provides an overview of the repository's structure as well as general formatting specifications for data files. The specifications outlined below are not exhaustive and may vary depending on your annotation task. This repository supports the following annotation tasks:
Each task follows the structure shown in the diagram below.
```
task_type
│
└───task_name
    │   README.md
    │   guidelines.md
    │
    └───language
        │
        └───tutorial
        │       uses.tsv
        │       instances.tsv
        │       judgments.tsv
        │
        └───data
                uses.tsv
                instances.tsv
                judgments.tsv
```
Each `task_type` will be one of the following:

For each `task_name`, you will find a task-specific README file containing suggested configurations for data files. Additionally, you will find a guidelines document containing guidelines for annotators. Each `language` folder contains the folders `tutorial` and `data`. The `tutorial` folder contains a small set of data for training annotators. The `data` folder contains a set of real, pre-annotated data.
All data must be provided in .tsv format. Each annotation task contains, at minimum, uses.tsv, instances.tsv, and judgments.tsv files. The general format for each of these files is described below; however, depending on the task, additional fields and files may be required. The README for each task specifies the format of the instances.tsv and judgments.tsv files, as well as any additional files that may be required.
A uses.tsv file contains a set of texts to be annotated and has the following fields:

- `dataID`: A unique text ID.
- `context`: The context of use in which the target lemma occurs.
- `indices_target_token`: The character indices of the target token in `context` (Python list ranges, as used in slicing). Multiple index slices are possible in certain cases; for example, many English phrasal verbs allow an object between the verb and the preposition, e.g. switch off: 1) He switched the light off, 2) He switched off the light. See the sketch after this list.
- `indices_target_sentence`: The index slice containing the sentence in which the target occurs.
- `lemma`: The target lemma.
- `info`: An optional field for providing additional information.
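As a quick illustration of how the character indices can be read back out of `context` with Python slicing (the row values and the comma-separated serialization of multiple slices below are hypothetical; each task-specific README defines the actual format):

```python
# Hypothetical uses.tsv values for the phrasal verb "switch off",
# where the target occupies two separate character spans.
context = "He switched the light off"
indices_target_token = "3:11,22:25"  # assumed serialization of two slices

# Recover the target token(s) via ordinary Python slicing.
slices = [tuple(map(int, s.split(":"))) for s in indices_target_token.split(",")]
tokens = [context[start:end] for start, end in slices]
print(tokens)  # ['switched', 'off']
```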
The instances.tsv file contains the set of annotation instances to be judged by the annotator. It has the following fields:

- `instanceID`: A unique ID for an annotation instance.
- `dataIDs`: One or more dataIDs referencing the data needed to perform the annotation task; the format is given in the task-specific README file.
- `tag`: An optional tag field.
- `label_set`: The set of labels from which the annotator must select, given in the task-specific README file.
- `cannot_annotate`: A special label selected by the annotator when they are unable to provide an annotation.
- `info`: An optional field for providing additional information about each instance.
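A hypothetical instances.tsv row is sketched below (tab-separated; the dataIDs delimiter, label set, and cannot_annotate value are illustrative only; consult the task-specific README for the actual configuration):

```
instanceID	dataIDs	tag	label_set	cannot_annotate	info
instance1	use1,use2	-	1,2,3,4	0	-
```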
The judgments.tsv file is the format for representing annotated instances as well as the set of gold standard annotations. It contains the following fields:

- `instanceID`: The unique instance ID corresponding to an annotation instance in the instances.tsv file.
- `label`: The label value for a given annotation instance, as specified in the task-specific README.
- `comment`: A comment field for the annotator.
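Continuing the hypothetical example above, a corresponding judgments.tsv row produced by an annotator might look like this (tab-separated; values are illustrative):

```
instanceID	label	comment
instance1	4	-
```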
The `scripts` folder is structured according to the diagram below.
```
scripts
│
└───random_annotator
│       annotation_provider.py
│       random_annotate.py
│
└───dwug_converter
│       convert_dwug.py
│
└───transform_wssim
│       transform_wssim.py
│
└───transform_wsbest
│       transform_wsbest.py
│
└───transform_lexsub
│       transform_lexsub.py
│
└───evaluation
        evaluation.py
```
The `scripts` folder contains six directories: `random_annotator`, `dwug_converter`, `transform_wssim`, `transform_wsbest`, `transform_lexsub`, and `evaluation`.
The `random_annotator` script generates randomly annotated data for a given annotation task. Instructions for running the random_annotate.py script are given in the task-specific README.
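Conceptually, random annotation amounts to drawing one label per instance from that instance's label set. The sketch below illustrates the idea; it is not the repository's random_annotate.py, and the file paths, column names, and comma-separated label_set encoding are assumptions based on the field descriptions above:

```python
import csv
import random

# Draw a random label for every instance and write the results
# in the judgments.tsv format described above.
# Assumes label_set is a comma-separated string such as "1,2,3,4".
with open("instances.tsv", newline="", encoding="utf-8") as f_in, \
     open("judgments.tsv", "w", newline="", encoding="utf-8") as f_out:
    reader = csv.DictReader(f_in, delimiter="\t")
    writer = csv.writer(f_out, delimiter="\t")
    writer.writerow(["instanceID", "label", "comment"])
    for row in reader:
        label = random.choice(row["label_set"].split(","))
        writer.writerow([row["instanceID"], label, "-"])
```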
The `dwug_converter` downloads pre-annotated Word Usage Graph (WUG) data (https://www.ims.uni-stuttgart.de/en/research/resources/experiment-data/wugs/) and converts it to the standard format outlined in this repository. Instructions for running the convert_dwug.py script can be found in the task-specific README.
The `transform_wssim` script formats pre-annotated data for the WSSIM task according to the standard format outlined in this repository. The data can be found at http://www.dianamccarthy.co.uk/downloads/WordMeaningAnno2012/. Instructions for running the transform_wssim.py script can be found in the task-specific README.
The `evaluation` script generates an evaluation.tsv file containing results across various evaluation metrics, given two .tsv files in the judgments format. For example, this script can be used to evaluate automatically annotated data against gold standard annotations. More information on running the script can be found in the task-specific README.
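As a rough sketch of the kind of comparison such an evaluation performs (this is not the repository's evaluation.py; the accuracy metric, the file names, and the assumption that both files cover the same instances are illustrative):

```python
import csv

def read_judgments(path):
    """Map instanceID -> label from a file in the judgments.tsv format."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["instanceID"]: row["label"]
                for row in csv.DictReader(f, delimiter="\t")}

# Compare one set of judgments (e.g. randomly annotated data)
# against gold standard annotations on their shared instance IDs.
gold = read_judgments("gold_judgments.tsv")
pred = read_judgments("random_judgments.tsv")
shared = gold.keys() & pred.keys()
accuracy = sum(gold[i] == pred[i] for i in shared) / len(shared)
print(f"Accuracy on {len(shared)} instances: {accuracy:.3f}")
```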
This project is licensed under the terms of the MIT license.