
Annotation Standardization

Introduction

Welcome to the repository for the standardization of text annotation tasks. This document provides an overview of the structure of the repository as well as general formatting specifications for data files. The specifications outlined below are not exhaustive and may vary depending on your annotation task. This repository supports the following annotation tasks:

Structure of the repository

Each task follows the structure shown in the diagram below.

task_type
│
└───task_name
        │   README.md
        │   guidelines.md
        │
        └───language
                │
                ├───tutorial
                │       uses.tsv
                │       instances.tsv
                │       judgments.tsv
                │
                └───data
                        uses.tsv
                        instances.tsv
                        judgments.tsv

Each task_type will be one of the following:

For each task_name, you will find a task-specific README file containing suggested configurations for the data files, along with a guidelines.md document containing the guidelines for annotators. Each language folder contains two folders, tutorial and data. The tutorial folder contains a small set of data for training annotators, while the data folder contains a set of real, pre-annotated data.

General Format

All data must be provided in .tsv (tab-separated values) format. Each annotation task will contain (at minimum) uses.tsv, instances.tsv, and judgments.tsv files. The general format for each of these files is described below; however, depending on the task, additional fields and files may be required. The README for each task specifies the format of its instances.tsv and judgments.tsv files as well as any additional files that may be required.
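
Since the format is plain tab-separated text, the files can be inspected with standard tooling. The following is a minimal sketch (not part of the repository) that loads the three standard files with pandas; it assumes only the tab-separated layout, since the actual column names vary by task and are listed in each task's README.

    # Minimal sketch: load the three standard files for one task/language folder.
    # Only the tab-separated layout is assumed; column names depend on the task.
    import csv
    from pathlib import Path
    import pandas as pd

    def load_task_data(data_dir: str) -> dict:
        data_dir = Path(data_dir)
        frames = {}
        for name in ("uses", "instances", "judgments"):
            path = data_dir / f"{name}.tsv"
            # QUOTE_NONE keeps quotation marks inside the text fields intact
            frames[name] = pd.read_csv(path, sep="\t", quoting=csv.QUOTE_NONE)
        return frames

    # Hypothetical usage following the repository layout:
    # data = load_task_data("task_type/task_name/en/data")
    # print(data["uses"].columns.tolist())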

Uses

A uses.tsv file contains a set of texts to be annotated and has the following fields:


Scripts

The scripts folder is structured according to the diagram below.

scripts
│
├─── random_annotator
│       annotation_provider.py
│       random_annotate.py
│
├─── dwug_converter
│       convert_dwug.py
│
├─── transform_wssim
│       transform_wssim.py
│
├─── transform_wsbest
│       transform_wsbest.py
│
├─── transform_lexsub
│       transform_lexsub.py
│
└─── evaluation
        evaluation.py

The scripts folder contains six directories: random_annotator, dwug_converter, transform_wssim, transform_wsbest, transform_lexsub, and evaluation.

The random_annotator generates randomly annotated data for a given annotation task. Instructions for running the random_annotate.py script are given in the task-specific README.
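
For intuition, the sketch below shows the general idea of random annotation; it is not a reproduction of random_annotate.py, and the column names (instanceID, label), the label set, and the output layout are hypothetical.

    # Sketch of random annotation: assign a uniformly random label to each instance.
    # Column names and label scale are hypothetical; see the task README for the real format.
    import csv
    import random
    import pandas as pd

    def random_judgments(instances_path: str, labels=(1, 2, 3, 4), annotator="random", seed=0):
        rng = random.Random(seed)
        instances = pd.read_csv(instances_path, sep="\t", quoting=csv.QUOTE_NONE)
        return pd.DataFrame({
            "instanceID": instances["instanceID"],                      # hypothetical column
            "label": [rng.choice(labels) for _ in range(len(instances))],
            "annotator": annotator,
        })

    # judgments = random_judgments("task_type/task_name/en/data/instances.tsv")
    # judgments.to_csv("judgments_random.tsv", sep="\t", index=False)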

The dwug_converter downloads pre-annotated Word Usage Graph (WUG) data (https://www.ims.uni-stuttgart.de/en/research/resources/experiment-data/wugs/) and converts it to the standard format outlined in this repository. Instructions for running the convert_dwug.py script can be found in the task-specific README.
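
As a rough illustration of such a conversion (not the actual convert_dwug.py), the sketch below downloads a WUG release archive and merges its per-lemma files. The archive URL is a placeholder, the assumption that the archive contains tab-separated per-lemma uses.csv and judgments.csv files follows the published DWUG releases, and the final renaming to this repository's exact column layout is left out.

    # Sketch: download a WUG release and merge per-lemma files into single TSVs.
    # Assumes tab-separated uses.csv / judgments.csv per lemma, as in published DWUG releases.
    import csv
    import io
    import urllib.request
    import zipfile
    from pathlib import Path
    import pandas as pd

    def convert_wug_archive(url: str, out_dir: str) -> None:
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as resp:
            archive = zipfile.ZipFile(io.BytesIO(resp.read()))
        uses, judgments = [], []
        for name in archive.namelist():
            if name.endswith("uses.csv"):
                uses.append(pd.read_csv(archive.open(name), sep="\t", quoting=csv.QUOTE_NONE))
            elif name.endswith("judgments.csv"):
                judgments.append(pd.read_csv(archive.open(name), sep="\t", quoting=csv.QUOTE_NONE))
        pd.concat(uses).to_csv(out / "uses.tsv", sep="\t", index=False)
        pd.concat(judgments).to_csv(out / "judgments.tsv", sep="\t", index=False)

    # Hypothetical usage; the release URL must be taken from the WUG resource page:
    # convert_wug_archive("https://.../dwug_en.zip", "task_type/task_name/en/data")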

The transform_wssim script formats pre-annotated data for the WSSIM task according to the standard format outlined in this repository. The data can be found at http://www.dianamccarthy.co.uk/downloads/WordMeaningAnno2012/. Instructions for running the transform_wssim.py script can be found in the task-specific README.

The evaluation script generates an evaluation.tsv file containing results across various evaluation metrics, given two .tsv files in the judgments format. For example, this script can be used to evaluate automatically annotated data against gold-standard annotated data. More information on running the script can be found in the task-specific README.
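
As a hedged example of this kind of comparison (not the actual evaluation.py), the sketch below aligns two judgments files on a shared instance identifier and computes Spearman's rank correlation; the column names instanceID and label are hypothetical, and the real script may report additional metrics.

    # Sketch: compare system judgments against gold judgments on one metric.
    # Column names are hypothetical; see the task README for the real judgments format.
    import csv
    import pandas as pd
    from scipy.stats import spearmanr

    def compare_judgments(system_path: str, gold_path: str) -> float:
        system = pd.read_csv(system_path, sep="\t", quoting=csv.QUOTE_NONE)
        gold = pd.read_csv(gold_path, sep="\t", quoting=csv.QUOTE_NONE)
        merged = system.merge(gold, on="instanceID", suffixes=("_system", "_gold"))
        rho, _ = spearmanr(merged["label_system"], merged["label_gold"])
        return rho

    # rho = compare_judgments("judgments_random.tsv", "data/judgments.tsv")
    # print(f"Spearman's rho: {rho:.3f}")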


License

This project is licensed under the terms of the MIT license.