clamsproject / aapb-annotations

Repository to store manual annotation dataset developed for CLAMS-AAPB collaboration

AAPB-CLAMS Annotation Repository

This repository contains datasets from manual annotation projects in AAPB-CLAMS collaboration.

Project Information

The American Archive of Public Broadcasting (AAPB) has engaged the CLAMS team to develop information extraction systems for digital archives of public media (primarily video and audio from publicly funded TV shows and radio broadcasts). This work will facilitate the research and preservation of significant historical content from this media collection. Some parts of the process of archiving, summarizing, and extracting metadata from media assets could eventually be automated. This repository provides training and evaluation data for the machine learning-based CLAMS apps used in this process.

Structure of This Repository/Directory

batches Subdirectory

The batches subdirectory is special: it tracks the source data for the whole repository and annotation effort.

Smaller selections of the AAPB collection are chosen and cataloged as batches in this subdirectory. These sets are chosen for the variety or utility that the applications developed here require.
A batch is the set of identifying GUIDs (tags) for that group of media assets.

Batches are decided some time before annotations begin. Annotation projects then choose appropriate batches for each moment/period of annotation work. (See raw annotation section.)

Specifically, this directory contains .txt files named after the batch name.

[!NOTE] An AAPB-GUID is not a universally unique identifier (UUID); it is only a unique identifier within the scope of the AAPB system.
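As an illustration of how a batch file might be consumed, here is a minimal sketch. It assumes one GUID per line, which this README does not state explicitly, and it treats blank lines and lines starting with `#` as non-data; both are assumptions, and `load_batch` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def load_batch(batch_file):
    """Return the GUIDs listed in a batch file, in file order, without duplicates.

    Assumption: the file holds one GUID per line.
    """
    guids = []
    seen = set()
    for line in Path(batch_file).read_text(encoding="utf-8").splitlines():
        guid = line.strip()
        # Assumption: blank lines and '#'-prefixed lines carry no GUID.
        if not guid or guid.startswith("#"):
            continue
        if guid not in seen:
            seen.add(guid)
            guids.append(guid)
    return guids
```

The batch name would then be the file name without its `.txt` extension, e.g. `Path(batch_file).stem`.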

Project Subdirectories

Every other subdirectory in this repository represents a specific annotation project, its datasets and processing tools.
This includes its raw annotated data files, the gold-formatted final output data files for tool ingestion, the software for converting from raw to gold, and a project-specific README.md explaining the project and its annotation guidelines.

The subdirectory name is the name of the project. Each subdirectory contains the following files:

Raw annotation files

[!IMPORTANT] YYMMDD-batchName directory

This contains the raw output files from the manual annotation process, created either by the annotation tool or by hand (e.g., as a .csv file).

As the name of this subdirectory suggests, the raw annotation files are organized by batch name and annotation date. A single "period" of annotation is the whole process of annotating a single batch of source data (AAPB assets). The YYMMDD- prefix must indicate when the batch's annotation was conducted (e.g., when the batch was chosen for annotation). These prefixes are used for sorting annotation periods and for machine ingestion of the raw data. The batchName part of the directory name must match exactly one of the .txt files in the batches directory.
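The naming rule above can be checked mechanically. The following is a sketch only: `parse_raw_dirname` and `batch_file_exists` are hypothetical helpers, and the layout assumed is a `batches` directory holding `<batchName>.txt` files as described earlier.

```python
import re
from datetime import datetime
from pathlib import Path

# Six digits, a hyphen, then the batch name (which may itself contain hyphens).
_DIR_PATTERN = re.compile(r"^(\d{6})-(.+)$")


def parse_raw_dirname(name):
    """Split 'YYMMDD-batchName' into (date, batch_name); raise ValueError if malformed."""
    match = _DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a YYMMDD-batchName directory: {name!r}")
    # strptime raises ValueError when the six digits are not a valid date.
    date = datetime.strptime(match.group(1), "%y%m%d").date()
    return date, match.group(2)


def batch_file_exists(batch_name, batches_dir):
    """Return True if batches_dir contains '<batch_name>.txt'."""
    return (Path(batches_dir) / f"{batch_name}.txt").is_file()
```

Because the prefix is zero-padded, sorting the directory names as plain strings also sorts them chronologically within a century.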

Different annotation tools create different file formats with diverse formatting. Hence, we need conversion of the raw annotation files to files with a common format that we call golds.

Gold dataset files

[!IMPORTANT] golds directory

This directory contains the public "gold" dataset generated by the project's conversion script (see "Codebase for format conversion").

The gold dataset is a set of files that are in a format that is ready for machine consumption primarily for

  1. training ML models for CLAMS apps,
  2. evaluation of CLAMS app outputs,
  3. other public usage

In other words, the distinction between raw and golds is purely a matter of machine consumption. Because golds files follow some organizational rules (see below), users of the AAPB-CLAMS dataset may find golds data easier to consume programmatically than raw data.

Codebase for format conversion

[!IMPORTANT] (usually) process.{sh,py} and dependencies

A piece of software that processes the raw annotation files and generates the gold dataset. The input file format (i.e., the direct output of the annotation process) can vary (e.g., .csv, .json, .txt). The output file format must be a common machine-readable data format (CSV, JSON, definitely not MMIF) and is subject to change to meet future requirements of the consuming software. Users of a gold dataset should therefore be aware of which version they are using, and are encouraged to use permalinks when referring to a specific version of the gold dataset in their code or documentation.
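For orientation, a conversion script could look roughly like the sketch below. It is not any project's actual process.py: it assumes the raw data is a single CSV file with a column identifying the asset (named `guid` here, a hypothetical column name), and it writes one CSV output file per GUID.

```python
import csv
from collections import defaultdict
from pathlib import Path


def convert_raw_to_golds(raw_csv, golds_dir, guid_column="guid"):
    """Group raw CSV rows by GUID and write one gold CSV per GUID.

    Assumption: every row has a `guid_column` field naming its asset.
    Returns the list of written paths.
    """
    rows_by_guid = defaultdict(list)
    with open(raw_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        for row in reader:
            rows_by_guid[row[guid_column]].append(row)

    out_dir = Path(golds_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for guid, rows in rows_by_guid.items():
        out_path = out_dir / f"{guid}.csv"
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(out_path)
    return written
```

A real project would usually also rename or normalize columns at this step, since raw formats differ between annotation tools.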

To ensure consistency across data consumption software, process.py must meet a few requirements.

  1. The script must generate one file per GUID.
  2. The number of gold files in the golds directory must match the total number of GUIDs across all annotated batches (YYMMDD-xxx subdirectories).
    • Namely, there must not be any overlap between assets in batches.
  3. The golds directory must not have subdirectories.
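These requirements lend themselves to an automated check. The sketch below is one possible way to do it, not a tool shipped with this repository; `check_golds` is a hypothetical helper that takes the golds directory and a mapping from batch name to that batch's GUIDs.

```python
from pathlib import Path


def check_golds(golds_dir, batches):
    """Check a golds directory against the annotated batches.

    `batches` maps each annotated batch name to its list of GUIDs.
    Returns a list of human-readable problems (empty if all checks pass).
    """
    problems = []
    entries = list(Path(golds_dir).iterdir())

    # Requirement 3: no subdirectories inside golds.
    for entry in entries:
        if entry.is_dir():
            problems.append(f"unexpected subdirectory: {entry.name}")

    # Requirement 2 (sub-point): no asset may appear in more than one batch.
    first_batch = {}
    for batch_name, guids in batches.items():
        for guid in guids:
            if guid in first_batch and first_batch[guid] != batch_name:
                problems.append(
                    f"{guid} appears in both {first_batch[guid]} and {batch_name}"
                )
            first_batch.setdefault(guid, batch_name)

    # Requirements 1 and 2: one file per GUID, so the counts must agree.
    expected = sum(len(guids) for guids in batches.values())
    actual = sum(1 for entry in entries if entry.is_file())
    if actual != expected:
        problems.append(f"expected {expected} gold files, found {actual}")

    return problems
```

An empty return value means all three rules hold for the given inputs; the function only counts files and does not inspect their contents.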

If the main code file requires additional dependencies or scripts, they can live at the same level in that subdirectory. Dependency information can be written in the README.md file or in a machine-friendly file listing the dependencies (e.g., requirements.txt for pip).

Finally, see the conventions section for the naming conventions for common field/column names in golds data.

Information README

[!IMPORTANT] README.md (and possibly guidelines.{md,ppt})

Project-specific information, including but not limited to:

[!NOTE] The README.md and guidelines.{md,ppt} files should be actively maintained by the project manager. Keeping all guideline files under version control is recommended.

Repository-level Conventions

Please see the Repository-level Conventions file for standardizations, explanations and conventions.

TL;DR

[!IMPORTANT] Media Time = hh:mm:ss.mmm with a DOT
Annotation times are usually slightly imprecise, either because the audiovisual phenomena themselves are imprecise or because visualizing and labelling them is.
Some estimates of imprecision are given by Margin of Error.
Directionality definitions help frame the boundaries meant by annotated times.
The fields in the gold datasets should be standardized.
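To make the media time convention concrete, here is a small sketch that converts between the hh:mm:ss.mmm string form and integer milliseconds. The function names are hypothetical, and rejecting anything other than a dot before the milliseconds is this sketch's reading of "with a DOT".

```python
import re

# Hours (two or more digits), minutes, seconds, then a DOT and exactly three digits.
_MEDIA_TIME = re.compile(r"^(\d{2,}):([0-5]\d):([0-5]\d)\.(\d{3})$")


def media_time_to_ms(text):
    """Convert 'hh:mm:ss.mmm' to integer milliseconds; raise ValueError if malformed."""
    match = _MEDIA_TIME.match(text)
    if match is None:
        raise ValueError(f"not in hh:mm:ss.mmm format: {text!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def ms_to_media_time(total_ms):
    """Convert non-negative integer milliseconds to 'hh:mm:ss.mmm'."""
    if total_ms < 0:
        raise ValueError("media time cannot be negative")
    seconds, millis = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
```

For example, `media_time_to_ms("00:01:02.500")` returns `62500`, while a comma-separated value such as `"00:01:02,500"` raises `ValueError`.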

List of Current Projects/Subdirectories

This section is currently manually updated and may be incomplete. It contains information up to the readme's editing date.

Issue Tracking and Conversation Archive

Progress and other discussion among AAPB, CLAMS, and WGBH are tracked via open and closed GitHub Issues. For other inquiries, please email the CLAMS.ai admin.