IAHLT / UD_Hebrew

Hebrew Universal Dependencies Treebank

Summary

IAHLT version of the UD Hebrew Treebank (IAHLT-HTB)

Introduction

What is this?

This is a revised fork of the Universal Dependencies version of the Hebrew Treebank, with some important changes and a consistency overhaul involving substantial manual corrections. The dataset was prepared as part of the Hebrew & Arabic Corpus Linguistics Infrastructure project at the Israeli Association of Human Language Technologies (IAHLT).

Before using this data, we highly recommend reading the IAHLT treebanking documentation (coming soon!) and the general principles outlined below. This dataset is still a work in progress.

Universal Dependencies - Hebrew Dependency Treebank (v2) https://github.com/UniversalDependencies/UD_Hebrew

General principles

This version of the HTB data adheres to the following principles:

History

V1 of the dependency corpus was built by semi-automatic conversion of the Hebrew Constituency Treebank (v2) by MILA.

V2, referred to below as UD-HTB, was converted from V1 using automatic conversion where possible, and manual conversion and verification elsewhere (see papers below).

This version is currently referred to as IAHLT-HTB.

Structure

This directory contains a corpus of sentences annotated using Universal Dependencies annotation. The corpus comprises 115K tokens (158K words) and 6,216 sentences, taken from the Ha'aretz newspaper. The sentences were manually annotated as phrase-structure trees and then semi-automatically converted to Universal Dependencies.

This file is compatible with the CoNLL-U format defined for Universal Dependencies. See: http://universaldependencies.github.io/docs/format.html. At present, however, the files do not include lemmas for words. These may be added in a later release.
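As a rough illustration of the file layout, the following sketch reads CoNLL-U text into sentences, each a list of rows with the ten standard CoNLL-U columns. The function names and the sample sentence are invented for this example and are not taken from the treebank; the sample uses `_` in the lemma column, as the paragraph above describes.

```python
from typing import Dict, Iterable, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of column dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (e.g. sent_id, text) are skipped in this sketch.
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:
        # Handle input that does not end with a blank line.
        yield sentence


def count_words(sentence: List[Dict[str, str]]) -> int:
    """Count syntactic words: rows whose ID is a plain integer.

    Multiword-token ranges such as "1-2" are surface tokens, not words.
    """
    return sum(1 for row in sentence if row["id"].isdigit())


# Invented sample, not real treebank data: one surface token split into two words.
SAMPLE = (
    "# sent_id = example-1\n"
    "1-2\tAB\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tA\t_\tADP\t_\t_\t2\tcase\t_\t_\n"
    "2\tB\t_\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
print(len(sentences), count_words(sentences[0]))
```

To read an actual file, pass an open file handle (opened with `encoding="utf-8"`) to `read_conllu` in place of the sample lines.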

The dependency taxonomy can be found on the Universal Dependencies web site:

http://universaldependencies.github.io/docs/
http://universaldependencies.github.io/docs/#language-he

The Train/Dev/Test split follows previous splits of the underlying Treebank, namely: sentences 1-484 dev (~10K tokens), 485-5725 train (~127K tokens), 5726-6216 test (~11K tokens).
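The sentence ranges above can be written down as a small lookup. This is only a sketch: it assumes sentences are numbered from 1 in the original treebank order, and the names `SPLITS`, `split_for`, and `split_sizes` are made up for illustration. How the released files encode sentence numbers is not specified here.

```python
from typing import Dict, Tuple

# Inclusive sentence-number ranges, as stated in the text above.
SPLITS: Dict[str, Tuple[int, int]] = {
    "dev": (1, 484),
    "train": (485, 5725),
    "test": (5726, 6216),
}


def split_for(sentence_number: int) -> str:
    """Return the split name for a 1-based sentence number."""
    for name, (first, last) in SPLITS.items():
        if first <= sentence_number <= last:
            return name
    raise ValueError(f"sentence number out of range: {sentence_number}")


def split_sizes() -> Dict[str, int]:
    """Number of sentences in each split, derived from the ranges."""
    return {name: last - first + 1 for name, (first, last) in SPLITS.items()}


print(split_sizes())
```

The three ranges are contiguous and together cover 6,216 sentences, which agrees with the corpus size given under Structure.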

Some parts of the structure are more reliable than others. In particular, a "morphological feature" entry of HebSource=ConvUncertainHead or HebSource=ConvUncertainLabel on a word indicates that the head or the label, respectively, of that token is based on unreliable information.
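A minimal sketch of how such tokens might be detected from the FEATS column, assuming the standard CoNLL-U feature syntax (`Name=Value` pairs joined by `|`, multiple values joined by `,`, and `_` for no features). The helper names are hypothetical and the feature strings in the usage lines are invented examples.

```python
from typing import Dict, List


def parse_feats(feats: str) -> Dict[str, List[str]]:
    """Parse a CoNLL-U FEATS string into a feature -> list-of-values mapping."""
    if not feats or feats == "_":
        return {}
    parsed: Dict[str, List[str]] = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        parsed[name] = value.split(",")
    return parsed


def uncertain_parts(feats: str) -> Dict[str, bool]:
    """Report whether the head and/or the label of a token is flagged as uncertain."""
    values = parse_feats(feats).get("HebSource", [])
    return {
        "head": "ConvUncertainHead" in values,
        "label": "ConvUncertainLabel" in values,
    }


# Invented feature strings for illustration.
print(uncertain_parts("HebSource=ConvUncertainHead"))
print(uncertain_parts("Gender=Masc|Number=Sing"))
```

A parser evaluation could use a check like this to report scores with and without the flagged tokens, though whether to exclude them is a decision for the user.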

Fixes

To help improve the corpus, please alert us to any errors you find in it. For underlying issues in the source data (UD-HTB), contact Yoav Goldberg at yoav.goldberg@gmail.com or Reut Tsarfaty at reut.tsarfaty@gmail.com.

For issues specific to the IAHLT-HTB version, please contact Amir Zeldes at amir-zeldes@georgetown.edu

Acknowledgments

The Universal Dependencies Hebrew Treebank was created by (in alphabetical order):

Revised IAHLT version:

The following people were also involved in the creation of v2:

The Universal Dependencies Hebrew Treebank is based on the Hebrew Constituency Treebank (v2) developed by MILA, The Knowledge Center for Processing Hebrew (http://www.mila.cs.technion.ac.il/resources_treebank.html).

References

If you use the Hebrew Universal Dependencies Treebank, you are encouraged to cite the following papers, which describe the original source treebank:

    @inproceedings{tsarfaty2013unified,
        title={A Unified Morpho-Syntactic Scheme of Stanford Dependencies},
        author={Tsarfaty, Reut},
        booktitle={Proc. of ACL},
        year={2013}
    }

    @inproceedings{mcdonald2013universal,
        title={Universal Dependency Annotation for Multilingual Parsing},
        author={McDonald, Ryan T and Nivre, Joakim and Quirmbach-Brundage, Yvonne and Goldberg, Yoav and Das, Dipanjan and Ganchev, Kuzman and Hall, Keith B and Petrov, Slav and Zhang, Hao and T{\"a}ckstr{\"o}m, Oscar and others},
        booktitle={Proc. of ACL},
        year={2013}
    }

Note that these papers do not accurately reflect the current annotation in the Treebank. A more up-to-date publication discussing the IAHLT scheme and tokenization is:

    @inproceedings{ZeldesHowellOrdanBenMoshe2022,
        author    = {Amir Zeldes and Nick Howell and Noam Ordan and Yifat Ben Moshe},
        booktitle = {Proceedings of {EMNLP} 2022},
        title     = {A Second Wave of {UD} {H}ebrew Treebanking and Cross-Domain Parsing},
        year      = {2022},
        pages     = {4331--4344},
        address   = {Abu Dhabi, UAE},
        url       = {https://aclanthology.org/2022.emnlp-main.292/},
    }