UniversalDependencies / UD_Irish-IDT

Summary

A Universal Dependencies 4910-sentence treebank for modern Irish.

Introduction

The Irish UD Treebank (IUDT) is a conversion of the Irish Dependency Treebank (IDT), which was part of a PhD research project by Teresa Lynn at Dublin City University, Ireland (Lynn, 2016).

The (smaller) IDT dataset has also been released on [GitHub](https://github.com/tlynn747/IrishDependencyTreebank).

The Treebank contains 4910 sentences.

The first 2924 sentences were taken from the New Corpus of Ireland-Irish (NCII), with text from books, newswire, websites and other media. These sentences are a subset of a gold-standard POS-tagged corpus for Irish made available by Elaine Uí Dhonnchadha of Trinity College Dublin.

The subsequent 1986 sentences were taken from a corpus of Irish public administration translations and are available under the Open Data (PSI) directive for sharing of public data:

* Citizens Information website: 20%
* Dublin City Council (DCC): 25%
* Department of Culture, Heritage and the Gaeltacht (DCHG): 9%
* Údarás na Gaeltachta: 25%
* EUbookshop: 21%

The conversion from the IDT annotation scheme to the UD annotation scheme for the first release (1020 IDT trees) was designed by Teresa Lynn and Jennifer Foster at Dublin City University, Ireland. The mapping to UD is reported in Lynn and Foster (2016). Conversion of sentences 1-1020 was automatic, with manual review. Subsequent updates and changes have combined automatic labelling with manual review. All trees with sentence ID greater than 1021 were created through an automatic pre-parsing approach followed by manual review.

The UD Treebank is split into two sets as follows:

Note: the 451 dev trees were taken from the set of newly annotated trees in the v2.5 release. The selection of test sentences hasn't changed since v1.0 (but annotations and quality have!)

Acknowledgements

We wish to thank all of the contributors to the original IDT annotation, including Elaine Uí Dhonnchadha for her gold POS-tagged corpus and linguistic advice. We would also like to acknowledge linguistic advice offered by Kevin Scannell in the conversion to UD effort.

Expansion of the IUDT from 2019-2021 is funded by the Irish Government Department of Culture, Heritage and the Gaeltacht under the GaelTech project.

This research is partially supported by Science Foundation Ireland through the ADAPT Centre for Digital Content Technology. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Statistics

* Trees: 4910
* Token count: 23686
* Dependency relations: 36, of which 10 language-specific
* POS tags: 17
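Figures like these can be recomputed from the released files. The following is a minimal sketch, assuming the data is in the standard 10-column CoNLL-U format used across UD; the function name `conllu_stats` and the file name in the usage comment are illustrative assumptions, not part of this repository's tooling. It treats any relation containing a colon (for example `universal:subtype`) as a subtyped, language-specific label, which is the general UD convention.

```python
from collections import Counter
from typing import Dict, Iterable


def conllu_stats(lines: Iterable[str]) -> Dict[str, object]:
    """Count trees, syntactic words and distinct labels in CoNLL-U lines."""
    trees = 0
    tokens = 0
    deprels: Counter = Counter()
    upos: Counter = Counter()
    in_sentence = False

    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            in_sentence = False  # a blank line closes the current sentence
            continue
        if line.startswith("#"):
            continue  # comment lines such as "# sent_id = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges ("3-4") and empty nodes ("5.1")
        if not in_sentence:
            trees += 1
            in_sentence = True
        tokens += 1
        upos[cols[3]] += 1      # column 4: UPOS
        deprels[cols[7]] += 1   # column 8: DEPREL

    return {
        "trees": trees,
        "tokens": tokens,
        "deprel_labels": len(deprels),
        "subtyped_deprels": sorted(d for d in deprels if ":" in d),
        "upos_tags": len(upos),
    }


# Hypothetical usage; the path is an assumption about the local checkout:
# with open("ga_idt-ud-train.conllu", encoding="utf-8") as handle:
#     print(conllu_stats(handle))
```

Counts obtained this way depend on which release and which split files are read, so they need not match the figures listed above.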

Tokenisation

The tokenisation of the Irish data is the output of a Xerox Finite State tokenizer implemented by Uí Dhonnchadha (2002).

Note:

Other multi-word units are Proper Noun Strings (flat) and some fixed expressions like "cés moite" (apart from). MWE research is being carried out by Abigail Walsh (ADAPT Centre, DCU) as part of her PhD.

Syntax

The Irish UD treebank uses 26 of the UD dependency labels. A further 10 language-specific labels were introduced to deal with certain linguistic phenomena in Irish:

Morphological Features

A word of caution for anyone including morphological features when training parsing models: there are a number of known issues with the morphological features. Many were missing in the v2.6 release (e.g. Case=NomAcc), and the features in the v2.7 expansion set (sentence 2925 onwards) were automatically predicted. Not all of these have yet been fully manually reviewed. This review is expected to be completed for release v2.8.
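One way to act on this caution is to train without the morphological features at all. The sketch below, assuming standard 10-column CoNLL-U input, replaces the FEATS column of every token line with `_` and passes comments and blank lines through unchanged. The name `blank_feats` and the file names in the usage comment are hypothetical; this is not a script shipped with the treebank.

```python
from typing import Iterable, Iterator

FEATS_COLUMN = 5  # zero-based index of FEATS in the 10-column CoNLL-U format


def blank_feats(lines: Iterable[str]) -> Iterator[str]:
    """Yield CoNLL-U lines with the FEATS column replaced by '_'."""
    for raw in lines:
        line = raw.rstrip("\n")
        cols = line.split("\t")
        # Comments, blank lines and anything that is not a 10-column
        # token line are passed through untouched.
        if line.startswith("#") or len(cols) != 10:
            yield line
            continue
        cols[FEATS_COLUMN] = "_"
        yield "\t".join(cols)


# Hypothetical usage; both paths are assumptions:
# with open("ga_idt-ud-train.conllu", encoding="utf-8") as src, \
#         open("train.nofeats.conllu", "w", encoding="utf-8") as dst:
#     for out_line in blank_feats(src):
#         dst.write(out_line + "\n")
```

Blanking every sentence is the bluntest option. A finer-grained variant could keep the features for the manually reviewed portion only, but that would require knowing how sentence IDs are written in the files, which is not stated here.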

References

Christian Brothers, 1988. New Irish Grammar, Dublin: C J Fallon

Lynn, Teresa, Ozlem Cetinoglu, Jennifer Foster, Elaine Uí Dhonnchadha, Mark Dras and Josef van Genabith, [Irish Treebanking and Parsing: A Preliminary Evaluation](http://www.lrec-conf.org/proceedings/lrec2012/pdf/378_Paper.pdf), LREC 2012, Istanbul, May 2012

Lynn, Teresa, Jennifer Foster, Mark Dras and Elaine Uí Dhonnchadha, [Active Learning and the Irish Treebank](http://www.alta.asn.au/events/alta2012/proceedings/pdf/U12-1005.pdf), ALTA 2012, Dunedin, NZ, December 2012

Lynn, Teresa, Jennifer Foster, Mark Dras and Josef van Genabith, [Working with a small dataset — semi-supervised dependency parsing for Irish](http://www.nclt.dcu.ie/~tlynn/spmrl.pdf), SPMRL 2013, Seattle, USA, October 2013

Lynn, Teresa, Jennifer Foster, Mark Dras and Lamia Tounsi, [Cross-lingual Transfer Parsing for Low-Resourced Languages: An Irish Case Study](http://www.nclt.dcu.ie/~tlynn/CLTW.pdf), CLTW 2014, Dublin, Ireland, August 2014

Lynn, Teresa, [Irish Dependency Treebanking and Parsing](http://www.nclt.dcu.ie/~tlynn/Teresa_PhDThesis_final.pdf), PhD Thesis, Dublin City University, Ireland and Macquarie University, Sydney, Australia, 2016

Lynn, Teresa and Jennifer Foster, [Universal Dependencies for Irish](http://www.nclt.dcu.ie/~tlynn/Lynn_CLTW2016.pdf), CLTW 2016, Paris, France, July 2016

McGuinness, Sarah, Jason Phelan, Abigail Walsh and Teresa Lynn, Annotating MWEs in the Irish UD Treebank, In Proceedings of the Fourth Universal Dependencies Workshop, COLING 2020, Barcelona, Spain (to appear)

Stenson, N, 1981. Studies in Irish Syntax, Tübingen: Gunter Narr Verlag.

The Christian Brothers, New Irish Grammar, Dublin, Ireland: C.J. Fallon, March 1994

Uí Dhonnchadha, E. 2002. An Analyser and Generator for Irish Inflectional Morphology using Finite State Transducers, School of Computing, Dublin City University: Unpublished MSc Thesis.

Uí Dhonnchadha, E. 2009. Part-of-Speech Tagging and Partial Parsing for Irish using Finite-State Transducers and Constraint Grammar (PhD thesis)

Changelog

15-05-2015 (v1.1)

30-10-2015 (v1.2)

31-10-2015 (v1.2)

15-02-2017 (v2.0)

15-04-2018 (v2.2)

01-11-2018 (v2.2)

30-04-2019 (v2.4)

31-10-2019 (v2.5)

29-04-2020 (v2.6)

cleanup of v2.5 trees (sentences 1-1763)

30-10-2020 (v2.7)

29-04-2021 (v2.8)

Summary of changes:

Notable changes in annotation choices

Removing inconsistencies:

Metadata

=== Machine-readable metadata (DO NOT REMOVE!) ================================
Includes text: yes
Lemmas: manual native
UPOS: manual native
XPOS: manual native
Features: automatic with corrections
Relations: manual native
Data available since: UD v1.0
License: CC BY-SA 3.0
Genre: news fiction web legal government
Contributors: Lynn, Teresa; Foster, Jennifer; McGuinness, Sarah; Walsh, Abigail; Phelan, Jason; Scannell, Kevin
Contributing: elsewhere
Contact: teresa.lynn@adaptcentre.ie; jennifer.foster@dcu.ie
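
The metadata block is meant to be read by scripts. As a rough illustration, the sketch below collects its `Key: value` pairs into a dictionary, assuming the block starts at the `=== Machine-readable metadata` line, holds one field per line, and ends at the next line beginning with `===` or at the end of the file. The function name `parse_ud_metadata` is hypothetical and this is not the parser used by the UD release infrastructure.

```python
from typing import Dict


def parse_ud_metadata(readme_text: str) -> Dict[str, str]:
    """Collect 'Key: value' fields from a README's machine-readable block."""
    fields: Dict[str, str] = {}
    in_block = False
    for line in readme_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("==="):
            if "Machine-readable metadata" in stripped:
                in_block = True   # header line opens the block
            elif in_block:
                break             # any later "===" line closes it
            continue
        if in_block and ":" in stripped:
            # Split on the first colon only, so values may contain colons.
            key, value = stripped.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical usage; the README path is an assumption:
# with open("README.md", encoding="utf-8") as handle:
#     meta = parse_ud_metadata(handle.read())
# contributors = [name.strip() for name in meta["Contributors"].split(";")]
```

Splitting multi-valued fields such as `Contributors` or `Contact` on `;` is left to the caller, since `Genre` uses spaces as its separator instead.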