UniversalDependencies / UD_Finnish-TDT

Finnish data

Summary

UD_Finnish-TDT is based on the Turku Dependency Treebank (TDT), a broad-coverage dependency treebank of general Finnish covering numerous genres. The conversion to UD was followed by extensive manual checks and corrections, and the treebank closely adheres to the UD guidelines.

Introduction

The treebank contains texts from Wikipedia articles, Wikinews articles, university online news, blog entries, student magazine articles, grammar examples, Europarl speeches, JRC-Acquis legislation, financial news, and fiction, drawn from 674 individual documents. The original annotation of the treebank was in Stanford Dependencies, including secondary dependencies, and included fully manually checked morphological annotation. The treebank is also accompanied by a PropBank annotation (http://turkunlp.github.io/Finnish_PropBank/) and a dependency parser pipeline that substantially outperforms the baseline UDPipe model (https://turkunlp.org/Turku-neural-parser-pipeline/).

Acknowledgments

The team behind the Turku Dependency Treebank: Katri Haverinen, Jenna Kanerva (Nyblom), Timo Viljanen, Veronika Laippala, Samuel Kohonen, Anna Missilä, Stina Ojala, Filip Ginter.

We are grateful for the funding received from:

We thank all the authors who kindly allowed us to include their texts in the treebank, either by explicit permission or by releasing their texts under an open license in the first place.

Cite

{% raw %}

@Article{haverinen2013tdt,
  Title                    = {Building the essential resources for {Finnish}: the {Turku Dependency Treebank}},
  Author                   = {Haverinen, Katri and Nyblom, Jenna and Viljanen, Timo and Laippala, Veronika and Kohonen, Samuel and Missil{\"a}, Anna and Ojala, Stina and Salakoski, Tapio and Ginter, Filip},
  Journal                  = {Language Resources and Evaluation},
  Year                     = {2014},
  Note                     = {Open access},
  Pages                    = {493--531},
  Volume                   = {48},

  Doi                      = {10.1007/s10579-013-9244-1},
  ISSN                     = {1574-020X},
  Issue                    = {3},
  Keywords                 = {Treebank; Finnish; Parsing; Morphology},
  Language                 = {English},
  Owner                    = {ginter},
  Publisher                = {Springer Netherlands},
  Timestamp                = {2013.08.15},
  Url                      = {http://dx.doi.org/10.1007/s10579-013-9244-1}
}

@InProceedings{pyysalo2015udfinnish,
  Title                    = {Universal {D}ependencies for {F}innish},
  Author                   = {Pyysalo, Sampo and Kanerva, Jenna and Missil{\"a}, Anna and Laippala, Veronika and Ginter, Filip},
  Booktitle                = {Proceedings of NoDaLiDa 2015},
  Year                     = {2015},
  Pages                    = {163--172},
  Publisher                = {NEALT},

  Url                      = {https://aclweb.org/anthology/W/W15/W15-1821.pdf}
}

{% endraw %}

Changelogs

The data has only seen small changes between the original 1.0 release and the current 1.1 release. These changes fix a small number of annotation problems noticed after the 1.0 release.

--- Machine readable metadata ---
Data available since: UD v1.0
License: CC BY-SA 4.0
Includes text: yes
Genre: news wiki blog legal fiction grammar-examples
Lemmas: manual native
UPOS: converted from manual
XPOS: converted from manual
Features: converted from manual
Relations: manual native
Contributors: Ginter, Filip; Kanerva, Jenna; Laippala, Veronika; Miekka, Niko; Missilä, Anna; Ojala, Stina; Pyysalo, Sampo
Contact: figint@utu.fi, jmnybl@utu.fi
Contributing: elsewhere
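
The metadata fields listed above are plain `Key: value` pairs, so they can be read with a few lines of code. The following is a minimal sketch, not an official UD tool: it assumes the fields are stored one per line, and the function name `parse_metadata` and the shortened `sample` text are hypothetical, chosen only for illustration.

```python
def parse_metadata(block: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dict, skipping separator lines."""
    fields = {}
    for line in block.splitlines():
        line = line.strip()
        # Skip blank lines and decorative separators such as '--- ... ---'.
        if not line or line.startswith(("---", "===")):
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Shortened sample, copied from the metadata fields of this README.
sample = """\
--- Machine readable metadata ---
Data available since: UD v1.0
License: CC BY-SA 4.0
Includes text: yes
Genre: news wiki blog legal fiction grammar-examples
"""

metadata = parse_metadata(sample)
genres = metadata["Genre"].split()  # the genre field is space-separated
print(metadata["License"])
print(genres)
```

Splitting on the first colon keeps the sketch robust for values that might themselves contain a colon, and the `Genre` value can be split on whitespace because its entries are single tokens.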