UniversalDependencies / UD_Swedish-LinES

Summary

UD Swedish_LinES is the Swedish half of the LinES Parallel Treebank with UD annotations. All segments are translations from English and the sources cover literary genres, online manuals and Europarl data.

Introduction

UD Swedish_LinES is the Swedish half of the LinES Parallel Treebank with UD annotations. All segments are translations of the corresponding English segments found in the UD English_LinES treebank. The original dependency annotation was first automatically converted to Universal Dependencies and then partially reviewed (Ahrenberg, 2015). In January-February 2017 it was converted to UD version 2 and again reviewed for errors. Lemmata and morphological features were added in version 2.1.

The treebank is being developed continuously.

Acknowledgements

Three of the source texts were collected as part of the Linköping Translation Corpus (Merkel, 1999). The treebank was first developed in the project 'Micro- and macro-level analysis of translations' funded by the Swedish Research Council (Ahrenberg, 2007).

Details on the sources

All sub-corpora have English originals with Swedish translations. Six of them are literary works:

Paul Auster: Stad av glas [City of Glass], Tiden, 1995. Translation by Ulla Roseen.

Saul Bellow: Jerusalem tur och retur [To Jerusalem and back: a personal account], Bonniers, 1977. Translation by Caj Lundgren.

Joseph Conrad: Mörkrets hjärta [Heart of darkness], Wahlström & Widstrand, Stockholm, 1983. Translation by Margaretha Odelberg.

Nadine Gordimer: Hedersgästen [A Guest of Honour], Bonniers. Translation by Magnus K:son Lindberg.

J. K. Rowling: Harry Potter och Hemligheternas kammare [Harry Potter and the Chamber of Secrets], Tiden, 2001. Translation by Lena Fries-Gedin.

Jeanette Winterson: Vintergatan går genom magen [Gut Symmetries], Bakhåll, 2017. Translation by Ulla Roseen.

In addition, the corpus includes segments from Microsoft Access 2002 Online Help and the Swedish part of the Europarl corpus (v.7).

DATA SPLITS

For version 2.0, about 20% of the trees were randomly selected as the test set, 20% as the development set, and the rest as the training set. This partitioning has remained the same since then.
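As an illustration only, a random 20/20/60 partition of this kind could be sketched as follows. This is not the procedure actually used for the treebank: the function name `split_indices`, the seed, and the rounding are assumptions made for the example. Because it returns tree positions rather than trees, the same index sets could be applied to any parallel list of trees without disturbing the order within each part.

```python
import random


def split_indices(n_trees, test_frac=0.2, dev_frac=0.2, seed=0):
    """Randomly assign tree positions to train/dev/test (hypothetical helper).

    Returns three sorted lists of 0-based positions, so the original
    order of the trees is preserved inside each part.
    """
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability
    order = list(range(n_trees))
    rng.shuffle(order)

    n_test = round(n_trees * test_frac)
    n_dev = round(n_trees * dev_frac)

    test = sorted(order[:n_test])
    dev = sorted(order[n_test:n_test + n_dev])
    train = sorted(order[n_test + n_dev:])
    return train, dev, test


train, dev, test = split_indices(4564)
print(len(train), len(dev), len(test))
```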

The partition applies in the same way to the English trees so that the order of corresponding trees is the same in the English and Swedish LinES files. The files are named

BASIC STATISTICS

Tree count: 4564
Word count: 79812
Token count: 79812
Dep. relations: 40 of which 7 are language-specific
POS tags: 17
Category=value feature pairs: 0
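Counts of this kind can be recomputed from the CoNLL-U files. The sketch below is a minimal example, assuming standard CoNLL-U input (tab-separated columns, blank line between trees, `#` comment lines); the function name `conllu_stats` and the two-sentence sample are invented for illustration and are not taken from the treebank. Words and tokens only differ when a file contains multi-word token ranges such as `3-4`, which is consistent with the identical word and token counts reported above.

```python
def conllu_stats(text):
    """Return (trees, words, tokens) for a string of CoNLL-U data."""
    trees = words = tokens = 0
    in_sentence = False
    range_end = 0  # last word ID covered by the current multi-word token

    for line in text.splitlines():
        if not line.strip():          # blank line ends the current tree
            in_sentence = False
            range_end = 0
            continue
        if line.startswith("#"):      # sentence-level comment
            continue

        token_id = line.split("\t", 1)[0]
        if not in_sentence:
            trees += 1
            in_sentence = True

        if "." in token_id:           # empty node (enhanced graph), not counted
            continue
        if "-" in token_id:           # multi-word token range, e.g. "3-4"
            tokens += 1
            range_end = int(token_id.split("-")[1])
            continue

        words += 1
        if int(token_id) > range_end:  # word not covered by a range line
            tokens += 1

    return trees, words, tokens


# Invented sample, not taken from the treebank.
SAMPLE = (
    "# text = Hon läser .\n"
    "1\tHon\thon\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tläser\tläsa\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Vi går hem .\n"
    "1\tVi\tvi\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tgår\tgå\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\them\them\tADV\t_\t_\t2\tadvmod\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

print(conllu_stats(SAMPLE))  # (2, 7, 7)
```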

TOKENIZATION

The tokenization is largely based on whitespace, but punctuation marks except word-internal hyphens are treated as separate tokens. The original file also has several multi-word tokens, but these are separated in the UD version with all parts except the first assigned the UD dependency function 'fixed'. No tokens have internal blanks.
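Since the parts of an original multi-word token are linked by the `fixed` relation, the expressions can be reassembled from the dependency columns. The following is a rough sketch under the assumption that every `fixed` dependent attaches directly to the first word of its expression, as the UD guidelines prescribe; the function name `fixed_expressions` and the sample sentence are invented for illustration.

```python
from collections import defaultdict


def fixed_expressions(sentence_lines):
    """Return the fixed expressions of one CoNLL-U sentence as strings."""
    forms = {}                 # word ID -> surface form
    parts = defaultdict(list)  # ID of first word -> IDs of its 'fixed' dependents

    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():   # skip range lines and empty nodes
            continue
        idx = int(cols[0])
        forms[idx] = cols[1]        # FORM column
        if cols[7] == "fixed":      # DEPREL column
            parts[int(cols[6])].append(idx)  # HEAD column

    return [
        " ".join(forms[i] for i in [head] + sorted(deps))
        for head, deps in sorted(parts.items())
    ]


# Invented sample, not taken from the treebank.
SENTENCE = [
    "# text = Vi stannade på grund av regnet .",
    "1\tVi\tvi\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tstannade\tstanna\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tpå\tpå\tADP\t_\t_\t6\tcase\t_\t_",
    "4\tgrund\tgrund\tNOUN\t_\t_\t3\tfixed\t_\t_",
    "5\tav\tav\tADP\t_\t_\t3\tfixed\t_\t_",
    "6\tregnet\tregn\tNOUN\t_\t_\t2\tobl\t_\t_",
    "7\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
]

print(fixed_expressions(SENTENCE))  # ['på grund av']
```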

MORPHOLOGY

The morphological annotation in the FEATS column is copied from the UD_Swedish treebank where overlaps occur. For other tokens it has been converted from the morphological information in the original treebank (found in the XPOS column). Nouns are annotated for case, number, definiteness (species) and gender. Verbs are annotated for mood, verb form, tense and voice (diathesis); adjectives for case, degree, definiteness, and number. Pronouns are sub-divided in the morphological description into Personal, Demonstrative, Interrogative, Indefinite, Relative, Total, and Expletive, and are annotated for Case and Number when relevant.
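In CoNLL-U, morphological features are stored in the sixth column as `Name=Value` pairs separated by `|`, with `_` for an empty set. A small sketch of reading such a value into a dictionary is given below; the helper name `parse_feats` is hypothetical, and the example feature string is a plausible noun annotation chosen for illustration rather than a line quoted from the treebank.

```python
def parse_feats(feats):
    """Turn a CoNLL-U feature string into a dict (hypothetical helper)."""
    if not feats or feats == "_":   # "_" marks an empty feature set
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Illustrative value for a definite neuter singular noun (assumed, not quoted).
noun = parse_feats("Case=Nom|Definite=Def|Gender=Neut|Number=Sing")
print(noun["Definite"])  # Def
```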

The mapping from language-specific part-of-speech tags to universal tags was done automatically. There are no other tags than universal tags, but there may be errors.

SYNTAX

The syntactic annotation in the Swedish UD treebank follows the general guidelines but adds some language-specific relations:

The syntactic annotation was first automatically converted from the original LinES annotation scheme as described in Ahrenberg (2015). After conversion to UD version 2.0 the analyses have been reviewed again. Occasional deviations from the guidelines may remain.

REFERENCES

Lars Ahrenberg, 2007. LinES: An English-Swedish Parallel Treebank. Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA, 2007).

Lars Ahrenberg, 2015. Converting an English-Swedish Parallel Treebank to Universal Dependencies. Proceedings of the Third International Conference on Dependency Linguistics (DepLing 2015), Uppsala, August 24-26, 2015, pp. 10-19. ACL Anthology W15-2103.

Magnus Merkel, 1999. Understanding and enhancing translation by parallel text processing. Linköping Studies in Science and Technology, Dissertation No. 607.

Changelog

From version 1.3 to version 2.0 the following changes have been made:

--- Machine readable metadata ---

Data available since: UD v1.3
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: fiction nonfiction spoken
Lemmas: converted from manual
UPOS: converted with corrections
XPOS: manual native
Features: automatic
Relations: converted with corrections
Contributors: Ahrenberg, Lars
Contributing: elsewhere
Contact: lars.ahrenberg@liu.se