UniversalDependencies / UD_Latvian-LVTB

Creative Commons Attribution Share Alike 4.0 International

Summary

Latvian UD Treebank is based on the Latvian Treebank (LVTB), which is being created at the Artificial Intelligence Laboratory of the Institute of Mathematics and Computer Science, University of Latvia.

Introduction

Latvian UD Treebank v2.14 consists of 18,850 sentences (317,369 tokens). It has been obtained by automatic conversion of both the morphological and the syntactic annotations of the original LVTB treebank. LVTB data contains manually verified syntactic annotation according to a hybrid dependency-constituency schema, as well as manually verified morphological tags and lemmas. LVTB has been released in parallel with Latvian UD Treebank since v2.2 and uses the same version numbers. The corresponding LVTB versions are listed here. The LvtbNodeId key in the MISC field of the Latvian UD Treebank CoNLL-U files provides the mapping from Latvian UD Treebank to LVTB. Each LVTB version is a superset of the corresponding Latvian UD Treebank version in terms of included sentences.
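
As an illustration of the LvtbNodeId mapping, the following minimal sketch reads the MISC column of CoNLL-U token lines and collects the LvtbNodeId value of each token. The function names `parse_misc` and `lvtb_node_ids` are hypothetical helpers, and the sample token lines and node identifiers are placeholders rather than real treebank data.

```python
def parse_misc(misc: str) -> dict:
    """Split a CoNLL-U MISC value such as 'A=1|B=2' into a dict."""
    if misc == "_":  # "_" marks an empty MISC field
        return {}
    pairs = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def lvtb_node_ids(conllu_text: str) -> list:
    """Return (token ID, form, LvtbNodeId) for every token line that has the key."""
    rows = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        node_id = parse_misc(cols[9]).get("LvtbNodeId")
        if node_id is not None:
            rows.append((cols[0], cols[1], node_id))
    return rows


# Placeholder sample: forms and LvtbNodeId values are invented for illustration.
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "\t".join(["1", "Labdien", "labdien", "INTJ", "_", "_", "0", "root", "_",
               "LvtbNodeId=example-node-1|SpaceAfter=No"]),
    "\t".join(["2", "!", "!", "PUNCT", "_", "_", "1", "punct", "_",
               "LvtbNodeId=example-node-2"]),
    "",
])

for token_id, form, node_id in lvtb_node_ids(SAMPLE):
    print(token_id, form, node_id)
```

The returned identifiers can then be looked up in the LVTB release with the same version number, on the assumption that the LVTB files expose the same node identifiers.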

Acknowledgments

This work was supported by the European Regional Development Fund under the grant agreement No. 1.1.1.1/16/A/219 (Full Stack of Language Resources for Natural Language Understanding and Generation in Latvian) in synergy with the grant agreement No. 1.1.1.2/VIAA/1/16/188. The pilot project was supported by the State Research Programme "National Identity". The work was continued within the State Research Programme "Digital Resources for the Humanities" under the grant agreement No. VPP-IZM-DH-2020/1-0001, and is now being continued within the State Research Programme "Research on Modern Latvian Language and Development of Language Technology" under the grant agreement No. VPP-LETONIKA-2021/1-0006.

Licensing

This data set is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

By using this data set, you agree to comply with the European Intellectual Property Rights and the European General Data Protection Regulation.

Please let us know if you use this data set for product or service development.

Data splits

The training data covers various text types: news, fiction, academic texts, legal texts, transcripts of spoken language, etc. The development and test sets are carefully split out to cover all those types.

Train: 14358 sentences
Dev: 2080 sentences
Test: 2412 sentences
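
The split sizes can be checked against the total of 18850 sentences stated in the introduction. The sketch below does that arithmetic and includes a hypothetical helper, `count_sentences`, that counts sentences in CoNLL-U text so the same check can be repeated on the released files. It assumes sentences are separated by blank lines, as in standard CoNLL-U; the tiny sample is a placeholder, not treebank data.

```python
# Split sizes as listed in this README.
SPLITS = {"train": 14358, "dev": 2080, "test": 2412}
STATED_TOTAL = 18850  # sentence count given in the introduction

total = sum(SPLITS.values())
shares = {name: round(100 * n / total, 1) for name, n in SPLITS.items()}


def count_sentences(conllu_text: str) -> int:
    """Count blank-line-separated blocks that contain at least one token line."""
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if line.strip() == "":
            in_sentence = False  # a blank line closes the current sentence
        elif not line.startswith("#") and not in_sentence:
            count += 1  # first token line of a new sentence
            in_sentence = True
    return count


# Placeholder text with two sentences, used only to exercise the helper.
TINY = "# sent_id = a\n1\tx\n2\ty\n\n# sent_id = b\n1\tz\n\n"

print(total == STATED_TOTAL, shares, count_sentences(TINY))
```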

Changelog

2024-05-15 v2.14

2023-11-15 v2.13

2023-05-15 v2.12

2022-11-15 v2.11

2022-05-15 v2.10

2021-11-15 v2.9

2021-05-15 v2.8

2020-05-15 v2.6

2019-11-15 v2.5

2019-05-15 v2.4

2018-11-15 v2.3

2018-04-15 v2.2

2017-11-15 v2.1

2017-02-15 v2.0

2016-11-15 v1.4

=== Machine-readable metadata =================================================
Data available since: UD v1.3
License: CC BY-SA 4.0
Includes text: yes
Genre: news fiction legal spoken academic
Lemmas: manual native
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Pretkalniņa, Lauma; Rituma, Laura; Saulīte, Baiba; Nešpore-Bērzkalne, Gunta; Grūzītis, Normunds
Contributing: elsewhere
Contact: lauma@ailab.lv, normunds@ailab.lv
===============================================================================
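
The metadata block above is a list of `Key: value` lines between two delimiter lines. A minimal sketch of reading it into a dictionary follows; `parse_ud_metadata` is a hypothetical helper, not part of any UD tooling, and the sample repeats only some of the lines from the block.

```python
def parse_ud_metadata(block: str) -> dict:
    """Parse 'Key: value' lines, skipping the '===' delimiter lines."""
    meta = {}
    for line in block.splitlines():
        if line.startswith("===") or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


# A few lines copied from the metadata block of this README.
SAMPLE_BLOCK = """\
=== Machine-readable metadata ===
Data available since: UD v1.3
License: CC BY-SA 4.0
Genre: news fiction legal spoken academic
Lemmas: manual native
===
"""

meta = parse_ud_metadata(SAMPLE_BLOCK)
genres = meta["Genre"].split()  # genre labels are whitespace-separated
print(meta["License"], genres)
```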