UniversalDependencies / UD_Czech-CAC

Data from the Czech Academic Corpus.

Summary

The UD_Czech-CAC treebank is based on the Czech Academic Corpus 2.0 (CAC; Český akademický korpus; ČAK), created at Charles University in Prague.

Introduction

CAC consists of both written data and transcripts of spoken language. Only the written part is included in this treebank, as no syntactic annotation is available for the spoken data. Of the 650,000 tokens in CAC, 493,306 appear in the treebank.

The first version of CAC was created in 1971-1985 by a team from the Institute of the Czech Language, Czechoslovak Academy of Sciences, led by Marie Těšitelová; its original name was “Korpus věcného stylu”. It was reshaped and made compatible with the Prague Dependency Treebank between 2007 (CAC 1.0) and 2008 (CAC 2.0); these corpora are distributed by the Linguistic Data Consortium. The corpus has now been converted to Universal Dependencies and made freely available under a Creative Commons license (see LICENSE.txt).

See the following websites for more information on CAC 2.0:

CAC contains mostly unabridged articles taken from a wide range of media. The articles come from newspapers, magazines and other sources, and cover three genres: administrative, journalistic and scientific. These three genres can be distinguished by the sentence id: in

# sent_id = a-s20w-s55

the "s20w" part identifies the source document, where "w" means "written", "20" is the document id number and "s" means scientific (while "a20w" is the twentieth document from the administrative genre, and "n20w" from newspapers).
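The id scheme described above can be decoded mechanically. The following is only a minimal sketch in Python, not part of the treebank tooling: the function name `genre_of`, the genre labels and the regular expression are assumptions made for illustration, based solely on the description above (a genre letter `a`, `n` or `s`, the document number, then `w` for "written").

```python
import re

# Genre letters as described in this README; the label strings are illustrative.
GENRES = {"a": "administrative", "n": "newspapers", "s": "scientific"}

# A document code is a genre letter, a document number and "w" (written),
# not embedded in a longer alphanumeric run. This pattern is an assumption.
DOC_CODE = re.compile(r"(?<![A-Za-z0-9])([ans])(\d+)w(?![A-Za-z0-9])")


def genre_of(sent_id):
    """Return (genre, document number) for a sentence id, or None if no
    document code of the assumed shape is found."""
    match = DOC_CODE.search(sent_id)
    if match is None:
        return None
    letter, number = match.groups()
    return GENRES[letter], int(number)


print(genre_of("a-s20w-s55"))
```

The pattern searches for the document code anywhere in the id rather than assuming a fixed position, so it does not depend on what precedes or follows the code.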

The texts date from the 1970s and 1980s, so the selection of texts is influenced by the political and cultural climate of that period.

The original data in “Korpus věcného stylu” omits all punctuation symbols, numbers (only those expressed in digits; numeral words are preserved) and some units of measurement and symbols. The missing tokens were manually restored in CAC; however, for numbers and units, only a wildcard was inserted, as the exact value and form cannot be guessed without access to the primary document.

Acknowledgments

We wish to thank all of the contributors to the original annotation effort, as well as the team responsible for the corpus' revival in 2008.

References

Changelog

=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v1.3
License: CC BY-SA 4.0
Includes text: yes
Genre: news nonfiction legal reviews medical
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Hladká, Barbora; Zeman, Daniel
Contributing: elsewhere
Contact: zeman@ufal.mff.cuni.cz

Original CAC authors: Hladká, Barbora; Hajič, Jan; Hana, Jiří; Hlaváčová, Jaroslava; Mírovský, Jiří; Raab, Jan
Original KVS authors: Těšitelová, Marie