In this repository, we provide a number of corpora of historical German, together with example workflows for evaluating them against specific research questions. To compensate for the lack of annotated data for Old High German, corpora of other older Germanic languages (Old English, Old Saxon, Gothic, Old Norse) are included as well.
Primary contributions are:
- analyses/
- analyses/scrambling
ReM/full_corpus/: ReM Treebank / Baumbank Mittelhochdeutsch
TCodex/: Tatian Corpus of Deviating Examples
DDB/: Deutsche Diachrone Baumbank
- Old High German (9th c.), Middle High German (13th c.), Early New High German (16th c.)
- 8500 tokens *in total*
- annotation layers:
  - POS, PARSE, MORPH: manually
  - LEMMA: manually (OHG only)
- note: the data is marginal in size and of poor technical quality. The TIGER export for MHG and ENHG is broken; what is provided instead is an Exmaralda file without syntax annotation. The PAULA export is fine for the original TIGER data (all languages), but merging failed: the original Exmaralda files point to a nonexistent token file (these files contain lemmatization only, however).
- the data is excluded from the scrambling evaluation because it contains only two examples of ditransitive verbs with postverbal nominal arguments
ReF/: ReF Treebank / Referenzkorpus Frühneuhochdeutsch
ENHG/: Early New High German Treebank by Caitlin Light
Mercurius/: Mercurius Treebank
fuerstinnen/: Fuerstinnenkorrespondenz 1.1
GerManC/: GerManC corpus
UD/: UD corpora. Note that none of these corpora are balanced.
- HDT: Hamburg Dependency Treebank
- GSD: German legacy treebank
- LIT: German literary history
tuebadz/: Tueba-D/Z
See analysis/scrambling. The corpora listed next are external data that help to illuminate the syntax of older West Germanic. Syntactically annotated corpora for Old High German are too sparse, and the witnesses themselves are heavily influenced by Latin (they are often literal translations of it), so external evidence from related language varieties is needed to confirm observations made on the sparse OHG data.
helipad/: HeliPaD corpus
This is a corpus of Old Saxon (Old Low German), based on a single text only: the Heliand, the main witness of the Old Saxon language.
YCOE/: The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE)
Note that for legal reasons, we provide neither the corpus nor a build script, but only an analysis workflow and its results. For replication, please acquire the corpus.
iswoc/: ISWOC corpus, Old English subcorpus
Syntactically annotated open source edition of 5 major OE texts. Overlaps with YCOE, but uses a different schema. Pipeline and query are identical to PROIEL.
- Old English
- 28,000 tokens (different genres)
- annotation layers:
  - POS, INFL, LEMMA: semiautomated (?)
  - HEAD/EDGE: semiautomated, according to the PROIEL/ISWOC schema
icepahc/: IcePaHC v.0.9
A corpus of historical Icelandic. In the analysis, we work with the Old Icelandic subset (texts prior to 1500):
- Old Norse (Old Icelandic), (12th-16th c.)
- 0.4 million tokens (balanced)
- annotation layers:
  - POS: semiautomated (?)
  - PARSE: semiautomated (?); annotations and extraction correspond to those of ENHG
proiel/: PROIEL, Gothic subcorpus
Syntactically annotated edition of the Wulfila Bible.
- Gothic (second half of the 4th c.)
- 56,000 tokens (biblical, but translated from Greek)
- annotation layers:
  - POS, INFL, LEMMA: semiautomated (?)
  - HEAD/EDGE: semiautomated, according to the PROIEL/ISWOC schema
We provide the workflow for Gothic in analysis/scrambling, but we excluded its results: its word order preferences match those of the Greek NT (which we studied for comparison). For postverbal nominal accusative and dative arguments, both show virtually the same preference for ACC>DAT (75.6% = 34:11 in the Greek NT, 76.9% = 20:6 in the Gothic NT; the difference may be due to the Gothic NT being incomplete).
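The percentages quoted for the Greek and Gothic NT follow directly from the raw counts. A minimal Python sketch of that arithmetic, assuming that 34:11 and 20:6 are to be read as ACC>DAT : DAT>ACC counts (the function and variable names are illustrative and not part of the repository):

```python
# Counts of postverbal nominal argument orders as quoted in the text,
# read here as (ACC>DAT, DAT>ACC). This reading is an assumption.
counts = {
    "Greek NT": (34, 11),
    "Gothic NT": (20, 6),
}


def acc_first_share(acc_dat: int, dat_acc: int) -> float:
    """Return the share of ACC>DAT among all ACC/DAT pairs, in percent."""
    return 100 * acc_dat / (acc_dat + dat_acc)


# Round to one decimal place, as in the text.
shares = {name: round(acc_first_share(*pair), 1) for name, pair in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}% ACC>DAT")
```

With only 45 and 26 observations respectively, a gap of about one percentage point corresponds to less than a single example, which is in line with treating the two preferences as the same.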
analyses/: Case studies over that data
We demonstrate the application of Fintan (and of CoNLL-RDF and related technologies integrated in Fintan) to complex search and retrieval tasks arising from real-world research questions in linguistics and the philologies.
analyses/scrambling: Study of the diachronic word order of post-verbal oblique arguments
- HIPKON: Historisches Predigtenkorpus zum Nachfeld
- Historische Syntax des Jiddischen, Version 1.0
- Old Saxon corpus
- Old Swedish Corpora of Sprakbanken
(… deprel="" or deprel="__UNDEF__"). To be revisited.
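To make the kind of query behind the scrambling study more concrete, the following is a minimal sketch in plain Python. It is not the repository's workflow, which relies on Fintan and CoNLL-RDF. The `Token` fields, the part-of-speech tags, the case values, and the decision to skip tokens whose dependency label is empty or `__UNDEF__` are all assumptions made for illustration, and the example sentence is an invented Modern German toy, not corpus data.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical, simplified token record (not a repository data type)."""
    idx: int     # 1-based position in the sentence
    form: str
    upos: str    # coarse part of speech, e.g. "VERB", "NOUN"
    case: str    # "Acc", "Dat", or "" if not applicable
    head: int    # idx of the syntactic head, 0 for the root
    deprel: str  # dependency label


# Labels treated as undefined; skipping them is an assumption of this sketch.
UNDEFINED = {"", "__UNDEF__"}


def postverbal_orders(sentence):
    """Count ACC>DAT vs. DAT>ACC among nominal arguments that follow their verb."""
    orders = Counter()
    for verb in (t for t in sentence if t.upos == "VERB"):
        args = [
            t for t in sentence
            if t.head == verb.idx
            and t.deprel not in UNDEFINED
            and t.upos in {"NOUN", "PROPN"}
            and t.idx > verb.idx  # postverbal only
        ]
        acc = [t.idx for t in args if t.case == "Acc"]
        dat = [t.idx for t in args if t.case == "Dat"]
        # Only count verbs with exactly one accusative and one dative argument.
        if len(acc) == 1 and len(dat) == 1:
            orders["ACC>DAT" if acc[0] < dat[0] else "DAT>ACC"] += 1
    return orders


# Invented toy sentence: "Sie gab dem Mann das Buch"
toy = [
    Token(1, "Sie", "PRON", "Nom", 2, "nsubj"),
    Token(2, "gab", "VERB", "", 0, "root"),
    Token(3, "dem", "DET", "Dat", 4, "det"),
    Token(4, "Mann", "NOUN", "Dat", 2, "iobj"),
    Token(5, "das", "DET", "Acc", 6, "det"),
    Token(6, "Buch", "NOUN", "Acc", 2, "obj"),
]

print(postverbal_orders(toy))
```

Summing the resulting counters over all sentences of a corpus would give ratios of the kind reported for the individual corpora; how undefined labels should really be handled is left open here, as in the note above.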