acoli-repo / germhist

Materials for studying syntax in historical German

Syntactic analysis of historical German

In this repository, we provide a number of corpora of historical German as well as example workflows for evaluating them against specific research questions. To compensate for the lack of annotated data for Old High German, corpora from other older Germanic languages (Old English, Old Saxon, Gothic, Old Norse) are included here, too.

Primary contributions are:

Table of Contents

ACoLi corpora

ReM/full_corpus/: ReM Treebank / Baumbank Mittelhochdeutsch

External corpora

TCodex/: Tatian Corpus of Deviating Examples

DDB/: Deutsche Diachrone Baumbank

- Old High German (9th c.), Middle High German (13th c.), Early New High German (16th c.)
- 8500 tokens *in total*
- annotation layers:
  - POS, PARSE, MORPH: manually
  - LEMMA: manually (OHG only)
- note: the data is marginal in size and the technical quality is poor. The TIGER export for MHG and ENHG is broken: what is provided instead is an Exmaralda file without syntax annotation. The PAULA export is fine for the original TIGER data (all languages), but merging failed because the original Exmaralda files point to a nonexistent token file (these files contain lemmatization only, however).
- the data is excluded from the scrambling evaluation because it contains only two examples of ditransitive verbs with postverbal nominal arguments

ReF/: ReF Treebank / Referenzkorpus Frühneuhochdeutsch

ENHG/: Early New High German Treebank by Caitlin Light

Mercurius/: Mercurius Treebank

fuerstinnen/: Fuerstinnenkorrespondenz 1.1

GerManC/: GerManC corpus

UD/: UD corpora

Note that none of these corpora are balanced.

tuebadz/: Tueba-D/Z

Corpora of other older Germanic languages

This is external data that helps to illuminate the syntax of older West Germanic. Unfortunately, syntactically annotated corpora for Old High German are too sparse, and the witnesses themselves are heavily influenced by (i.e., often literal translations of) Latin, so that external evidence from related language varieties is needed to confirm observations over the sparse OHG data.

helipad/: HeliPaD corpus

This is a corpus of Old Saxon (Old Low German) based on a single text, the Heliand, which is the main witness of the Old Saxon language.

YCOE/: The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE)

Note that for legal reasons, we provide neither the corpus nor a build script, but an analysis workflow only, as well as its results. For replication, please acquire the corpus.

iswoc/: ISWOC corpus, Old English subcorpus

Syntactically annotated open source edition of 5 major OE texts. Overlaps with YCOE, but uses a different schema. Pipeline and query identical to PROIEL.

- Old English
- 28,000 tokens (different genres)
- annotation layers:
  - POS, INFL, LEMMA: semiautomated (?)
  - HEAD/EDGE: semiautomated, according to the PROIEL/ISWOC schema

icepahc/: IcePaHC v.0.9

A corpus of historical Icelandic. In the analysis, we work with the Old Icelandic subset (texts prior to 1500):

- Old Norse (Old Icelandic), (12th-16th c.)
- 0.4 million tokens (balanced)
- annotation layers:
  - POS: semiautomated (?)
  - PARSE: semiautomated (?); annotations and extraction correspond to those of ENHG

proiel/: PROIEL, Gothic subcorpus

Syntactically annotated edition of the Wulfila Bible.

- Gothic (second half of the 4th c.)
- 56,000 tokens (biblical, but translated from Greek)
- annotation layers:
  - POS, INFL, LEMMA: semiautomated (?)
  - HEAD/EDGE: semiautomated, according to the PROIEL/ISWOC schema

We provide the workflow for Gothic in analysis/scrambling, but we excluded its results: its word order preferences are virtually identical to those of the Greek NT (which we studied for comparison). For postverbal nominal accusative and dative arguments, both show practically the same preference for ACC>DAT (75.6% = 34:11 in the Greek NT, 76.9% = 20:6 in the Gothic NT; the small difference may be due to the Gothic NT being incomplete).
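The percentages quoted for the Gothic and Greek NT follow directly from the raw counts. The following is a minimal sketch of that arithmetic only; the function name `acc_dat_share` and the dictionary layout are illustrative assumptions and are not part of the repository's workflow.

```python
def acc_dat_share(acc_first: int, dat_first: int) -> float:
    """Percentage of clauses with ACC>DAT order among all counted clauses."""
    total = acc_first + dat_first
    return 100.0 * acc_first / total

# Counts (ACC>DAT : DAT>ACC) as reported in the text above.
counts = {"Greek NT": (34, 11), "Gothic NT": (20, 6)}

# Hypothetical helper structure: share per corpus, rounded to one decimal.
shares = {name: round(acc_dat_share(a, d), 1) for name, (a, d) in counts.items()}

for name, share in shares.items():
    print(f"{name}: ACC>DAT in {share}% of cases")
```

With only 45 and 26 examples respectively, a difference of about one percentage point corresponds to a fraction of a single example, which is why the two distributions are treated as equivalent here.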

Analyses

analyses/: Case studies over that data

We demonstrate the application of Fintan (resp., CoNLL-RDF and related technologies integrated in Fintan) for complex search and retrieval tasks in real-world research questions in linguistics and the philologies.

Other candidate corpora