UniversalDependencies / UD_German-GSD

Severe errors in source texts in sentences from TIGER #12

Open adrianeboyd opened 6 years ago

adrianeboyd commented 6 years ago

There are major errors (words missing, punctuation in the wrong order) in sentences derived from the TIGER treebank. Something seems to have gone terribly wrong in pre-processing steps or in the dependency conversion process.

Looking for nearly identical sentences in UD and TIGER, I can find around 1450 sentences that appear to come from TIGER, and about 740 of these contain errors in the source text. Approximately 310 concern only punctuation, while the remaining ~430 additionally involve missing sentence-initial words.

The errors are not equally distributed across the UD subcorpora. They affect 2% of the training corpus, 17% of the development corpus, and 29% (!) of the test corpus. A full half of the sentences with missing initial words are in the test corpus, which means that 22% of sentences in the test corpus are missing the first word in the sentence.

The problems are described in detail below, but here is a quick summary of the distribution of errors:

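| Subcorpus | Punctuation Only | Missing Words (and maybe also Punct.) |
|-----------|------------------|----------------------------------------|
| Train     | 162              | 131                                    |
| Dev       | 80               | 85                                     |
| Test      | 63               | 220                                    |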
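
As a quick sanity check, the totals and the "full half" claim can be recomputed from the table alone. This is only arithmetic on the counts listed above; the percentages per subcorpus are not recomputed here because the subcorpus sizes are not given in this issue.

```python
# Counts copied from the table above: (punctuation only, missing words).
counts = {"Train": (162, 131), "Dev": (80, 85), "Test": (63, 220)}

punct_total = sum(p for p, _ in counts.values())
missing_total = sum(m for _, m in counts.values())

# Share of all missing-word sentences that fall into the test corpus.
test_share = counts["Test"][1] / missing_total

print(punct_total, missing_total, punct_total + missing_total)  # 305 436 741
print(f"{test_share:.1%}")  # 50.5%
```

The totals (305, 436, 741) line up with the rounded figures of roughly 310, 430 and 740 given in the introduction.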

Because the errors involve missing and misordered tokens, fixing things would require a fair amount of reannotation. I don't know what is reasonable to do/expect within the constraints of the UD project and obviously some annotation errors and noise are expected in any corpus, but this seems egregious, especially to this degree in the dev/test corpora.

These kinds of artificially ill-formed sentences do not really seem to be representative of German, which is concerning when the UD corpora are being used more and more for development and evaluation. I would at a minimum propose marking the problematic sentences somehow, especially the ones with missing words, so that developers can exclude them as desired.
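
Until the sentences are marked in the treebank itself, a list of affected sentence IDs could be used to exclude them on the user side. The following is a minimal sketch, assuming standard CoNLL-U input in which each sentence carries a `# sent_id = ...` comment and sentences are separated by blank lines. The function name `drop_sentences` and the idea of passing a set of IDs are illustrative assumptions, not part of any UD tooling.

```python
def drop_sentences(conllu_text, bad_ids):
    """Return CoNLL-U text without the sentences whose sent_id is in bad_ids."""
    kept = []
    # Sentences in CoNLL-U are separated by a blank line.
    for block in conllu_text.strip().split("\n\n"):
        sent_id = None
        for line in block.splitlines():
            if line.startswith("# sent_id = "):
                sent_id = line[len("# sent_id = "):].strip()
                break
        if sent_id not in bad_ids:
            kept.append(block)
    # Every sentence, including the last one, ends with a blank line.
    return "\n\n".join(kept) + "\n\n"
```

The set of IDs could be taken from the second column of the attached summary, for example only the rows in the categories with missing words.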

Problems

The problems I've found:

  1. The first token is missing in many sentences
  2. The order of adjacent sentence-internal punctuation tokens is reversed
  3. Hyphens from compounds have been converted to --
  4. ASCII double quote " (a character that does not appear in TIGER*) is added at the beginning and/or end of a full sentence (where often the first word is missing, too) or appears as a normalization of `` sentence-initially (almost exclusively in the test corpus)
  5. Ordinal numbers are split incorrectly (or at the very least inconsistently) into two tokens (e.g., 22 . Oktober)
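
Problems 3 and 4 leave visible traces in the token sequence, so they can be flagged without access to TIGER. The sketch below is a rough heuristic under that assumption, not the procedure used for the counts in this issue: it flags any ASCII double quote token and any `--` token that sits directly between two tokens with alphanumeric edges. It may over-flag genuine dashes between words, and it cannot see problems 1, 2 or 5. The function and flag names are made up for illustration.

```python
def surface_flags(tokens):
    """Flag surface symptoms of problems 3 and 4 in a list of tokens."""
    flags = set()
    for i, tok in enumerate(tokens):
        # Problem 4: ASCII double quote, which should not occur in TIGER-derived text.
        if tok == '"':
            flags.add("ascii_quote")
        # Problem 3: "--" standing where a compound hyphen used to be.
        if (tok == "--" and 0 < i < len(tokens) - 1
                and tokens[i - 1][-1].isalnum()
                and tokens[i + 1][0].isalnum()):
            flags.add("double_hyphen_in_compound")
    return flags


# test-s544 as quoted in the examples below
ud = '" verletzt wurde eine Korrespondentin des deutschen ARD -- Fernsehens .'.split()
print(sorted(surface_flags(ud)))  # ['ascii_quote', 'double_hyphen_in_compound']
```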

Examples

Here is a sentence that shows problems 1-3 (train-s2181):

Chef Andy Grove sieht die größte Herausforderung darin `` , alles zu 
tun , um die Zahl der Nutzer in der PC -- Welt zu steigern '' .

The original sentence from TIGER is:

Ihr Chef Andy Grove sieht die größte Herausforderung darin , `` alles zu 
tun , um die Zahl der Nutzer in der PC-Welt zu steigern '' .

Here is another sentence (test-s544) with problems 3-4:

" verletzt wurde eine Korrespondentin des deutschen ARD -- Fernsehens .

And the original TIGER sentence:

Leicht verletzt wurde eine Korrespondentin des deutschen ARD-Fernsehens .

As you would expect, the missing words often lead to sentences that are not well-formed (test-s545):

" Verletzungen können Zahlungen und Handelserleichterungen künftig 
ausgesetzt werden .

Instead of:

Bei Verletzungen können Zahlungen und Handelserleichterungen künftig 
ausgesetzt werden .

(Verletzungen is annotated as dep and has no morphological features.)

And the even more entertaining (test-s374):

deutscher Touristin muß lebenslang in Haft

Instead of:

Mörder deutscher Touristin muß lebenslang in Haft

(Touristin is indeed annotated as nsubj, deutscher and Touristin are Case=Nom, and deutscher is somehow Degree=cmp,pos!)

Detailed Results

After normalizing emdash "--" vs. "-", ignoring cases that result in matched rather than mismatched quotes, and skipping full sentences that were merely embedded in longer sentences, I have found:

I've attached a summary of the mismatches with the following columns:

  1. Category / Degree (0: only punctuation, 1: punctuation + missing sentence-initial text of fewer than ~25 letters, 2: punctuation + missing sentence-initial text of more than ~25 letters, 3: other overlap)
  2. UD Sentence ID
  3. UD Tokens
  4. TIGER Tokens (hyphenated compounds remain unsplit)

I've removed a number of cases by hand that were accidentally caught by my simple heuristics or that didn't seem problematic (typically a full sentence from a quote within a longer sentence, with an initial list numbering or dash, or with an intro like Auch: or FR: or Richter: or Klartext:). I've left a few cases in categories 1-3 where there are differences in punctuation within an embedded sentence (so they are more like category 0 in effect, which is reflected in the counts in the table in the introduction). I would not be surprised if there are still some errors in this list, either cases that are not problematic or cases from TIGER that I didn't detect.
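
For anyone who wants to reproduce something similar, here is a minimal sketch of how a UD/TIGER sentence pair could be sorted into the categories listed above. It is a reconstruction under assumptions and not the exact heuristics behind the attached file: the punctuation set `PUNCT`, the helper names, and the rule "UD words form a suffix of the TIGER words" are all illustrative choices. Only the ~25-letter threshold and the `--` vs. `-` normalization come from the description above.

```python
# Assumed set of punctuation tokens; extend as needed.
PUNCT = {",", ".", ":", ";", "!", "?", "``", "''", '"', "(", ")", "-", "--"}


def join_compounds(tokens):
    """Rejoin 'PC -- Welt' as 'PC-Welt' so UD tokens line up with TIGER."""
    out = []
    i = 0
    while i < len(tokens):
        if (tokens[i] == "--" and out and i + 1 < len(tokens)
                and out[-1][-1].isalnum() and tokens[i + 1][0].isalnum()):
            out[-1] = out[-1] + "-" + tokens[i + 1]
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def categorize(ud_tokens, tiger_tokens, threshold=25):
    """Return None (no mismatch) or a category 0-3 for a sentence pair."""
    ud = join_compounds(ud_tokens)
    if ud == tiger_tokens:
        return None
    ud_words = [t for t in ud if t not in PUNCT]
    tiger_words = [t for t in tiger_tokens if t not in PUNCT]
    if ud_words == tiger_words:
        return 0  # same words, so only punctuation differs
    n = len(ud_words)
    if 0 < n < len(tiger_words) and tiger_words[-n:] == ud_words:
        # UD lacks a sentence-initial stretch; measure it in letters.
        missing_letters = sum(len(w) for w in tiger_words[:-n])
        return 1 if missing_letters < threshold else 2
    return 3  # some other kind of overlap


ud = '" verletzt wurde eine Korrespondentin des deutschen ARD -- Fernsehens .'.split()
tiger = "Leicht verletzt wurde eine Korrespondentin des deutschen ARD-Fernsehens .".split()
print(categorize(ud, tiger))  # 1
```

This sketch only compares a pair that has already been aligned; finding the candidate pairs in the first place, and the manual filtering described above, are outside its scope.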

*To be accurate: ASCII double quotes do appear a few times in TIGER, but they look like mistakes.

ud-tiger-misalignments.csv.txt

dan-zeman commented 6 years ago

Thanks for the detailed analysis! I will look into this issue (although I agree that it may be difficult to fix everything).

dan-zeman commented 6 months ago

Some of the issues may have been fixed in #15 and #16 but we should check #17 (or the source at https://github.com/UniversalDependencies/UD_German-GSD/compare/master...adrianeboyd:UD_German-GSD:mateposfeats-tiger-inserted) for any fixes that did not make it to the dev branch.

Furthermore, there are still 360 instances of FixTigerDep=Yes in the MISC column. Those should be checked manually before closing this issue.
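
To track progress on these, the remaining instances can be counted per file. The sketch below assumes standard CoNLL-U token lines with ten tab-separated columns, MISC being the last one with `|`-separated attributes; the function name `count_misc_flag` is illustrative.

```python
def count_misc_flag(conllu_text, flag="FixTigerDep=Yes"):
    """Count token lines whose MISC column contains the given attribute."""
    count = 0
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # MISC is the 10th column; attributes are separated by "|".
        if len(cols) == 10 and flag in cols[9].split("|"):
            count += 1
    return count
```

Calling it on the contents of each of the train, dev and test files would show where the flagged tokens are concentrated.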