UniversalDependencies / UD_German-GSD

Capitalization of multiword tokens #7

Closed martinpopel closed 6 years ago

martinpopel commented 7 years ago

For example:

# text = zum Schluß gibt es sogar noch typische chinesische Kitschgeschenke.
1-2     zum     _       _       _       _       _       _       _       _
1       Zu      zu      ADP     APPR    _       3       case    _       _
2       dem     der     DET     ART     Case=Dat|Definite=Def|Gender=Masc,Neut|Number=Sing|PronType=Art 3       det     _       _
...

should be changed to (in UD v2.1):

# text = Zum Schluß gibt es sogar noch typische chinesische Kitschgeschenke.
1-2     Zum     _       _       _       _       _       _       _       _
1       zu      zu      ADP     APPR    _       3       case    _       _
2       dem     der     DET     ART     Case=Dat|Definite=Def|Gender=Masc,Neut|Number=Sing|PronType=Art 3       det     _       _
...

I am not sure if the form of the first word of the multiword token should be capitalized or not. There should be universal guidelines.

dan-zeman commented 6 years ago

I don't think there are universal guidelines for this. In theory, multi-word tokens can be expanded to anything. But I agree that if the writing system distinguishes uppercase and lowercase, then it is desirable to use them consistently on the 1-2 line and on the 1 and 2 lines. This is still the FORM column, after all.

Note that the German treebank's domain is user-produced web content, and there are sentences that start with lowercase because (presumably) the author wrote them that way. Also, the error is not systematic: I found a sentence where the 1-2 line is "Im" and line 1 is "In" (both capitalized). The above example (zum vs. Zu dem) should be made consistent with itself, but it is not clear whether we want to capitalize zum --> Zum (because it is correct) or lowercase Zu --> zu (because we believe that this is what was in the original text).
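
To get a sense of how many sentences are affected, a small script along these lines could list the cases where the multiword token line and its first word disagree in the case of their initial letter. This is only a sketch: the function name `find_case_mismatches` is made up for illustration, it assumes tab-separated CoNLL-U columns with sentences separated by blank lines, and it looks only at the first character of FORM.

```python
def find_case_mismatches(conllu_text):
    """Return (range_id, mwt_form, first_word_form) triples where the
    multiword token and its first word differ in initial-letter case."""
    mismatches = []
    mwt_forms = {}  # ID of first word in range -> (range ID, MWT form)
    for line in conllu_text.splitlines():
        if not line.strip():
            mwt_forms = {}  # blank line: sentence boundary
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) < 2:
            continue  # not a token line (e.g. an elided "...")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Range line such as "1-2": remember its form for word 1.
            mwt_forms[tok_id.split("-")[0]] = (tok_id, form)
        elif tok_id in mwt_forms:
            range_id, mwt_form = mwt_forms.pop(tok_id)
            if mwt_form[:1].isupper() != form[:1].isupper():
                mismatches.append((range_id, mwt_form, form))
    return mismatches


if __name__ == "__main__":
    sample = "\n".join([
        "# text = zum Schluß gibt es",
        "1-2\tzum\t_\t_\t_\t_\t_\t_\t_\t_",
        "1\tZu\tzu\tADP\tAPPR\t_\t3\tcase\t_\t_",
        "2\tdem\tder\tDET\tART\t_\t3\tdet\t_\t_",
    ])
    for range_id, mwt_form, word_form in find_case_mismatches(sample):
        print(range_id, mwt_form, word_form)
```

The check reports the "zum" / "Zu" pair but not the "Im" / "In" pair, since the latter is already consistent. It does not decide which direction the fix should go.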

dan-zeman commented 6 years ago

Fixed in b7fd946daa395f1ac4152e325703991fbc0227b1. In the end, I left the first word capitalized and made the MWT and the sentence text also capitalized.
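
For reference, the kind of change described here (first word stays capitalized, the MWT line and the sentence text are capitalized to match) could be sketched as below. This is not the script used for the commit; `capitalize_sentence_initial_mwt` is a hypothetical helper that assumes tab-separated CoNLL-U lines for a single sentence and only handles a multiword token that starts at word 1.

```python
def capitalize_sentence_initial_mwt(lines):
    """Given the lines of one sentence, capitalize the sentence-initial
    MWT form and the '# text' line if word 1 is already capitalized."""
    rows = [line.split("\t") for line in lines]
    mwt_idx = next((i for i, r in enumerate(rows)
                    if len(r) > 1 and r[0].startswith("1-")), None)
    word_idx = next((i for i, r in enumerate(rows)
                     if len(r) > 1 and r[0] == "1"), None)
    if mwt_idx is None or word_idx is None:
        return list(lines)  # no sentence-initial MWT: nothing to do
    old_form = rows[mwt_idx][1]
    word_form = rows[word_idx][1]
    if not word_form[:1].isupper() or old_form[:1].isupper():
        return list(lines)  # already consistent, or word 1 is lowercase
    new_form = old_form[:1].upper() + old_form[1:]
    out = list(lines)
    rows[mwt_idx][1] = new_form
    out[mwt_idx] = "\t".join(rows[mwt_idx])
    prefix = "# text = "
    for i, line in enumerate(out):
        # Only touch the text comment if it really starts with the old form.
        if line.startswith(prefix + old_form):
            out[i] = prefix + new_form + line[len(prefix) + len(old_form):]
    return out


if __name__ == "__main__":
    sentence = [
        "# text = zum Schluß gibt es",
        "1-2\tzum\t_\t_\t_\t_\t_\t_\t_\t_",
        "1\tZu\tzu\tADP\tAPPR\t_\t3\tcase\t_\t_",
        "2\tdem\tder\tDET\tART\t_\t3\tdet\t_\t_",
    ]
    print("\n".join(capitalize_sentence_initial_mwt(sentence)))
```

Sentences whose first word is lowercase are left untouched, which keeps cases where the author presumably wrote the sentence in lowercase.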