martinpopel closed this 6 years ago
I don't think there are universal guidelines for this. In theory, multi-word tokens can be expanded to anything. But I agree that if the writing system distinguishes uppercase and lowercase, then it is desirable to use case consistently between the multi-word token line (1-2) and the word lines (1 and 2). This is still the FORM column, after all.
Note that the German treebank's domain is user-produced web content, and there are sentences that start with lowercase because (presumably) the author wrote them that way. Also, the error is not systematic: I found a sentence where the 1-2 line is "Im" and line 1 is "In" (both capitalized). The above example (zum vs. Zu dem) should be made consistent with itself, but it is not clear whether we want to capitalize zum --> Zum (because it is correct) or lowercase Zu --> zu (because we believe that this is what was in the original text).
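To find how many sentences are affected, a check along these lines might help. It is only a minimal sketch, not an existing UD tool: the function name `mwt_case_mismatches` and the sample sentence are made up for illustration (only the zum / Zu dem pair comes from this thread), and it assumes tab-separated CoNLL-U with the ID in column 1 and FORM in column 2. It reports every multi-word token whose first letter differs in case from the first letter of its first word.

```python
def mwt_case_mismatches(conllu_text):
    """Return (range_id, mwt_form, first_word_form) for each multi-word token
    whose initial-letter case differs from that of its first syntactic word."""
    forms = {}    # word ID -> FORM, for ordinary word lines of the current sentence
    ranges = []   # (range_id, first_word_id, FORM), for multi-word token lines
    mismatches = []

    def flush():
        # Compare each multi-word token with its first word, then reset.
        for range_id, first_id, mwt_form in ranges:
            word_form = forms.get(first_id)
            if word_form and mwt_form[:1].isupper() != word_form[:1].isupper():
                mismatches.append((range_id, mwt_form, word_form))
        forms.clear()
        ranges.clear()

    for line in conllu_text.splitlines():
        if not line.strip():          # blank line ends a sentence
            flush()
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            continue
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:             # multi-word token range, e.g. "1-2"
            ranges.append((tok_id, tok_id.split("-")[0], form))
        elif "." not in tok_id:       # skip empty nodes such as "1.1"
            forms[tok_id] = form
    flush()                           # last sentence may lack a trailing blank line
    return mismatches


# Hypothetical sentence built around the zum / Zu dem pair discussed above.
SAMPLE = "\n".join([
    "# text = zum Beispiel",
    "1-2\tzum\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tZu\tzu\tADP\t_\t_\t3\tcase\t_\t_",
    "2\tdem\tder\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tBeispiel\tBeispiel\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
])

print(mwt_case_mismatches(SAMPLE))
```

The check only flags the inconsistency. It does not decide which direction to normalize (zum --> Zum or Zu --> zu), since that is the open question here.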
Fixed in b7fd946daa395f1ac4152e325703991fbc0227b1. In the end, I left the first word capitalized and made the MWT and the sentence text also capitalized.
For example:
should be changed to (in UDv2.1)
I am not sure if the form of the first word of the multiword token should be capitalized or not. There should be universal guidelines.