UniversalDependencies / UD_German-GSD

PROPN in multiword tokens #9

Closed kanayamah closed 3 years ago

kanayamah commented 7 years ago

Generally, German multiword tokens are a combination of ADP and DET (e.g. "zur" and "beim"), but I found that such tokens are tagged PROPN and DET when they are embedded in proper nouns. For example:

18  Karlsson    Karlsson    PROPN   NN  Case=Acc|Gender=Masc|Number=Sing    22  obj _   _
19-20   vom _   _   _   _   _   _   _   _
19  von von PROPN   APPR    _   21  case    _   _
20  dem der DET ART Case=Dat|Definite=Def|Gender=Masc,Neut|Number=Sing|PronType=Art 21  det _   _
21  Dach    Dach    PROPN   NN  Case=Dat|Gender=Masc,Neut|Number=Sing   18  nmod    _   _

If the token is part of a proper noun, both "von" and "dem" should be consistently tagged PROPN. If instead the decomposed words should keep their original PoS, then "von" should be ADP as long as "dem" is DET, shouldn't it?
There are more than 50 cases in the training corpus.
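
For anyone who wants to reproduce the count, here is a minimal sketch of how such cases could be listed. It assumes standard tab-separated CoNLL-U input with sentences separated by blank lines; the function name `find_propn_in_mwt` is made up for illustration and is not part of any UD tool.

```python
import re


def find_propn_in_mwt(conllu_text):
    """List multiword tokens that have at least one part tagged PROPN.

    Returns a list of (surface_form, [(form, upos), ...]) tuples.
    Assumes well-formed, tab-separated CoNLL-U (10 columns per token line).
    """
    hits = []
    # Sentences are separated by blank lines in CoNLL-U.
    for sentence in re.split(r"\n\s*\n", conllu_text.strip()):
        rows = [
            line.split("\t")
            for line in sentence.splitlines()
            if line and not line.startswith("#")  # skip comment lines
        ]
        by_id = {row[0]: row for row in rows}
        for row in rows:
            # Multiword token lines have a range ID such as "19-20".
            if "-" not in row[0]:
                continue
            start, end = (int(x) for x in row[0].split("-"))
            parts = [by_id[str(i)] for i in range(start, end + 1) if str(i) in by_id]
            # Column 4 (index 3) is UPOS.
            if any(part[3] == "PROPN" for part in parts):
                hits.append((row[1], [(part[1], part[3]) for part in parts]))
    return hits
```

Applied to the training file contents, `len(find_propn_in_mwt(text))` would give the number of affected multiword tokens, and each hit shows the per-word tags so that mixed PROPN/DET pairs like the one above stand out.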

dan-zeman commented 7 years ago

Thanks, Hiroshi, for the report. This is a systematic bug that has appeared in several treebanks converted from Google data, including German. All words in multi-word named entities are tagged PROPN, regardless of their real part of speech; this is a violation of the UD guidelines.

In your example, von should be tagged ADP. Even Dach should be NOUN and not PROPN.

(Edit: And yes, the determiner dem was originally also tagged PROPN. One of my scripts has already fixed the determiner tags but it did not fix the preposition tags.)
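
To illustrate the kind of repair meant here, the following is only a rough sketch and not the script actually used for the treebank. It assumes that the language-specific tag in the XPOS column is still reliable and retags a word from PROPN to ADP whenever its XPOS is APPR (the STTS preposition tag seen on "von" in the example). The function name `retag_prepositions` is hypothetical. Nouns such as "Dach" are deliberately left alone, because XPOS NN alone cannot tell them apart from real names like "Karlsson".

```python
def retag_prepositions(conllu_text):
    """Change UPOS from PROPN to ADP on words whose XPOS is APPR.

    Illustrative sketch only: assumes tab-separated CoNLL-U with 10 columns
    and that XPOS (column 5) carries trustworthy STTS tags.
    """
    fixed_lines = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        is_token_line = len(cols) == 10 and not line.startswith("#")
        # cols[3] is UPOS, cols[4] is XPOS.
        if is_token_line and cols[3] == "PROPN" and cols[4] == "APPR":
            cols[3] = "ADP"
            line = "\t".join(cols)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)
```

Comment lines, blank lines, and multiword token range lines pass through unchanged, so the output can be diffed directly against the input to review each retagged word.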

kanayamah commented 7 years ago

Thank you, Dan. Waiting for the fix.

AngledLuffa commented 3 years ago

This still appears to be an issue for some words, especially "für":

https://github.com/stanfordnlp/CoreNLP/issues/1184

You can see some of the possibly incorrect tags I spotted in that issue. I don't actually know if those are correct or not, though.

dan-zeman commented 3 years ago

Fixed in a5cd3e29a19a7ae07990acaaa742c18523133df5.

AngledLuffa commented 3 years ago

Awesome, thanks