kanayamah closed this issue 3 years ago
Thanks, Hiroshi, for the report. This is a systematic bug that has appeared in several treebanks converted from Google data, including German. All words in multi-word named entities are tagged PROPN, regardless of their real part of speech; this is a violation of the UD guidelines.
In your example, von should be tagged ADP. Even Dach should be NOUN and not PROPN.
(Edit: And yes, the determiner dem was originally also tagged PROPN. One of my scripts has already fixed the determiner tags but it did not fix the preposition tags.)
Thank you, Dan. Waiting for the fix.
This appears to still be an issue on some words, especially für
https://github.com/stanfordnlp/CoreNLP/issues/1184
You can see some of the possibly incorrect tags I spotted in that issue. I don't actually know if those are correct or not, though.
Fixed in a5cd3e29a19a7ae07990acaaa742c18523133df5.
Awesome, thanks
Generally, German multiword tokens are a combination of ADP and DET (e.g. "zur" and "beim"), but I found that such tokens embedded in proper nouns are tagged PROPN and DET. For example, in one such token "von" is tagged PROPN while "dem" is tagged DET.
If the token is part of a PROPN, both "von" and "dem" should be consistently tagged as PROPN; and if the original PoS should be tagged on the decomposed words, "von" should be ADP as long as "dem" is DET, shouldn't it?
There are more than 50 cases in the training corpus.
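One way to list such cases is to scan the CoNLL-U file for multiword tokens whose parts mix PROPN with another tag. The following is only a minimal sketch: the function name `find_mixed_mwt_tags` is hypothetical, and the sample rows are invented for illustration rather than copied from the treebank. It assumes the standard 10-column, tab-separated CoNLL-U layout, in which a multiword token has a range ID such as `1-2`.

```python
def find_mixed_mwt_tags(conllu_text):
    """Return (surface_form, [(word, upos), ...]) for every multiword token
    whose parts include PROPN together with at least one other UPOS tag."""
    hits = []
    pending = None  # (surface form, ID of the last part, parts collected so far)
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            pending = None  # sentence boundary or comment line
            continue
        cols = line.split("\t")
        tok_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in tok_id:
            # Range line such as "1-2": the following words are its parts.
            _, end = tok_id.split("-")
            pending = (form, int(end), [])
            continue
        if pending is None or "." in tok_id:
            continue  # ordinary word outside a range, or an empty node
        surface, end, parts = pending
        parts.append((form, upos))
        if int(tok_id) == end:
            tags = {tag for _, tag in parts}
            if "PROPN" in tags and len(tags) > 1:
                hits.append((surface, parts))
            pending = None
    return hits


# Invented sample rows (not taken from the treebank), joined with tabs.
SAMPLE = "\n".join("\t".join(row) for row in [
    ("1-2", "vom", "_", "_", "_", "_", "_", "_", "_", "_"),
    ("1", "von", "von", "PROPN", "_", "_", "3", "case", "_", "_"),
    ("2", "dem", "der", "DET", "_", "_", "3", "det", "_", "_"),
    ("3", "Dach", "Dach", "PROPN", "_", "_", "0", "root", "_", "_"),
    ("4-5", "zur", "_", "_", "_", "_", "_", "_", "_", "_"),
    ("4", "zu", "zu", "ADP", "_", "_", "6", "case", "_", "_"),
    ("5", "der", "der", "DET", "_", "_", "6", "det", "_", "_"),
    ("6", "Welt", "Welt", "NOUN", "_", "_", "3", "nmod", "_", "_"),
])

for surface, parts in find_mixed_mwt_tags(SAMPLE):
    print(surface, parts)
```

On the invented sample, only "vom" is reported, because "zur" has the regular ADP + DET combination. Whether a reported token should be retagged is still a judgment against the UD guidelines; the scan only narrows down the candidates.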