POS tagging

Kpetyxova commented 4 years ago

Hello! I wondered if you could explain one thing about your POS tagging to me. Why do you often mark participles as NOUNs? For example, see the third token line of this sentence:

# newdoc id = n01024
# sent_id = n01024013
# text = تقدم النتائج المستخلصة من هذه التجربة تأكيداً إضافياً على أنه يمكن نقل البذور الصاروخية وتخزينها في المحطة الفضائية الدولية لمدة ستة أشهر من دون أن يكون لذلك أي تأثيرات كبيرة على قدرتها على الإنبات والنمو على الأرض.
# original_text = تقدم النتائج المستخلصة من هذه التجربة تأكيدا إضافيا على أنه يمكن نقل البذور الصاروخية وتخزينها في المحطة الفضائية الدولية لمدة ستة أشهر من دون أن يكون لذلك أي تأثيرات كبيرة على قدرتها على الإنبات والنمو على الأرض.
# text_en = The results from this experiment provides further support that rocket seeds can be flown and stored on the International Space Station for six months without having any significant impacts on their ability to germinate and grow on Earth.
1   تقدم    taqad~am_1  VERB    VBC Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Act    0   root    _   _
2   النتائج natiyjap_1  NOUN    NN  Animacy=Nhum|Case=Nom|Definite=Def|Gender=Fem|Number=Plur   1   nsubj   _   _
3   المستخلصة   musotaxolaS_1   NOUN    VBN Case=Nom|Definite=Def|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass   2   acl _   _
4   من  min_1   ADP IN  _   6   case    _   _
5   هذه h`*A_1  PRON    PDEM    Case=Gen|Gender=Fem|Number=Sing 6   det:predet  _   _
6   التجربة tajoribap_1 NOUN    NN  Animacy=Nhum|Case=Gen|Definite=Def|Gender=Fem|Number=Sing   3   obl _   _
...

I just don't quite understand why this sentence has a NOUN tag on a word with the acl relation (I would rather expect to see an ADJ tag there). Thank you in advance!

dan-zeman commented 4 years ago

This seems to be a bug introduced during conversion of the data. The original annotation was provided by Google folks in an annotation scheme that resembles UD but does not quite follow the UD guidelines; you can see the original annotation here. Some general conversion towards UD-style annotation was done by @martinpopel and myself. The most important contribution, however, was by @dima-taji, who retokenized the data (the original tokenization did not comply with the other Arabic treebanks, e.g., al- was treated as a separate determiner word, while in fact it should be analyzed as morphological inflection of the noun/adjective) and provided the lemmas using a lemmatizer they have at NYUAD.

In the original annotation, the word مستخلصة was tagged VERB (and there is still the XPOS tag VBN), i.e., it headed a clause, and since it was attached to a NOUN, it made sense to label the relation acl (it was probably me or @martinpopel who introduced the label, as the original data uses a non-UD relation, partmod). After Dima's reassessment of the tokenization and morphology, the word became a NOUN, but we forgot to also change the relation to nmod. (That is, if the NOUN tag is correct. I cannot judge that because I don't speak Arabic. If the tag is NOUN, the nmod relation becomes likely, although I cannot exclude that it is still a clause (acl) with a nominal predicate. If the tag is ADJ, then similarly amod becomes likely, but again, acl is not automatically excluded.)
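
To see how widespread the mismatch is, one could scan the treebank for tokens whose UPOS is NOUN but which still carry VerbForm=Part or the acl relation. The following is only a minimal sketch, not part of the actual conversion pipeline: the names Token, read_conllu and suspicious_participles are hypothetical, the sample rows are copied from the sentence quoted in this issue, and it assumes standard tab-separated 10-column CoNLL-U. A hit is a candidate for manual review, not automatically an error, since acl with a nominal predicate is still possible.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """The CoNLL-U columns needed for this check (hypothetical helper type)."""
    sent_id: str
    tok_id: str
    form: str
    upos: str
    xpos: str
    feats: Dict[str, str]
    head: str
    deprel: str


def parse_feats(column: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Voice=Pass' into a dict; '_' means no features."""
    if column == "_":
        return {}
    return dict(pair.split("=", 1) for pair in column.split("|"))


def read_conllu(lines: Iterable[str]) -> Iterator[Token]:
    """Yield regular word tokens from tab-separated CoNLL-U lines."""
    sent_id = ""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed rows, multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        yield Token(sent_id, cols[0], cols[1], cols[3], cols[4],
                    parse_feats(cols[5]), cols[6], cols[7])


def suspicious_participles(tokens: Iterable[Token]) -> List[Token]:
    """NOUN tokens that still look clausal: participle features or acl."""
    return [
        t for t in tokens
        if t.upos == "NOUN"
        and (t.feats.get("VerbForm") == "Part" or t.deprel == "acl")
    ]


# Three rows from the sentence quoted in this issue, re-joined with tabs.
SAMPLE = [
    "# sent_id = n01024013",
    "\t".join(["2", "النتائج", "natiyjap_1", "NOUN", "NN",
               "Animacy=Nhum|Case=Nom|Definite=Def|Gender=Fem|Number=Plur",
               "1", "nsubj", "_", "_"]),
    "\t".join(["3", "المستخلصة", "musotaxolaS_1", "NOUN", "VBN",
               "Case=Nom|Definite=Def|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass",
               "2", "acl", "_", "_"]),
    "\t".join(["4", "من", "min_1", "ADP", "IN", "_", "6", "case", "_", "_"]),
]

hits = suspicious_participles(read_conllu(SAMPLE))
for t in hits:
    print(t.sent_id, t.tok_id, t.form, t.upos, t.xpos, t.deprel)
```

To run the same check over a whole treebank, pass an open file handle (opened with encoding="utf-8") to read_conllu instead of SAMPLE; the file path is whatever your local copy of the data uses.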

amir-zeldes commented 4 years ago

in fact it should be analyzed as morphological inflection of the noun/adjective

I'm sure I've said something like this elsewhere, but I feel like I should bring this point up whenever there's a categorical statement that Arabic articles should not be tokens. The situation in Arabic is more or less exactly the same as in Hebrew and other languages where articles are treated as tokens, so this is a cross-linguistic inconsistency in UD.

I think the argument for not tokenizing them is that they are repeated before each adjective modifying a noun, so they are more like an 'agreement' category. But of course articles are not repeated for things like proper nouns in Arabic, and they do not agree in predication (unlike gender/number/case). Many other languages also allow article repetition, sometimes optionally (e.g. Classical Greek), and we do not suggest not tokenizing them there. So I would not say the original Google tokenization was wrong; it was basically the same as in other UD languages which separate articles even if they are repeated in some constructions.

dan-zeman commented 4 years ago

I am not saying the Google tokenization was wrong :-)

I am saying it did not match the tokenization in the other two (pre-existing and larger) Arabic UD treebanks. (I actually did not check NYUAD now but I think this is what I remember, and also @dima-taji maintains NYUAD, so it can be expected to be similar in that respect.)

amir-zeldes commented 4 years ago

Sure, and I'm happy to discuss more about it, I just feel like it's my duty to raise this flag about comparability whenever tokenization of articles in Semitic languages comes up ;)

UniversalDependencies / UD_Arabic-PUD

POS tagging #1