Open Kpetyxova opened 4 years ago
This seems to be a bug introduced during conversion of the data. The original annotation was provided by Google folks in an annotation scheme that resembles UD but does not quite follow the UD guidelines; you can see the original annotation here. Some general conversion towards UD-style annotation was done by @martinpopel and myself. The most important contribution, however, was by @dima-taji, who retokenized the data (the original tokenization did not comply with the other Arabic treebanks, e.g., al- was treated as a separate determiner word while in fact it should be analyzed as morphological inflection of the noun/adjective) and provided the lemmas using a lemmatizer they have at NYUAD.
In the original annotation, the word مستخلصة was tagged VERB (and there is still the XPOS tag VBN), i.e., it headed a clause, and since it was attached to a NOUN, it would make sense to label the relation acl (it was probably me or @martinpopel who introduced the label, as the original data uses a non-UD relation partmod). After Dima's reassessment of the tokenization and morphology, the word became a NOUN but we forgot to also change the relation to nmod. (That is, if the NOUN tag is correct. I cannot judge that because I don't speak Arabic. If the tag is NOUN, the nmod relation becomes likely, although I cannot exclude that it is still a clause (acl) with a nominal predicate. If the tag is ADJ, then similarly amod becomes likely but again, acl is not automatically excluded.)
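To find other tokens that may have been left in the same state, something like the following could list candidates for manual review. This is only a rough sketch: the function name `flag_acl_mismatches`, the mapping `LIKELY_RELATION`, and the sample sentence with placeholder forms are all made up for illustration, and the heuristic is just the one described above (NOUN suggests nmod, ADJ suggests amod). Since acl with a nominal predicate is still possible, the output is a list of things to look at, not a list of errors.

```python
# Heuristic taken from the comment above; these are suggestions, not rules.
LIKELY_RELATION = {"NOUN": "nmod", "ADJ": "amod"}


def flag_acl_mismatches(conllu_text):
    """Return (sentence_no, token_id, form, upos, suggested_deprel) tuples
    for tokens attached as acl whose UPOS is NOUN or ADJ."""
    flagged = []
    sentence_no = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False          # a blank line ends the sentence
            continue
        if not in_sentence:
            in_sentence = True
            sentence_no += 1
        if line.startswith("#"):
            continue                     # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue                     # not a well-formed CoNLL-U token line
        token_id, form, upos, deprel = cols[0], cols[1], cols[3], cols[7]
        if "-" in token_id or "." in token_id:
            continue                     # multiword token range or empty node
        # Compare the universal part only, so subtypes such as acl:relcl count.
        if deprel.split(":")[0] == "acl" and upos in LIKELY_RELATION:
            flagged.append(
                (sentence_no, token_id, form, upos, LIKELY_RELATION[upos])
            )
    return flagged


# Hypothetical sentence with placeholder forms, not taken from the treebank.
SAMPLE = "\n".join([
    "# hypothetical example",
    "\t".join(["1", "w1", "_", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "w2", "_", "NOUN", "VBN", "_", "1", "acl", "_", "_"]),
    "\t".join(["3", "w3", "_", "VERB", "_", "_", "1", "acl", "_", "_"]),
    "",
])

print(flag_acl_mismatches(SAMPLE))
```

Each flagged token would still need to be checked by someone who reads Arabic before the relation is changed.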
> in fact it should be analyzed as morphological inflection of the noun/adjective
I'm sure I've said something like this elsewhere, but I feel like I should bring this point up whenever there's a categorical statement that Arabic articles should not be tokens. The situation in Arabic is more or less exactly the same as in Hebrew and other languages where articles are treated as tokens, so this is a cross-linguistic inconsistency in UD.
I think the argument for not tokenizing them is that they are repeated before each adjective modifying a noun, so they are more like an 'agreement' category. But of course articles are not repeated for things like proper nouns in Arabic, they do not agree in predication (unlike gender/number/case) and many other languages also allow article repetition, sometimes optionally (e.g. Classical Greek), and we do not suggest not tokenizing them there. So I would not say the original Google tokenization was wrong, it was basically the same as in other UD languages which separate articles even if they are repeated in some constructions.
I am not saying the Google tokenization was wrong :-)
I am saying it did not match the tokenization in the other two (pre-existing and larger) Arabic UD treebanks. (I actually did not check NYUAD now but I think this is what I remember, and also @dima-taji maintains NYUAD, so it can be expected to be similar in that respect.)
Sure, and I'm happy to discuss more about it, I just feel like it's my duty to raise this flag about comparability whenever tokenization of articles in Semitic languages comes up ;)
Hello! I wondered if you could explain one thing about your POS tagging to me. Why do you often mark participles as NOUNs? For example, in this sentence in the third line:
I just don't quite understand why there is a NOUN tag in this sentence together with the acl relation (I would rather expect to see an ADJ tag there). Thank you in advance!