Closed mforst closed 4 years ago
+1 for standardizing this, but I'm not sure how feasible it would be to get people to change existing resources. For comparability with other UD treebanks, I would expect articles to be tokenized apart in Arabic as well (as a bonus it also reduces data sparseness in terms of lexical types).
And can they really justify that Maltese patterns with Hebrew rather than Arabic?
Yes: Maltese is Maltese, Arabic is Arabic. I mean, yes, if we're going to standardize things across language families, I agree with @amir-zeldes.
I am afraid that this is a heritage of tokenization traditions in the respective languages. Since its beginning, UD did not put enough stress on normalization of tokenization (as opposed to morphology and syntax), partially because there were pre-existing treebanks in which it was easier to convert tags and relations than to resegment words.
Maltese is different in that the definite morpheme is separated from the stem by a hyphen, so simple low-level tokenizers will tend to treat it as a word. On the other hand, in Arabic and Hebrew you have to split an uninterrupted surface string, and then one may ask why complicate things if there is precedent in other languages for expressing definiteness with affixes, and if we have features to capture this. So I would actually be more surprised that Hebrew differs from Arabic than that Maltese differs from either of the two.
It's true that segmenting articles in Arabic and Hebrew is not trivial, but there is good software out there for doing it. We have this tool for Hebrew and Arabic clitic splitting:
https://github.com/amir-zeldes/RFTokenizer
It gets 98.46% of super tokens split correctly in UD_Arabic-PADT, including article splitting (actually I imagine hardly any of the errors are due to articles, except perhaps in some proper names that are supposed to contain unsplit articles by convention, such as Al-Kuwait). Hebrew is a little harder, but not by much (98.21%).
But as @dan-zeman says, I think the problem is much more one of different traditions, so I suspect treebank maintainers may not want to change a standard that they're used to.
I was not so much interested in the technical feasibility of aligning Arabic with the other two languages. Instead, it seems to me that the choice not to split 'Al' in Arabic may well have been a conscious one. (After all, splitting 'Al' is not hard when you already split all the other clitics.) And I was interested in knowing more about the potential reasoning behind that presumably conscious choice.
Here are some considerations I could think of:
1) 'Al', unlike the definite article in Indoeuropean languages, is repeated across NPs consisting of a noun and adjectival modifiers rather than occurring once.
2) 'Al' interacts with the definiteness marking at the end of the following morpheme. (See https://universaldependencies.org/u/feat/Definite.html) Some (diacritized) forms and/or pronunciations are only possible with 'Al', others only without 'Al'.
3) The pronunciation of the consonant in 'Al' depends on the first phoneme of the following morpheme.
I'm in a situation where we have to figure out tokenization conventions for Arabic that suit the needs of both Automated Speech Recognition (ASR) and NLU. It seems that NLU could go either way wrt. 'Al', but splitting off 'Al' seems to make ASR's life considerably harder.
This being said, @amir-zeldes is, of course, right in stating that splitting off the definite marker "reduces data sparseness in terms of lexical types".
I can verify that the reason for not splitting off the article in Arabic was entirely to do with backwards compatibility with the underlying treebank and standard tools for tokenization and morphological analysis.
@mforst that all makes sense, but I'd like to point out that there are other treebanks with these properties where articles are split:
None of this is to say that Arabic shouldn't have its own tokenization guidelines, but if it's up for discussion, then those reasons alone haven't led to fused article representation in other languages.
I am not an expert on Arabic but one thing that strikes me is that nearly everyone in this thread talks about the morpheme as an “article”. For me, article is a class of words. But the availability of the term in English itself does not mean that we have to see the prefix as a word. It is a morpheme that switches the word to the definite state, i.e., contributes the feature Definite=Def, just like in other languages (Swedish, Norwegian, Romanian, Bulgarian, for example).
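To make the feature-based view concrete, here is a minimal Python sketch of how a definite noun would carry Definite=Def in the CoNLL-U FEATS column when the marker is not split off. The helper name `parse_feats` and the feature bundle are made up for illustration and are not taken from any treebank.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string (Name=Value pairs joined by '|') into a dict."""
    if feats == "_":  # '_' marks an empty FEATS column
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Hypothetical feature bundle for an unsplit definite noun.
unsplit_noun = parse_feats("Case=Nom|Definite=Def|Number=Sing")
print(unsplit_noun["Definite"])
```

Under the alternative analysis, the marker would be its own syntactic word and the definiteness information would additionally be recoverable from that token.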
I called it "definite marker" on purpose. I used 'article' only in quotes :)
@amir-zeldes , concerning DOM in Hebrew: Is that really triggered by ה, or is it definiteness (whether marked by ה or implicit, e.g. in proper names) that triggers it?
I don't know if there is a clear-cut rule for drawing the line between a prefix and an article, but for Hebrew it seems relevant that
As far as I know, these properties hold for Arabic as well.
@mforst you are right, DOM is triggered by definiteness, including proper nouns without 'ha'. The fused form [ta] only occurs with common nouns, when the object marker fuses with a potential 'ha'.
@nschneid that's all correct for colloquial Hebrew, though technically point 1 is different for 'normative' Hebrew (as spoken by newscasters and in formal speech), where you get the form 'he' before gutturals (so formal [he-xatul] for informal [ha-xatul], "the cat").
And finally I totally agree with @dan-zeman that definiteness marking could be construed as an inflectional category, especially when it doesn't map well onto segments, but the Semitic case is not so hard to treat as concatenative, and in any case there is an argument to be made that the situation is very similar in all of these languages, so a similar analysis might be expected in UD. For something like Romanian the fact that case and definiteness are expressed on the same morpheme complicates this, but for Arabic, Maltese and Hebrew these things only give definiteness and concatenate in very predictable ways, as pointed out by @nschneid .
A terminological quibble: the term "Arabic" is very broad, so we would be well advised to be precise about which variety we are referring to, whether it's Quranic Arabic (QA), Classical Arabic (CA), Modern Standard Arabic (MSA) or one of the vernacular varieties (which I will just for brevity's sake refer to as Neo-Arabic). Those are not the same thing, especially when it comes to syntax.
@dan-zeman
It is a morpheme that switches the word to the definite state
An excellent point, which is why I used the XPOS tag DEF :)
@mforst
'Al', unlike the definite article in Indoeuropean languages, is repeated across NPs consisting of a noun and adjectival modifiers rather than occurring once.
This is decidedly not true of Maltese (and other varieties of Neo-Arabic): by default, the definite article occurs only once in a NOUN ADJ NP, e.g. "L-ilsien Malti" ('the Maltese language'), "L-Unjoni Ewropea". The Classical/Modern Standard Arabic way of determining such NPs (i.e. DEF NOUN DEF ADJ) has a distinct reading, e.g. "Il-Port il-Kbir" refers to a particular great harbor, that between Valletta and the Three Cities; "it-tifel il-kbir" means "the oldest child" and not just "the big child".
Maltese DOM is mostly triggered by [+animacy].
@nschneid the written form of the Arabic definite marker 'Al' does not depend on the following noun/adjective, but the phonological form does, and quite heavily so. And depending on how you represent geminates (/n n/ vs. /n:/, for example), you cannot even split the definite marker from the noun/adjective when those are represented phonemically. E.g., الناس can, of course, be split into ال ناس, but what would you do with /? a n: a: s/ ? In this respect, Arabic 'Al' and Hebrew 'ha' seem to be different, so maybe the difference in tokenization conventions concerning the definite marker is actually justifiable.
@bulbulistan I'd think that the above also holds for Maltese, doesn't it? On the other hand, you're certainly right in pointing out that Maltese, unlike MSA, has adjectival modifiers in definite NPs that aren't preceded by the definite marker. (According to https://en.wikipedia.org/wiki/Maltese_language#Adjectives_and_adverbs, it is the historic origin of the adjective that determines whether it is preceded by the definite marker when modifying a definite noun: "Both nouns and adjectives of Semitic origin take the definite article (for example, It-tifel il-kbir, lit. "The boy the elder"="The elder boy"). This rule does not apply to adjectives of Romance origin.")
@amir-zeldes and @nschneid in lightly vowelled MSA and dialectal Arabic (let alone fully vowelled MSA), there are pronunciations that cannot be combined with the definite marker, e.g. /qa:?imat/='list' [Def:Cons]. Only /qa:?ima/ can be made [Def:Def] by means of 'Al'. In the orthography, this is often not visible. It is only with regular plural and dual forms that this is also reflected in the orthography: مهنديسي 'engineers' [Def:Cons], المهنديسي. If I understand you correctly, there are no similar interactions between the definite marker and definiteness-number(-case) suffixes* in Hebrew. If so, that might be another argument why it is OK for the tokenization conventions concerning the definite marker to differ between Hebrew and Arabic.
@mforst As @dan-zeman pointed out, in modern Maltese orthography, all the occurrences of the definite marker (and its fused forms) are separated by a hyphen.
but what would you do with /? a n: a: s/
Why not /ʔan na:s/, as with any assimilation?
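A rough Python sketch of that idea, assuming a simplified, hypothetical ASCII romanization with one character per consonant: the marker's consonant copies the first consonant of the stem when that consonant is one of the assimilating ("sun") consonants, and the result is still two tokens. The name `attach_definite` and the reduced consonant set are illustrative only.

```python
from typing import Tuple

# Subset of the assimilating ("sun") consonants that fit a plain one-character
# romanization; the full set also includes emphatics and interdentals.
SUN_CONSONANTS = set("tdrzsln")

def attach_definite(stem: str) -> Tuple[str, str]:
    """Return the definite marker and the stem as two phonological tokens."""
    first = stem[0]
    # The marker's /l/ assimilates to a following sun consonant.
    marker = "ʔa" + (first if first in SUN_CONSONANTS else "l")
    return marker, stem

print(attach_definite("na:s"))   # stem starts with a sun consonant
print(attach_definite("qamar"))  # stem starts with a non-assimilating consonant
```

Representing the geminate as /n n/ rather than /n:/ is exactly what keeps the split possible here.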
The Wiki entry on Maltese adjectives is wrong, as evident from the first example I gave or countless other examples featuring adjectives of Semitic origin like "il-ħajt għoli" ("the high wall"), "il-jiem sbieħ" ("the beautiful days"), "it-triq twila" ("the long road") or "is-somom kbar" ("the big amounts", note the wonderful broken plural of the Romance noun). The standard grammar of Maltese (Albert Borg and Marie Azzopardi-Alexander, Routledge 1997) has little to say on the subject; it basically restricts itself to describing DEF NOUN ADJ as an innovation and noting the impossibility of definiteness agreement with some adjectives, but certainly does not mention etymology as a factor. Gatt 2018 is the most recent work on the subject and also does not take etymology into account, noting the optionality of definiteness agreement in general, but its increased likelihood with some adjectives and in some pragmatic contexts.
@bulbulistan , maybe the Maltese definite marker is evolving from a prefix to an article in the Romance sense? Given the heavy influence of Italian/Sicilian and English on Maltese, that wouldn't be too surprising, would it? And maybe the Wiki entry is misleading, but etymology might still be a factor (among several) playing into the repetition of the definite marker before adjectival modifiers?
@mforst
maybe the Maltese definite marker is evolving from a prefix to an article in the Romance sense
It would not surprise me, it would also not be the first time this sort of thing has happened in Maltese. But if so, this has been going on for centuries, as there are 18th century texts which already exhibit this ("dic il meut cherħa u terribili" = "this ugly and terrible death"). Funny how few people noticed this...
And maybe the Wiki entry is misleading
It is not misleading, it is wrong; there is no indication etymology plays any role mentioned in any of the literature, nor does the corpus data bear it out.
@mforst the situation in Hebrew is pretty much the same:
As for representing the geminate form, this doesn't have to be a problem since conllu lets you use different forms for orthographic and syntactic tokens. You can just do:
1-2 annaas
1 al
2 naas
Like French "au" etc.
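For anyone who wants to see how the two layers are read back, here is a minimal Python sketch that parses such a multiword-token range from CoNLL-U text and returns the surface tokens and the syntactic words separately. The function name `read_tokens` is made up, and all columns other than ID and FORM are left as underscores because the example above does not specify them.

```python
# Three CoNLL-U lines (10 tab-separated columns each); only ID and FORM are filled in.
CONLLU = "\n".join([
    "1-2\tannaas\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tal\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tnaas\t_\t_\t_\t_\t_\t_\t_\t_",
])

def read_tokens(block):
    """Return (surface_tokens, syntactic_words) for one CoNLL-U sentence."""
    surface, words = [], []
    covered_until = 0  # last word ID covered by the most recent range line
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:  # multiword token range such as "1-2"
            covered_until = int(tok_id.split("-")[1])
            surface.append(form)
        elif "." in tok_id:  # empty nodes belong to neither layer here
            continue
        else:
            words.append(form)
            if int(tok_id) > covered_until:
                surface.append(form)  # ordinary word: surface form equals word form
    return surface, words

surface_tokens, syntactic_words = read_tokens(CONLLU)
print(surface_tokens, syntactic_words)
```

The range line carries the orthographic (or phonological) form, while the numbered lines carry the units that get tags and dependencies.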
I am changing the milestone to 2.6 although I think that perhaps the issue can be closed as it does not seem likely that it will lead to a change in the data or in the documentation.
This issue is related to #377, but much more narrow: I'd like to point out that the segmentation decisions regarding the definite marker differ between Semitic languages.
On https://universaldependencies.org/ar/index.html, we find this:
• Definite articles are treated as bound morphemes and they are not cut off during tokenization.
On https://universaldependencies.org/he/index.html, however, we find this:
Some function words in Hebrew (commonly known as משהוכלב) are attached as prefixes to the following token, but we represent them as separate tokens: [...] • The definite determiner ה - ‘the’
On https://universaldependencies.org/mt/index.html, finally, we find this:
• Hyphens is a delimiter for the definite article il- and its assimilated forms which include prepositions/case markers with fused definite definite article; these are - for the moment - treated as single tokens.
Are there good reasons for these differences? And can they really justify that Maltese patterns with Hebrew rather than Arabic?