Closed AngledLuffa closed 4 months ago
... although I mention adding the accents for AnCora, removing them from GSD and PUD would be just as satisfactory
The rule in Spanish (and generally in the original UD guidelines, until English went its own way :-)) is that the pieces have the FORM that they would have if they occurred as orthographic words. Therefore, the preferable solution is to remove the accents from GSD and PUD.
> Therefore, the preferable solution is to remove the accents from GSD and PUD.
Is that a viable solution, or would that be a weird modification to someone else's treebank?
> The rule in Spanish (and generally in the original UD guidelines, until English went its own way :-)) is that the pieces have the FORM that they would have if they occurred as orthographic words.
I don't think English was the first (or the last) to prefer MWTs with tokens that sum up to the whole. For example, UD_Arabic-PADT is older and keeps clitic pronouns in their orthographic form inside MWTs (ه does not become هُوَ). We copied that in UD_Hebrew-IAHLTwiki, and UD_Coptic is the same.
I think if there is a way to make MWT sub-tokenization concatenative, it is by far preferable, since it makes tokenization much easier in that language. I don't really see the value in removing the accents given that we also have lemmatization. It just means you now have to use a seq2seq tokenizer and have more chances of errors. Of course, in some cases it's inevitable, since you don't have a viable segmentation (e.g. Portuguese "a" can contain two tokens, so that's not solvable using concatenative MWTs), but if it's fairly easy to do I would absolutely prefer that in any language I work with.
Disclaimer: I don't really work much with UD Spanish and there may be really important existing tools or other corpora that favor stripping the accents, in which case that needs to be considered of course.
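To make the "concatenative" point concrete, here is a minimal sketch of the property being discussed: the sub-token forms simply join up to the surface token. The function name `is_concatenative` is made up for illustration and is not part of any UD tool.

```python
def is_concatenative(surface: str, pieces: list[str]) -> bool:
    """Return True if the sub-token forms join up to the surface token exactly."""
    return "".join(pieces) == surface


# Accent kept on the verb piece: the pieces sum up to the whole.
print(is_concatenative("preguntándome", ["preguntándo", "me"]))

# Accent removed from the verb piece: no longer a plain substring split.
print(is_concatenative("preguntándome", ["preguntando", "me"]))

# A contraction such as Spanish "del" cannot be split concatenatively as de + el.
print(is_concatenative("del", ["de", "el"]))
```

When the check holds, a tokenizer only has to predict split points inside the surface string; when it fails, it has to generate forms that are not substrings of the input.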
> I think if there is a way to make MWT sub-tokenization concatenative, it is by far preferable, since it makes tokenization much easier in that language.
Agreed, which is why I posted this issue here instead of GSD & PUD, since AnCora is the one which would change under that scheme. Although for Spanish, tokens such as "del" would be hard to make concatenative.
(Even in English we have some examples that are a bit of a stretch, such as "gonna".)
> I don't think English was the first (or the last) to prefer MWTs with tokens that sum up to the whole. For example, UD_Arabic-PADT is older and keeps clitic pronouns in their orthographic form inside MWTs (ه does not become هُوَ).
True. In the case of PADT, it is legacy tokenization from the pre-UD version of PADT. Whenever an orthographic word was split into multiple nodes in the original PADT, it was reflected as a multiword token in UD, without trying to revisit the rules and possibly adjust them in the UD spirit. Partly for time/capacity reasons, partly simply because the person doing the conversion (me) did not possess the necessary knowledge to even spot the issue. (Besides, I am not sure that هُوَ would be my preferred solution when I think of it now. I am still no expert on Arabic but it seems to me that this form is nominative while the required form would be accusative. It is possible that the expected form (paradigm slot) never occurs as a free form in the language; in such cases, taking a substring of the surface token is probably the only option.)
Note for myself: COSER sides with AnCora:
# sent_id = astu-480
# text = Lo que no pasó este año por aquí, preguntándome por la capilla yo creo que fueron pa la playa todos pero tenía que enseñales desde aquí por donde tenían que ir pa ya pero creo que taba así to l día.
10-11 preguntándome _ _ _ _ _ _ _ _
10 preguntando preguntar VERB _ VerbForm=Ger 4 advcl _ _
11 me yo PRON pc1cs000 Case=Dat|Number=Sing|Person=1|PrepCase=Npr|PronType=Prs 10 expl:pv _ _
12 por por ADP sps00 _ 14 case _ _
13 la el DET da0fs0 Definite=Def|Gender=Fem|Number=Sing|PronType=Art 14 det _ _
14 capilla capilla NOUN ncfs000 Gender=Fem|Number=Sing 10 obl _ _
> Note for myself: COSER sides with AnCora
Fair point. I had only checked the three I mentioned.
Would it be valid to rewrite the forms to match one or the other standard? I do agree with @amir-zeldes that keeping accents is preferable (bearing in mind my opinion is an engineering opinion, not a linguistic opinion) but my only really strong desire is to see them be unified somehow.
> I am not sure that هُوَ would be my preferred solution when I think of it now. I am still no expert on Arabic but it seems to me that this form is nominative while the required form would be accusative
This is a tricky position, because some environments are "MWT only", like you say. But arguably this is true even of the paradigm examples such as Romance article fusion: if we require a masculine article such that it is governed by "à" then it can only be "u" (which historically is true, it's just L-vocalization).
So the question is one of granularity: how specific we take the required form's environment to be. If we don't want specific prepositions to be an environment and we say it's "accusative" or deprel=obj, we could also think that the canonical Standard Arabic independent accusative pronoun is إياه and start putting that into every MWT with a clitic object. To be clear, I don't think this is the right thing to do at all. Ultimately, this kind of analysis seems to import many more assumptions and much more complexity into the treebank, whereas leaving the ه as is is both easy and better for engineering reasons.
Fixed UD_Spanish-PUD in https://github.com/UniversalDependencies/UD_Spanish-PUD/commit/36178ffe9813eefcebfeac3a977e77a96dab61af. It turns out it was already mostly in line with AnCora. Using
[áéí](r|ndo)(me|te|se|l[eoa]s?|nos|os){1,2}\t
I found 17 gerunds with clitics. Out of them, 16 were already good and only reuniéndo had to be fixed.
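For anyone who wants to repeat this kind of search on another treebank, a rough sketch follows. It applies the pattern quoted above to CoNLL-U text line by line. The helper name `find_candidates` and the sample lines are assumptions for illustration (the sample is the COSER example from earlier in the thread, re-typed with tabs); this is not necessarily how the search was actually run.

```python
import re

# The pattern quoted in the thread: an accented vowel, then "r" or "ndo",
# then one or two clitics, then the tab that ends a CoNLL-U column.
CLITIC_PATTERN = re.compile(r"[áéí](r|ndo)(me|te|se|l[eoa]s?|nos|os){1,2}\t")


def find_candidates(conllu_text: str) -> list[str]:
    """Return the lines of a CoNLL-U string on which the pattern matches."""
    return [line for line in conllu_text.splitlines() if CLITIC_PATTERN.search(line)]


sample = (
    "10-11\tpreguntándome\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "10\tpreguntando\tpreguntar\tVERB\t_\tVerbForm=Ger\t4\tadvcl\t_\t_\n"
    "11\tme\tyo\tPRON\t_\t_\t10\texpl:pv\t_\t_\n"
)

for line in find_candidates(sample):
    print(line.split("\t")[1])  # surface form of the multiword token
```

The pattern hits the multiword token line, not the pieces, so each hit still has to be followed by a look at the word lines underneath it to see whether the verb piece kept its accent.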
That's kind of funny, that I was manually searching and the first one I came across was the one which was different from all the others.
So it sounds like changing GSD would be the more accepted solution? Is that something we can do? Is that something I would need to do?
I am looking into it. There are 235 instances in GSD. Many of them are already in line with AnCora, too, some are not. But 15 of them have not been even segmented.
Thank you! LMK if you want me to take on any part of it
On Wed, Jun 19, 2024, 12:14 PM Dan Zeman @.***> wrote:
> I am looking into it. There are 235 instances in GSD. Many of them are already in line with AnCora, too, some are not. But 15 of them have not been even segmented.
Done. In the end the "many are already in line" claim held for dev and test data, while almost all instances in train had to be fixed.
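As an illustration of the kind of normalization involved (not a description of how the fix above was actually made), the verb piece can be brought to its free-standing spelling by dropping the acute accent. The helper `strip_acute` is hypothetical, and it assumes the accent is there only because of the attached clitic. That assumption fails for verbs whose free form carries an accent anyway, such as "reír" in "reírse", so such cases would need to be excluded.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent


def strip_acute(form: str) -> str:
    """Remove acute accents only, leaving other diacritics (ñ, ü) untouched."""
    decomposed = unicodedata.normalize("NFD", form)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


print(strip_acute("reuniéndo"))  # verb piece as it would be spelled on its own
print(strip_acute("enseñándo"))  # the tilde on ñ is kept
```

Removing only the combining acute, instead of all combining marks, is what keeps ñ and ü intact.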
Thank you, this will greatly improve the interoperability of the treebanks.
I found a tokenization difference between the Spanish datasets which makes them somewhat incompatible. If the clitics cause a word to acquire an accent, in GSD the pieces keep the accent, whereas in AnCora the pieces do not. In PUD the accents also remain. Would be great to unify them, perhaps by changing AnCora to keep the accents:
GSD
AnCora
PUD