Tokenization standard difference with GSD

AngledLuffa commented 4 months ago

I found a tokenization difference between the Spanish datasets which makes them somewhat incompatible. If the clitics cause a word to acquire an accent, in GSD the pieces keep the accent, whereas in AnCora the pieces do not. In PUD the accents also remain. Would be great to unify them, perhaps by changing AnCora to keep the accents:

GSD

# sent_id = es-train-003-s271
# text = Jacob, desempleado por una discusión que tuvo con Bretton James, y sabiendo que Winnie está esperando un hijo suyo, decide persuadir a Winnie de liberar el fideicomiso, para depositárselo a Gordon Gekko quien le ha prometido usarlos para consolidar una fortuna para Winnie y él.
33-35   depositárselo   _       _       _       _       _       _       _       _
33      depositár       depositar       VERB    _       VerbForm=Inf    28      advcl   _       _
34      se      él      PRON    _       Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      33      expl:pv _       _
35      lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     33      obj     _       _

AnCora

# sent_id = 3LB-CAST-a12-3-s7
# text = Luego se deciden a afeitárselo y entonces se dan cuenta de cuál es su verdadera carencia la carencia de bigote.
# orig_file_sentence 007#17
5-7     afeitárselo     _       _       _       _       _       _       _       _
5       afeitar afeitar VERB    vmn0000 VerbForm=Inf    3       xcomp   3:xcomp ArgTem=arg1:tem
6       se      él      PRON    _       Case=Dat|Person=3|PrepCase=Npr|PronType=Prs     5       obl:arg 5:obl:arg       _
7       lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     5       obj     5:obj   _

PUD

# newdoc id = n01040
# sent_id = n01040028
# text = Estudiantes como Rai han estado reuniéndose con consejeros en el colegio para hablar de lo que pasó, pero esta dice que el mayor consuelo lo obtiene viendo a sus amigos.
# text_en = Students like Rai have been meeting with counsellors at the school to talk about what happened, but she said the biggest comfort has come from seeing her friends.
6-7     reuniéndose     _       _       _       _       _       _       _       _
6       reuniéndo       reunir  VERB    VBG     VerbForm=Ger    0       root    _       _
7       se      él      PRON    SE      Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      6       compound:prt    _       _

AngledLuffa commented 4 months ago

... although I suggested adding the accents to AnCora, removing them from GSD and PUD would be just as satisfactory.

dan-zeman commented 4 months ago

The rule in Spanish (and generally in the original UD guidelines, until English went its own way :-)) is that the pieces have the FORM that they would have if they occurred as orthographic words. Therefore, the preferable solution is to remove the accents from GSD and PUD.
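As a rough illustration of that rule for the two patterns shown in this issue (infinitive and gerund plus clitics), the verb piece can be mapped back to its free-standing spelling. This is only a sketch: `free_form` is a hypothetical helper, it assumes precomposed (NFC) characters, it does not cover imperatives, and the hiatus exception for verbs like reír and oír is a heuristic rather than anything taken from the treebanks.

```python
import re

# Accented vowel directly before a gerund or infinitive ending.
_GERUND = re.compile(r"([áé])(ndo)$")
_INFINITIVE = re.compile(r"([áéí])(r)$")
_PLAIN = {"á": "a", "é": "e", "í": "i"}


def free_form(piece: str) -> str:
    """Return the spelling the verb piece would have as a standalone word.

    Hypothetical helper: only handles infinitives and gerunds.
    """
    m = _GERUND.search(piece) or _INFINITIVE.search(piece)
    if m is None:
        return piece  # nothing that looks like a clitic-induced accent
    pos = m.start(1)
    # Heuristic: in "reír", "oír" the accent marks a hiatus and belongs
    # to the free form too, so it is kept.
    if m.group(1) == "í" and pos > 0 and piece[pos - 1] in "aeo":
        return piece
    return piece[:pos] + _PLAIN[m.group(1)] + piece[pos + 1:]


print(free_form("depositár"))  # piece from the GSD example
print(free_form("reuniéndo"))  # piece from the PUD example
```

For infinitives, copying the LEMMA column into FORM would be a simpler alternative, since the two coincide in the examples above.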

AngledLuffa commented 4 months ago

Therefore, the preferable solution is to remove the accents from GSD and PUD.

Is that a viable solution, or would that be a weird modification to someone else's treebank?

amir-zeldes commented 4 months ago

The rule in Spanish (and generally in the original UD guidelines, until English went its own way :-)) is that the pieces have the FORM that they would have if they occurred as orthographic words.

I don't think English was the first (or the last) to prefer MWTs with tokens that sum up to the whole. For example, UD_Arabic-PADT is older and keeps clitic pronouns in their orthographic form inside MWTs (ه does not become هُوَ). We copied that in UD_Hebrew-IAHLTwiki, and UD_Coptic is the same.

I think if there is a way to make MWT sub-tokenization concatenative, it is by far preferable, since it makes tokenization much easier in that language. I don't really see the value in removing the accents given that we also have lemmatization. It just means you now have to use a seq2seq tokenizer and have more chances of errors. Of course in some cases it's inevitable, since you don't have a viable segmentation (e.g. Portuguese "à" can contain two tokens, so that's not solvable using concatenative MWTs), but if it's fairly easy to do I would absolutely prefer that in any language I work with.

Disclaimer: I don't really work much with UD Spanish and there may be really important existing tools or other corpora that favor stripping the accents, in which case that needs to be considered of course.
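To make "concatenative" concrete: an MWT is concatenative when joining the FORMs of its pieces reproduces the surface token exactly. A minimal sketch of that check, using the examples quoted in this issue (`is_concatenative` is a hypothetical name, not part of any UD tool):

```python
def is_concatenative(surface: str, pieces: list[str]) -> bool:
    """True if the piece FORMs, joined in order, spell the surface token."""
    return "".join(pieces) == surface


# GSD keeps the accent on the verb piece, so the pieces sum to the whole.
print(is_concatenative("depositárselo", ["depositár", "se", "lo"]))
# AnCora uses the free-standing form, so they do not.
print(is_concatenative("afeitárselo", ["afeitar", "se", "lo"]))
```

A tokenizer for a concatenative scheme only has to predict split points inside the surface string; otherwise it also has to rewrite characters.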

AngledLuffa commented 4 months ago

I think if there is a way to make MWT sub-tokenization concatenative, it is by far preferable, since it makes tokenization much easier in that language.

Agreed, which is why I posted this issue here instead of GSD & PUD, since AnCora is the one which would change under that scheme. Although for Spanish, tokens such as del would be hard to make concatenative.

(Even in English we have some examples that are a bit of a stretch, such as gonna.)

dan-zeman commented 4 months ago

I don't think English was the first (or the last) to prefer MWTs with tokens that sum up to the whole. For example, UD_Arabic-PADT is older and keeps clitic pronouns in their orthographic form inside MWTs (ه does not become هُوَ).

True. In the case of PADT, it is legacy tokenization from the pre-UD version of PADT. Whenever an orthographic word was split into multiple nodes in the original PADT, it was reflected as a multiword token in UD, without trying to revisit the rules and possibly adjust them in the UD spirit. Partly for time/capacity reasons, partly simply because the person doing the conversion (me) did not possess the necessary knowledge to even spot the issue. (Besides, I am not sure that هُوَ would be my preferred solution when I think of it now. I am still no expert on Arabic but it seems to me that this form is nominative while the required form would be accusative. It is possible that the expected form (paradigm slot) never occurs as a free form in the language; in such cases, taking a substring of the surface token is probably the only option.)

dan-zeman commented 4 months ago

Note for myself: COSER sides with AnCora:

# sent_id = astu-480
# text = Lo que no pasó este año por aquí, preguntándome por la capilla yo creo que fueron pa la playa todos pero tenía que enseñales desde aquí por donde tenían que ir pa ya pero creo que taba así to l día.
10-11   preguntándome   _   _   _   _   _   _   _   _
10  preguntando preguntar   VERB    _   VerbForm=Ger    4   advcl   _   _
11  me  yo  PRON    pc1cs000    Case=Dat|Number=Sing|Person=1|PrepCase=Npr|PronType=Prs 10  expl:pv _   _
12  por por ADP sps00   _   14  case    _   _
13  la  el  DET da0fs0  Definite=Def|Gender=Fem|Number=Sing|PronType=Art    14  det _   _
14  capilla capilla NOUN    ncfs000 Gender=Fem|Number=Sing  10  obl _   _

AngledLuffa commented 4 months ago

Note for myself: COSER sides with AnCora

Fair point. I had only checked the three I mentioned.

Would it be valid to rewrite the forms to match one or the other standard? I do agree with @amir-zeldes that keeping accents is preferable (bearing in mind my opinion is an engineering opinion, not a linguistic opinion) but my only really strong desire is to see them be unified somehow.

amir-zeldes commented 4 months ago

I am not sure that هُوَ would be my preferred solution when I think of it now. I am still no expert on Arabic but it seems to me that this form is nominative while the required form would be accusative

This is a tricky position, because some environments are "MWT only", like you say. But arguably this is true even of the paradigm examples such as Romance article fusion: if we require a masculine article such that it is governed by "à" then it can only be "u" (which historically is true, it's just L-vocalization).

So the question is one of granularity - how specific we take the required form's environment to be. If we don't want specific prepositions to be an environment and we say it's "accusative" or deprel=obj, we could also think that the canonical Standard Arabic independent accusative pronoun is إياه and start putting that into every MWT with a clitic object. To be clear, I don't think this is the right thing to do at all. Ultimately, this kind of analysis seems to import many more assumptions and much more complexity into the treebank, whereas leaving the ه as is is both easy and better for engineering reasons.

dan-zeman commented 4 months ago

Fixed UD_Spanish-PUD in https://github.com/UniversalDependencies/UD_Spanish-PUD/commit/36178ffe9813eefcebfeac3a977e77a96dab61af. It turns out it was already mostly in line with AnCora. Using

[áéí](r|ndo)(me|te|se|l[eoa]s?|nos|os){1,2}\t

I found 17 gerunds with clitics. Out of them, 16 were already good and only reuniéndo had to be fixed.
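A search like this can be reproduced over a CoNLL-U file with a few lines of Python. The sketch below is an assumption about how one might run it, not the script actually used; `find_candidates` is a hypothetical name. Because the pattern ends in a tab, it hits the surface form on the MWT range line (or an unsegmented token), and as written it also admits infinitive endings in -r.

```python
import re

# The pattern quoted above: accented vowel, verb ending, one or two
# clitics, then the tab that closes the FORM column.
CLITIC_VERB = re.compile(r"[áéí](r|ndo)(me|te|se|l[eoa]s?|nos|os){1,2}\t")


def find_candidates(conllu_text: str) -> list[tuple[str, str]]:
    """Return (ID, FORM) for every token line matching the pattern."""
    hits = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        if CLITIC_VERB.search(line):
            cols = line.split("\t")
            hits.append((cols[0], cols[1]))
    return hits


sample = (
    "# sent_id = n01040028\n"
    "6-7\treuniéndose\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "6\treuniéndo\treunir\tVERB\tVBG\tVerbForm=Ger\t0\troot\t_\t_\n"
    "7\tse\tél\tPRON\tSE\t_\t6\tcompound:prt\t_\t_\n"
)
print(find_candidates(sample))
```

Each hit still needs a look at the following piece lines to see whether the verb piece carries the accent.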

AngledLuffa commented 4 months ago

That's kind of funny: I was searching manually, and the first one I came across was the one that differed from all the others.

So it sounds like changing GSD would be the more accepted solution? Is that something we can do? Is that something I would need to do?

dan-zeman commented 4 months ago

I am looking into it. There are 235 instances in GSD. Many of them are already in line with AnCora, too; some are not. But 15 of them have not even been segmented.

AngledLuffa commented 4 months ago

Thank you! LMK if you want me to take on any part of it

dan-zeman commented 4 months ago

Done. In the end the "many are already in line" claim held for dev and test data, while almost all instances in train had to be fixed.

AngledLuffa commented 4 months ago

Thank you, this will greatly improve the interoperability of the treebanks.

UniversalDependencies / UD_Spanish-AnCora

Tokenization standard difference with GSD #9