UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/

Clitic splitting in Romance languages #315

Open dan-zeman opened 8 years ago

dan-zeman commented 8 years ago

Copied from an e-mail by @andre-martins: Are there guidelines for handling clitic pronouns? Right now there are some inconsistencies. For example,

fazê-lo -> faze-|lo
dotá-lo -> dotá-|lo
Fê-lo   -> Fez-|lo

I think part of the difference comes from different handling of the sentences coming from the European and the Brazilian corpora.

If there are no current guidelines, I propose handling clitics as in the CINTIL Portuguese corpus. The above examples would become:

fazê-lo -> fazê#|-lo
dotá-lo -> dotá#|-lo
Fê-lo   -> Fê#|-lo

Note that the hyphen now becomes part of the clitic and not part of the verb. The # symbol encodes that a consonant (r, s or z) was eliminated from the verb form due to the presence of the clitic. When no consonant is eliminated, we simply have:

fazer-se  -> fazer|-se
fazer-lhe -> fazer|-lhe

For mesoclitics, we have:

far-lhe-ão -> far-CL-ão|-lhe
fá-lo-ia   -> fá#-CL-ia|-lo

I.e., the same as above, with -CL- marking the original position of the clitic. Those rules are easy to encode in a tokenizer.
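
For illustration, here is a rough Python sketch of such rules (not CINTIL's actual tokenizer; the clitic inventory is abbreviated and the elision heuristic is simplified):

```python
import re

# A rough sketch, not the actual CINTIL tokenizer; the clitic inventory is
# abbreviated and the function names are illustrative only.
CLITIC = r"(?:me|te|se|lhes?|nos|vos|l?[oa]s?)"

def split_enclitic(token):
    """fazê-lo -> ['fazê#', '-lo']; fazer-se -> ['fazer', '-se']"""
    m = re.fullmatch(rf"(\S+?)-({CLITIC})", token)
    if not m:
        return [token]
    verb, clitic = m.groups()
    # '#' records that a final r/s/z was dropped before an l-initial clitic.
    if clitic.startswith("l") and not verb.endswith(("r", "s", "z")):
        verb += "#"
    return [verb, "-" + clitic]

def split_mesoclitic(token):
    """far-lhe-ão -> ['far-CL-ão', '-lhe']; fá-lo-ia -> ['fá#-CL-ia', '-lo']"""
    m = re.fullmatch(rf"(\S+?)-({CLITIC})-(\S+)", token)
    if not m:
        return [token]
    stem, clitic, ending = m.groups()
    if clitic.startswith("l") and not stem.endswith(("r", "s", "z")):
        stem += "#"
    return [f"{stem}-CL-{ending}", "-" + clitic]
```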

dan-zeman commented 8 years ago

We do not have any Portuguese-specific guidelines for clitic pronouns yet (because we have no documentation for Portuguese, unfortunately). But we have some general guidelines, and an example from Spanish that we show quite frequently in UD (vámonos = vamos + nos). Now the important difference in Spanish is that no hyphen is used and you really have to split a surface token that cannot be split based on language-independent criteria. The expected annotation is then

1-2    vámonos   _          _
1      vamos     ir         VERB
2      nos       nosotros   PRON

If the correct spelling were vamos-nos, we could also say that it is split to three tokens, based on the hyphen, and the “1-2” line would not be needed. I think that this is the case in Catalan (anem-nos); however, in UD_Catalan 1.3 it is done similarly to Spanish, and the hyphen is kept as a part of the pronoun form, to make the annotation more parallel:

1-2    anem-nos   _      _
1      anem       anar   VERB
2      -nos       -nos   PRON

I was not sure whether we should do the same in Portuguese (and I was actually not aware that we keep the hyphen as part of the pronoun in Catalan until I investigated today; I also doubt that the lemma should be -nos, @hectormartinez). But if there are spelling changes on the verb caused by the clitic, then I think that is one more reason for encoding it as a syntactic word that is part of a multi-word surface token. So I would propose the following (without any technical characters such as #):

1-2    Fê-lo   _       _
1      Fez     fazer   VERB
2      lo      ele     PRON
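
For illustration, this option presupposes a way to restore the underlying verb form, which in practice means a lexicon of the changed forms. A toy sketch, with a made-up mapping:

```python
# Hypothetical lexicon mapping truncated verb forms back to their underlying
# forms; the entries here are only examples, not a complete resource.
RESTORED = {"fê": "fez", "fazê": "fazer", "dotá": "dotar", "fá": "faz"}

def split_option1(mwt):
    """'Fê-lo' -> ('Fez', 'lo')"""
    verb, clitic = mwt.split("-", 1)
    restored = RESTORED.get(verb.lower(), verb.lower())
    if verb[0].isupper():
        restored = restored.capitalize()
    return restored, clitic
```
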
andre-martins commented 8 years ago

I see your point. I also don't like the technical characters "#" much. So I think the two remaining options are:

1-2    Fê-lo   _       _
1      Fez     fazer   VERB
2      lo      ele     PRON

and

1-2    Fê-lo   _       _
1      Fê      fazer   VERB
2      -lo     ele     PRON

where the first option is closer to Spanish 1.3 and the second one to Catalan 1.3.

For languages where clitics have hyphens (Portuguese, Catalan), I have a preference for the second option, for the following reasons:

For languages where clitics do not have hyphens (e.g. Spanish, Italian), the concerns are a little different, since doing the correct splitting is much less trivial: e.g. in Spanish "darte" is dar + te, but "parte" is just a noun, with no clitic. So there a lexicon of verbs or a statistical model may be unavoidable.
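
For illustration, the kind of lexicon check this implies might look like the following sketch (the tiny verb and clitic sets are made up for the example, not a real resource):

```python
# Illustrative only: a real system would need a full verb-form lexicon or a
# statistical model; these tiny sets just demonstrate the ambiguity problem.
VERB_FORMS = {"dar", "ver", "decir"}
CLITICS = {"me", "te", "se", "le", "les", "lo", "la", "los", "las", "nos", "os"}

def maybe_split(token):
    # Longest clitics first, so e.g. 'los' is tried before 'os'.
    for clitic in sorted(CLITICS, key=len, reverse=True):
        stem = token[: -len(clitic)]
        if token.endswith(clitic) and stem in VERB_FORMS:
            return [stem, clitic]
    return [token]

print(maybe_split("darte"))  # ['dar', 'te'] -- dar + te
print(maybe_split("parte"))  # ['parte']     -- just a noun, no split
```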

dan-zeman commented 8 years ago

Well, the motivation behind the whole mechanism of “multi-word tokens” is that these splittings may in general require a lexicon and are supposed to be done by a lemmatizer rather than an RE-based tokenizer. So the second option above would make things easier for Portuguese-specific tools, but a language-independent tool cannot rely on it anyway. Option 2 would also mean that we postulate a new word form of the verb (and, strictly speaking, of the clitic too) that is only licensed in combination with the clitic. (That could be done in Spanish as well and we could say that the word forms are "vámo" and "nos", but I think it is better to show the real word forms that underlie the contraction. Yes, it is more difficult, because it adds information.)

So I would still slightly prefer option 1 (and if it is used in Portuguese, then I would prefer it in Catalan too). Let's wait a couple of days to see whether others incline towards one or the other option. @ngiordani ? @hectormartinez ? @jnivre ?

jnivre commented 8 years ago

I think this is something that needs to be addressed at the universal level for v2. We have similar stuff going on in many languages, like English "n't" vs "not" for the second element of "don't". Ideally, we should come up with general principles from which the language-specific instantiations can be derived. These principles may or may not refer to the presence of hyphens or apostrophes. Right now, I therefore think the important thing is to adopt a solution that retains enough information so that it can be easily mapped to whatever becomes the standard in v2. Sorry for not being more specific, but my mind is already set on v2, so I regard everything before then as a temporary solution. :)

dan-zeman commented 8 years ago

Good point, @jnivre – we are actually not going to release data before v2, so it does not matter too much what we do in the Portuguese dev branch now. However, @andre-martins has one immediate argument in favor of option 2 :-) We actually do not have the additional info readily available in the data. So unless André or someone else volunteers to provide the list of verbs and their changes, we cannot do option 1.

andre-martins commented 8 years ago

OK, so I'm going with option 2 for the time being. I'll do a pull request shortly on the UD_Portuguese repo. I'm adding x-y tokens, so it should be easy to switch to option 1 later.

dan-zeman commented 8 years ago

OK, thanks. I am leaving the issue open for discussion. It may be used as input for the universal level discussion, and the universal principles will in turn hopefully lead to a final solution here.

livyreal commented 7 years ago

Any decision?

We are also looking into how to treat mesoclitics:

`tratar-se-á` = `tratará` + `se`

The clitic comes inside the verb, splitting it into "root" (infinitive) + clitic + "inflection". In those cases we tokenize the expression as inflected verb + clitic (as above), but maybe there is a better solution. To me, it seems bad to treat it as

`tratar-se-á` = `tratar` + `se` + `á`

since `á` is only an inflection and not a word. My point is that we have two words: "tratará" and "se". At which level will we get this if we tokenize each part of the word as a separate token?

And it is not clear if (and where) we should keep the hyphen.

andre-martins commented 7 years ago

I agree that `tratar-se-á` = `tratar` + `se` + `á` (3 tokens) looks bad. Doing

`tratar-se-á` = `tratará` + `se`

and keeping a range token ("tratar-se-á") looks like the best solution to me. E.g. in https://github.com/UniversalDependencies/UD_Portuguese/blob/master/pt-ud-train.conllu we have:

7-8    far-se-á   _       _
7      fará       fazer   VERB   v-fin|FUT|3S|IND     Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin    0   root
8      -se        se      PRON   pron-pers|F|3S|ACC   Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs   7   dobj

André

dan-zeman commented 7 years ago

Agreed. tratar-se-á should be treated as one multiword token, consisting of the words tratará and se. Which is an argument for solving the other verb-clitic combinations on the level of multiword tokens as well.
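
In code, the agreed treatment amounts to rejoining stem and inflection. A minimal sketch: plain concatenation happens to work for r-final stems such as far-se-á and tratar-se-á, while elided forms like fá-lo-ia would additionally need the restored-form lexicon discussed under option 1.

```python
def split_mesoclisis(mwt):
    """'tratar-se-á' -> ('tratará', 'se'); 'far-se-á' -> ('fará', 'se')"""
    stem, clitic, ending = mwt.split("-")
    # Rejoin the verb stem and its inflection; the clitic becomes its own word.
    return stem + ending, clitic

assert split_mesoclisis("tratar-se-á") == ("tratará", "se")
assert split_mesoclisis("far-se-á") == ("fará", "se")
```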

GPPassos commented 7 years ago

Is this still being considered for a universal-level treatment?

Naturally we're having this kind of issue at https://github.com/own-pt/bosque-UD and we're considering solving it as in option 1, as Dan suggested. If we're being "intrusive" enough with mesoclisis (as in tratar-se-á), it makes sense to be equally "intrusive" with enclisis, for instance marking Fê-lo as Fez + lo, using the underlying word forms (which seems to me consistent with treating words as syntactic words) and removing hyphens from the forms altogether (keeping them only in the multiword tokens).

On the other hand, I've found that in the English corpus don't is still treated as do + n't instead of do + not.

dan-zeman commented 7 years ago

+1 for decomposing fê-lo as fez + lo. It seems also consistent with our evergreen Spanish example vámonos = vamos + nos (OK, just realized that I am repeating myself here, sorry).

In general, I am afraid that it is not being considered for a universal guideline—not now. First, it is language-specific and it is difficult to make universal claims. Second, it is difficult to throw away established tokenization practices in some languages. My hypothesis is that this is the reason behind not doing do + not in English. I would say that the top priority is to be consistent within a language, then among related languages, then with as many other languages as possible. Convincing the English team to follow suit seems to belong to the third level :-)

martinpopel commented 7 years ago

> at the English corpus don't being treated as do + n't instead of do + not.

Lemmas are do + not. This is important for me.

Personally, I prefer if the concatenation of word forms gives the form of the multi-word token. Of course, this is not always possible and other guidelines are needed (and that's the topic of this issue). But in English it is possible.
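
For concreteness, this preference can be stated as a checkable property (a sketch with a hypothetical helper, not an existing validator):

```python
# Within a multiword token, concatenating the word FORMs should give back the
# surface form; English can satisfy this, the Spanish contraction cannot.
def concatenation_holds(mwt_form, word_forms):
    return "".join(word_forms) == mwt_form

print(concatenation_holds("don't", ["do", "n't"]))       # True
print(concatenation_holds("vámonos", ["vamos", "nos"]))  # False
```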

yoavg commented 7 years ago

FWIW, we had a similar discussion in Hebrew UD two weeks ago:

https://github.com/UniversalDependencies/UD_Hebrew/issues/10

I think this should definitely be "universalized" at some point, but let's wait until after the v2 release.

livyreal commented 7 years ago

For "fê-lo" the lemmas are "fazer" + "ele".

If you want to keep "lo" as a token (and not as a lemma!), it makes sense to keep "fê" as a token too, since those forms only occur in these contexts. It is consistent to treat "fê-lo" as "fê + lo" or as "fez + o" ("o" is the accusative form of "ele", like English him/he), but if the tokenization is "fez" + "lo", we would be using two different criteria to tokenize the two parts of the same expression.

I agree with @martinpopel: lemmas are actually what matters, and the tokens should give back the multiword token's form as much as possible (not possible, e.g., in the case of mesoclitics in Portuguese). So I'm for:

1-2    Fê-lo   _       _
1      Fê     fazer   VERB
2      lo      ele     PRON

which seems to be what @andre-martins was arguing so far.

hectormartinez commented 7 years ago

Since we are aiming for harmonization across the Romance treebanks, here's my two cents.

Clitics are (also) hyphenated in Catalan, but in the current version the clitic keeps the hyphen:

1      fer    fer   VERB
2      -ho    el    PRON

I have to insert the multiword tokens (fer-ho) in this case, but I guess I should also remove the hyphen from the clitics, right?

Longer clitic chains can have both hyphens and apostrophes, like the reflexive version of the current example, which would be fer-s'ho.
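
For illustration, a hypothetical splitter for such chains (assuming, per the discussion above, that neither hyphens nor apostrophes stay in the word forms) could be as simple as:

```python
import re

# Hypothetical splitter for Catalan clitic chains, not an existing tool;
# it assumes hyphens and apostrophes are dropped from the word forms.
def split_clitic_chain(token):
    return [part for part in re.split(r"[-']", token) if part]

print(split_clitic_chain("fer-ho"))    # ['fer', 'ho']
print(split_clitic_chain("fer-s'ho"))  # ['fer', 's', 'ho']
```

Whether the reduced form s or the full form es should then serve as the FORM of the reflexive is exactly the kind of detail a harmonized guideline would need to settle.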

arademaker commented 7 years ago

For UD v2 of UD_Portuguese we will follow the decomposition of fê-lo as fez + lo. We can revise it later if needed. No hyphens in the parts, only in the multiword token.
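
For illustration, the adopted convention could be emitted like this (a toy sketch, not the actual conversion script; the lexicon fragment is illustrative):

```python
# Hyphens survive only on the multiword-token line, and the verb form is
# restored (fê -> fez); the RESTORED mapping is an illustrative fragment.
RESTORED = {"fê": "fez"}

def to_conllu(mwt, verb_lemma, pron_lemma):
    verb, clitic = mwt.split("-", 1)
    verb = RESTORED.get(verb, verb)
    return "\n".join([
        f"1-2\t{mwt}\t_\t_",
        f"1\t{verb}\t{verb_lemma}\tVERB",
        f"2\t{clitic}\t{pron_lemma}\tPRON",
    ])

print(to_conllu("fê-lo", "fazer", "ele"))
```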