UniversalDependencies / UD_Irish-IDT

Irish data

Splitting multi-word tokens in Irish #142

Open tlynn747 opened 2 years ago

tlynn747 commented 2 years ago

We've come up against this a few times in the past - and while I'm hesitant about splitting, because I can foresee some issues with tokenisation, I'm interested to know if splitting certain tokens has helped the other Celtic Languages in any way...

Specifically I'm referring to:

Inflected prepositions: liom, leat, leis... agam, agat, aige... fúm, fút, faoi, etc.
Inflected copular forms: Más (má + is), sea (is + ea), etc.

Any opinions on the pros and cons of this from a parsing / data management / linguistic perspective? @kscanne @ftyers @colinbatchelor @jheinecke @michealjohnny @eihe

colinbatchelor commented 2 years ago

I haven't done a proper experiment, but it feels as if it makes annotation easier and means that I have to add fewer special cases to the automated checks. Admittedly I could have achieved this by declaring that leis was PRON rather than ADP but that would mean having to add a new feature and isn't consistent with how UD wants to handle these in general anyway.

It does help with fused aspect marker + possessive constructions like (not yet put into corpus sorry) thathar air a bhith gam fàgail "They have been left" where gam divides neatly into ag and an so ag is linked to the VN with case and an with obj.

Where it makes a really big difference is the copular forms 's e and 's ann which are often written se and sann, so all the variants are treated consistently.

I suppose a big caveat is that the only ambiguous word for Scottish Gaelic that the POS tagger might get wrong is air, which can be either plain air or air + e. I think I've seen it tagged wrongly once in ARCOSG. I don't know enough about Irish to know whether there are further ambiguous words of that sort.

tlynn747 commented 2 years ago

Do you feel it makes annotation easier in that you don't need to include additional info in the morph features column?

For our inflected prepositions we have : UPOS=ADP, XPOS=Prep, Feats: Gender=X|Number=N|Person=K

If we split them, the feats would still be captured in the pronoun token. e.g. liom = le (ADP) + UPOS=PRON, XPOS=Pers, Feats: Number=Sing|Person=1

On a similar note, article-inflected prepositions are labelled UPOS=ADP, XPOS=Art, Feats: Number=X|PronType=Art e.g. san would become i (ADP) + an (DET)

Regular prepositions are UPOS=ADP, XPOS=Simp (simple)/ Cmpd (compound)
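To make the proposed split concrete, here is a minimal Python sketch of how liom could be written out as a CoNLL-U multi-word token, with the person/number features carried on the pronoun as described above. It is illustrative only and not part of any IDT tooling: the function name `split_mwt` is hypothetical, the pronoun form and lemma `mé` are an assumption, and HEAD and DEPREL are left as `_` because they depend on the sentence.

```python
def split_mwt(start_id, surface, parts):
    """Return CoNLL-U lines for one multi-word token.

    parts: list of (form, lemma, upos, xpos, feats), one per syntactic word.
    HEAD, DEPREL, DEPS and MISC are left as "_" for an annotator or parser.
    """
    end_id = start_id + len(parts) - 1
    # Range line: only ID and FORM are filled in; the other eight columns are "_".
    lines = ["\t".join([f"{start_id}-{end_id}", surface] + ["_"] * 8)]
    for offset, (form, lemma, upos, xpos, feats) in enumerate(parts):
        columns = [str(start_id + offset), form, lemma, upos, xpos, feats,
                   "_", "_", "_", "_"]
        lines.append("\t".join(columns))
    return lines


# liom = le (ADP, XPOS=Simp) + pronoun (PRON, XPOS=Pers) carrying the features.
liom = split_mwt(3, "liom", [
    ("le", "le", "ADP", "Simp", "_"),
    ("mé", "mé", "PRON", "Pers", "Number=Sing|Person=1"),  # lemma is an assumption
])
print("\n".join(liom))
```

The starting ID 3 is arbitrary; in a real sentence every later token would also have to be renumbered.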

Yes, in Irish there are ambiguous ones, and that's one of the factors that makes me hesitant from the standpoint of tokeniser/tagger training:

leis can mean 'with him/it' or 'with' (selecting for a prepositional object)
ann can mean 'there' or 'in it'
faoi can mean 'about/under him' or 'about/under' (selecting for a prepositional object)

I don't have an extensive list to hand, but they're frequent enough - in both senses - to cause issues.
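One way to picture the tokenisation problem is a closed lookup table in which the ambiguous forms are flagged rather than split blindly. The sketch below is a rough illustration under stated assumptions: the names `SPLITS`, `AMBIGUOUS` and `propose_split` are hypothetical, the table covers only forms mentioned in this thread, and the feature values beyond the liom example are filled in from the standard paradigm. It does not resolve the ambiguity; a real tokeniser or tagger would need sentence context for that.

```python
# Hypothetical table: surface form -> (preposition lemma, features of the pronoun part).
SPLITS = {
    "liom": ("le", "Number=Sing|Person=1"),
    "leat": ("le", "Number=Sing|Person=2"),
    "leis": ("le", "Gender=Masc|Number=Sing|Person=3"),
    "agam": ("ag", "Number=Sing|Person=1"),
    "agat": ("ag", "Number=Sing|Person=2"),
    "aige": ("ag", "Gender=Masc|Number=Sing|Person=3"),
    "fúm": ("faoi", "Number=Sing|Person=1"),
    "fút": ("faoi", "Number=Sing|Person=2"),
    "faoi": ("faoi", "Gender=Masc|Number=Sing|Person=3"),
    "ann": ("i", "Gender=Masc|Number=Sing|Person=3"),
}

# Forms that also have a reading that should NOT be split (see the list above).
AMBIGUOUS = {"leis", "ann", "faoi"}


def propose_split(token):
    """Return (preposition, feats, needs_context), or None if the form is not in the table."""
    key = token.lower()
    entry = SPLITS.get(key)
    if entry is None:
        return None
    prep, feats = entry
    return prep, feats, key in AMBIGUOUS


for word in ["liom", "leis", "madra"]:
    print(word, propose_split(word))
```

Anything returned with `needs_context` set to `True` would have to go to a context-aware model or a human check instead of being split automatically.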

I can see the benefit of splitting the copula forms, although the theoretical linguistic research on the morphology split is not so solid. Maybe the Irish version of the Christian Brothers Grammar is more useful here than the English one!

jheinecke commented 2 years ago

In Welsh the situation is more complicated, at least for inflected prepositions: they can, but need not, be followed by the corresponding pronoun, so you can have Mae o'n meddwl amdanat "He thinks of you" alongside Mae o'n meddwl amdanat ti. After a discussion here on GitHub, it was decided to annotate them as multi-word tokens:

1   Mae bod 0   root
2   o   o   1   nsubj
3   'n  yn  1   aux
4   meddwl  meddwl  1   xcomp
5-6 amdanat _   _   _
5   am  am  6   case
6   ti  ti  4   obl
7   ti  ti  6   compound:redup

If the overt pronoun is absent, the annotation looks like this:

1   Mae bod 0   root
2   o   o   1   nsubj
3   'n  yn  1   aux
4   meddwl  meddwl  1   xcomp
5-6 amdanat _   _   _
5   am  am  6   case
6   ti  ti  4   obl

The compound:redup may not look nice, but at least the rest of the tree is identical in both cases. If the inflected prepositions were not annotated as MWTs, then we would have different trees depending on the presence of the (optional) pronoun. Since the inflected prepositions are MWTs, Person, Number (and Gender) are marked on the pronoun.
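The claim that the two trees coincide can be checked mechanically. The following Python sketch uses the simplified five-column rows from this comment (ID, FORM, LEMMA, HEAD, DEPREL), drops the `compound:redup` dependent and compares what is left. The helper name `edges` is hypothetical, and the head of `am` is taken to be `ti` (token 6), which is what a `case` relation implies.

```python
WITH_PRONOUN = """\
1 Mae bod 0 root
2 o o 1 nsubj
3 'n yn 1 aux
4 meddwl meddwl 1 xcomp
5-6 amdanat _ _ _
5 am am 6 case
6 ti ti 4 obl
7 ti ti 6 compound:redup
"""

WITHOUT_PRONOUN = """\
1 Mae bod 0 root
2 o o 1 nsubj
3 'n yn 1 aux
4 meddwl meddwl 1 xcomp
5-6 amdanat _ _ _
5 am am 6 case
6 ti ti 4 obl
"""


def edges(rows, ignore=("compound:redup",)):
    """Return the set of (id, form, head, deprel) edges, skipping range lines and ignored relations."""
    result = set()
    for line in rows.strip().splitlines():
        tok_id, form, _lemma, head, deprel = line.split()
        if "-" in tok_id or deprel in ignore:
            continue  # MWT range line, or a relation we chose to ignore
        result.add((int(tok_id), form, int(head), deprel))
    return result


same = edges(WITH_PRONOUN) == edges(WITHOUT_PRONOUN)
print(same)
```

The comparison by ID only works here because the doubled pronoun is the last token; if it occurred mid-sentence the IDs would have to be aligned first.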

Tokenization (with UDPipe 1.2, for example) works fine with this, as do tagging and parsing (using UdpipeFuture + XLM-RoBERTa).

kscanne commented 2 years ago

Hi Teresa, we chatted about this already but I'll record my thoughts here also. I chose to split the inflected prepositions for Manx based on the intuition that it would result in better parsing accuracy (I think I discussed with Fran as well), but like Colin I haven't done the experiment to verify this intuition.

Colin's comment about fewer special cases in the QA scripts is really important too, I think. My attempt at QA for the Goidelic languages is here:

https://github.com/kscanne/grammatach/tree/main/grammatach

I claim the total length of this code is a rough proxy for how cleanly the treebank is capturing the grammar of the language (shorter is better). This is why I was in favor of having fewer "fixed" and "compound" relations for Irish... each of those needs to be treated as a special case. Same deal with the inflected prepositions... by splitting them in Manx, more or less every ADP has a nominal dependent with relation "case". It's only "more or less" because I originally used "fixed" for a few set phrases, which I now regret! (And will probably fix for the next release).
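As a rough illustration of why the split shortens this kind of QA code, here is a small Python sketch of a single rule: every ADP should attach to a nominal with relation `case`. This is not the grammatach code; the function name `adp_exceptions`, the set of nominal tags, and the toy annotation of Éist liom ("Listen to me"), including the `obl` relation on the unsplit form, are assumptions made for the example.

```python
NOMINAL = {"NOUN", "PROPN", "PRON"}


def adp_exceptions(sentence):
    """Return the ADPs that are not attached to a nominal head via `case`.

    sentence: list of (id, form, upos, head, deprel) tuples, syntactic words only.
    """
    upos_by_id = {tok_id: upos for tok_id, _form, upos, _head, _rel in sentence}
    problems = []
    for tok_id, form, upos, head, deprel in sentence:
        if upos != "ADP":
            continue
        if deprel != "case" or upos_by_id.get(head) not in NOMINAL:
            problems.append((tok_id, form, deprel))
    return problems


# Toy annotations, for illustration only.
split = [
    (1, "Éist", "VERB", 0, "root"),
    (2, "le", "ADP", 3, "case"),
    (3, "mé", "PRON", 1, "obl"),
]
unsplit = [
    (1, "Éist", "VERB", 0, "root"),
    (2, "liom", "ADP", 1, "obl"),
]

print(adp_exceptions(split))    # no exceptions under the split analysis
print(adp_exceptions(unsplit))  # liom needs a special case
```

With the unsplit analysis, each inflected preposition would need its own exemption from a rule like this one.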

Like Irish, Manx has a handful of ambiguous cases (for example, Manx "er" corresponds to Irish "ar" and the 3rd person masculine "air") and this means the tokenizer needs to be smart. The models I've trained with UDPipe do pretty well, but aren't perfect, and it's a bit of a pain to clean up these multiword tokens when post-editing UDPipe output (since renumbering of tokens is required).
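The renumbering step is mechanical enough to script. The Python sketch below splits one token in a sentence into a multi-word token and shifts all later IDs and HEAD values accordingly. It is a minimal sketch under several assumptions: the name `split_token` is hypothetical, the sentence has no existing range lines or empty nodes, the enhanced DEPS column is not remapped, the last part inherits the original attachment, and earlier parts attach to it as `case`.

```python
def split_token(rows, target_id, parts):
    """Split the word with ID target_id into len(parts) words and renumber.

    rows: list of 10-column CoNLL-U rows (lists of str) with plain integer IDs.
    parts: list of (form, lemma, upos) for the new syntactic words.
    """
    shift = len(parts) - 1
    last_id = target_id + shift

    def remap(value):
        # IDs/heads before the split point (and 0 for root) stay; the rest move up.
        # A head that pointed at the original token ends up on the last part.
        if value == "_":
            return value
        n = int(value)
        return str(n if n < target_id else n + shift)

    out = []
    for row in rows:
        if int(row[0]) != target_id:
            out.append([remap(row[0])] + row[1:6] + [remap(row[6])] + row[7:])
            continue
        # Range line keeps the surface form; all other columns are "_".
        out.append([f"{target_id}-{last_id}", row[1]] + ["_"] * 8)
        for offset, (form, lemma, upos) in enumerate(parts):
            new_id = target_id + offset
            if new_id == last_id:
                head, deprel = remap(row[6]), row[7]  # inherit original attachment
            else:
                head, deprel = str(last_id), "case"   # assumption: preposition part
            out.append([str(new_id), form, lemma, upos, "_", "_",
                        head, deprel, "_", "_"])
    return out


# Toy Irish example, "Éist liom anois" ("Listen to me now"); annotation is illustrative.
rows = [
    ["1", "Éist", "éist", "VERB", "_", "_", "0", "root", "_", "_"],
    ["2", "liom", "le", "ADP", "_", "_", "1", "obl", "_", "_"],
    ["3", "anois", "anois", "ADV", "_", "_", "1", "advmod", "_", "_"],
]
fixed = split_token(rows, 2, [("le", "le", "ADP"), ("mé", "mé", "PRON")])
for row in fixed:
    print("\t".join(row))
```

Applying this to several tokens in one sentence would need to be done right to left, or with IDs recomputed after each call, since every split shifts the tokens that follow it.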

jheinecke commented 2 years ago

Thanks Kevin for your post on Manx. It reminded me that I forgot to mention ambiguous forms: in fact, I'm not aware of any inflected prepositions in Welsh which are homographs of other words, so tokenisation is not a problem. (That does not mean that there aren't any at all...)

ftyers commented 2 years ago

Hey there (sorry for the delayed response)! For Breton, in terms of parsing performance I haven't tried it out. In general I like splitting them, in many cases they look like contractions to me. But a parsing experiment would be very interesting to try out.

michealjohnny commented 2 years ago

Heya, also apologising for coming to this late.

While I would typically agree about not adding too many exceptions or special cases to codebases, this is a closed set, so at least the whole set can be covered, as long as the aforementioned ambiguous forms are catered for (e.g. faoi).

With regard to a parsing experiment, I couldn't agree more. Ideally (I believe) we want people of all abilities to be able to use the parsed data for whatever purpose - be it linguistic analysis, NLP tasks, or something else - without needing too much prior knowledge of "gotchas" or of specific ways of handling prepositions.

The only other thing I would say is the same thing I say to myself with my corpus processing and analysis: can I do this at scale? e.g. Don't rely on too many manual checks, or post-edits, or whatever.