UniversalDependencies / UD_Irish-IDT

Irish data

Lemma of 'na' #85

Closed laurenCassidy closed 3 years ago

laurenCassidy commented 3 years ago

The Irish definite article encodes information about number, gender and case. Should this be normalised in the lemma column? (i.e. an, na -> an) There is no consensus among UD treebanks. From a quick scan of some of the languages that have similar morphological information in the definite article:

Breton: an, al, ar -> an
Catalan: el, la, l', els, les -> el
French: le, la, l', les -> le
Greek: ο, η, το, οι, οι, τα -> ο
Scottish Gaelic: an, na -> an
Spanish: el, la, los, las -> el

However, in the Manx and some German treebanks each type of article has a distinct lemma.

kscanne commented 3 years ago

For what it's worth, I followed the Irish example when deciding on this for Manx.

tlynn747 commented 3 years ago

> For what it's worth, I followed the Irish example when deciding on this for Manx.

I had suspected that :)

I took my lead from Elaine's NCI POS-tagged text. I'm not opposed to changing the lemma to 'an'. It makes sense given the morphological features, and it would be better for cross-lingual consistency.

kscanne commented 3 years ago

Would be better to align with gd certainly, and I'll switch Manx as well once the change is made here.

Note there's exactly one "na" in the corpus that's not DET in sentence 1629... that lemma should remain as "i".
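The change discussed above could be sketched roughly as follows. This is only an illustration, assuming the treebank is in standard 10-column CoNLL-U; the function name `normalise_article_lemma` is hypothetical and is not a script from this repository. It rewrites the lemma only when the token is tagged DET, so a non-DET "na" keeps whatever lemma it already has (for example "i").

```python
def normalise_article_lemma(conllu_text):
    """Set LEMMA to 'an' for DET tokens whose lemma is 'na'.

    Hypothetical helper: assumes standard CoNLL-U columns
    (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
    """
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Token lines have exactly 10 tab-separated columns;
        # comment lines and blank lines pass through unchanged.
        if len(cols) == 10 and not line.startswith("#"):
            lemma, upos = cols[2], cols[3]
            if upos == "DET" and lemma == "na":
                cols[2] = "an"
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```

Restricting the rewrite to UPOS=DET is what protects the one non-article "na" mentioned above; any other token with lemma "na" is left for manual review.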

kscanne commented 3 years ago

@laurenCassidy: do you think you'll try making this change soon? I'm reluctant to keep progressing on the noun features until this is sorted out, for fear of a huge merge conflict!

laurenCassidy commented 3 years ago

Thanks @kscanne I can hopefully make the change today - I had to check the correct way to do it as I have never done it before. So my plan is to fork, make the changes and then submit a pull request... If I have any problems I will let you know so that you can go ahead and I can try again when you are finished!

laurenCassidy commented 3 years ago

@kscanne I didn't get a chance to do this today so you can go ahead with your changes and let me know when you are done :) thanks

kscanne commented 3 years ago

I'll go ahead and make this change once my next PR is merged.

kscanne commented 3 years ago

Maybe I could expand the scope here a bit? As I'm looking at the data, the genitive feminine "na" has Case=Gen (Conradh na Gaeilge) but the genitive masculine "an" generally does not have this feature. @tlynn747: Any reason to treat masculine/feminine differently? Worth going through and adding Case=Gen to the masculine examples?
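One way to find candidates for this kind of review is sketched below. It is a heuristic only, and it rests on two assumptions that are not confirmed in this thread: that the article attaches to its noun via HEAD, and that the noun itself already carries Case=Gen in FEATS. The function name `articles_missing_gen` is hypothetical. Anything it flags would still need checking by hand.

```python
def articles_missing_gen(conllu_text):
    """List (token id, article form, head form) where the head has
    Case=Gen but the article an/na does not.

    Hypothetical helper; assumes 10-column CoNLL-U with sentences
    separated by blank lines.
    """
    flagged = []
    for block in conllu_text.strip().split("\n\n"):
        rows = [l.split("\t") for l in block.split("\n")
                if l and not l.startswith("#")]
        rows = [r for r in rows if len(r) == 10]
        by_id = {r[0]: r for r in rows}
        for r in rows:
            # Only look at articles with surface form an/na.
            if r[3] != "DET" or r[1].lower() not in {"an", "na"}:
                continue
            head = by_id.get(r[6])
            if head is None:
                continue
            det_feats = r[5].split("|")
            head_feats = head[5].split("|")
            if "Case=Gen" in head_feats and "Case=Gen" not in det_feats:
                flagged.append((r[0], r[1], head[1]))
    return flagged
```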

tlynn747 commented 3 years ago

I'm fine with the expansion of scope. Like most things, this feature inclusion comes from the original gold POS-tagged corpus:

Stiúrthóir stiúrthóir+Noun+Masc+Com+Sg
na na+Art+Pl+Def
nIonchúiseamh

v

Conradh conradh+Noun+Masc+Com+Sg
na na+Art+Gen+Sg+Def+Fem
Gaeilge Gaeilge+Prop+Noun+Fem+Gen+Sg+DefArt

Elaine might have specified a reason for this decision in her thesis, but I don't see any harm in including it if it helps the parser.

kscanne commented 3 years ago

Ok, just fixed all of these manually. Here is the final distribution of tags on articles an/na:

3941 Definite=Def|Number=Sing|PronType=Art
1096 Definite=Def|Number=Plur|PronType=Art
1014 Case=Gen|Definite=Def|Gender=Fem|Number=Sing|PronType=Art
758 Case=Gen|Definite=Def|Gender=Masc|Number=Sing|PronType=Art
449 Case=Gen|Definite=Def|Number=Plur|PronType=Art

I didn't add Gender to genitive plurals since it's not reflected in the surface forms in any way, and the info is in the NOUN. Will submit this in next PR.
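A distribution like the one above could be recomputed with something along these lines. This is a sketch under the assumption of standard 10-column CoNLL-U, not the command actually used to produce those counts; `article_feature_counts` is a hypothetical name.

```python
from collections import Counter


def article_feature_counts(conllu_text):
    """Count FEATS strings on DET tokens whose surface form is an/na.

    Hypothetical helper; assumes standard CoNLL-U columns.
    """
    counts = Counter()
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Skip comments, blank lines, and anything that is not a token line.
        if len(cols) != 10 or line.startswith("#"):
            continue
        form, upos, feats = cols[1].lower(), cols[3], cols[5]
        if upos == "DET" and form in {"an", "na"}:
            counts[feats] += 1
    return counts
```

Iterating over `counts.most_common()` and printing each count followed by its FEATS string gives output in the same shape as the list above.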