Closed laurenCassidy closed 3 years ago
For what it's worth, I followed the Irish example when deciding on this for Manx.
I had suspected that :)
I took my lead from Elaine's NCI POS-tagged text. I'm not opposed to changing the lemma to 'an'. It makes sense given the morphological features, and it would be better to keep cross-lingual consistency.
Would be better to align with gd certainly, and I'll switch Manx as well once the change is made here.
Note there's exactly one "na" in the corpus that's not DET in sentence 1629... that lemma should remain as "i".
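The lemma change discussed above (DET "na" -> "an", leaving the one non-DET "na" alone) could be scripted roughly as follows. This is only a sketch, not how the change was actually made: it assumes standard 10-column, tab-separated CoNLL-U, and the function names are made up for illustration.

```python
def normalise_article_lemma(line: str) -> str:
    """Rewrite lemma 'na' to 'an' on a single CoNLL-U token line, articles only."""
    if not line or line.startswith("#"):
        return line  # comment or sentence separator: leave untouched
    cols = line.split("\t")
    if len(cols) != 10:
        return line  # not a regular token line
    lemma, upos, feats = cols[2], cols[3], cols[5].split("|")
    # Only touch definite articles; a non-DET "na" keeps its own lemma.
    if upos == "DET" and lemma == "na" and "PronType=Art" in feats:
        cols[2] = "an"
    return "\t".join(cols)


def normalise_conllu(text: str) -> str:
    """Apply the lemma rewrite to every line of a CoNLL-U document."""
    return "\n".join(normalise_article_lemma(l) for l in text.split("\n"))
```

Checking both the UPOS and `PronType=Art` is a deliberately conservative choice, so that any other token that happens to have the form "na" is left as it is.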
@laurenCassidy: do you think you'll try making this change soon? I'm reluctant to keep progressing on the noun features until this is sorted out, for fear of a huge merge conflict!
Thanks @kscanne I can hopefully make the change today - I had to check the correct way to do it as I have never done it before. So my plan is to fork, make the changes and then submit a pull request... If I have any problems I will let you know so that you can go ahead and I can try again when you are finished!
@kscanne I didn't get a chance to do this today so you can go ahead with your changes and let me know when you are done :) thanks
I'll go ahead and make this change once my next PR is merged.
Maybe I could expand the scope here a bit? As I'm looking at the data, the genitive feminine "na" has Case=Gen (Conradh na Gaeilge) but the genitive masculine "an" generally does not have this feature. @tlynn747: Any reason to treat masculine/feminine differently? Worth going through and adding Case=Gen to the masculine example?
I'm fine with the expansion of scope. Like most things, this feature inclusion comes from the original gold POS-tagged corpus:
Stiúrthóir stiúrthóir+Noun+Masc+Com+Sg
na na+Art+Pl+Def
nIonchúiseamh
v
Conradh conradh+Noun+Masc+Com+Sg
na na+Art+Gen+Sg+Def+Fem
Gaeilge Gaeilge+Prop+Noun+Fem+Gen+Sg+DefArt
Elaine might have specified a reason for this decision in her thesis, but I don't see any harm in including it if it helps the parser.
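For locating the masculine cases mentioned above, one rough heuristic is to list articles that lack Case=Gen while the token they attach to carries it. This is a sketch under assumptions: it presumes 10-column CoNLL-U with sentences separated by blank lines and that the article's HEAD is the noun it modifies. The function name is hypothetical, and the hits would still need checking by hand.

```python
def find_articles_missing_gen(text: str):
    """Return (sentence_index, token_id, form) for articles without Case=Gen
    whose head token has Case=Gen."""
    hits = []
    for sent_idx, block in enumerate(text.strip().split("\n\n")):
        rows = [l.split("\t") for l in block.split("\n")
                if l and not l.startswith("#")]
        rows = [r for r in rows if len(r) == 10]
        # Map token ID -> set of features, so heads can be looked up.
        feats_by_id = {r[0]: set(r[5].split("|")) for r in rows}
        for r in rows:
            tok_id, form, upos, head = r[0], r[1], r[3], r[6]
            feats = feats_by_id[tok_id]
            if upos != "DET" or "PronType=Art" not in feats:
                continue
            head_feats = feats_by_id.get(head, set())
            if "Case=Gen" in head_feats and "Case=Gen" not in feats:
                hits.append((sent_idx, tok_id, form))
    return hits
```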
Ok, just fixed all of these manually. Here is the final distribution of tags on articles an/na:
3941 Definite=Def|Number=Sing|PronType=Art
1096 Definite=Def|Number=Plur|PronType=Art
1014 Case=Gen|Definite=Def|Gender=Fem|Number=Sing|PronType=Art
758 Case=Gen|Definite=Def|Gender=Masc|Number=Sing|PronType=Art
449 Case=Gen|Definite=Def|Number=Plur|PronType=Art
I didn't add Gender to genitive plurals since it's not reflected in the surface forms in any way, and the info is in the NOUN. Will submit this in next PR.
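A distribution like the one above can be recomputed with a few lines of Python. This is a minimal sketch, not the script actually used here: it assumes 10-column CoNLL-U input and counts the full FEATS string on every DET token with form an/na (case-insensitive) and PronType=Art. The function name is hypothetical.

```python
from collections import Counter


def article_feature_counts(text: str) -> Counter:
    """Count FEATS strings on article tokens 'an'/'na' in CoNLL-U text."""
    counts = Counter()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip comments and sentence separators
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip anything that is not a regular token line
        form, upos, feats = cols[1].lower(), cols[3], cols[5]
        if upos == "DET" and form in {"an", "na"} and "PronType=Art" in feats.split("|"):
            counts[feats] += 1
    return counts
```

Calling `article_feature_counts(text).most_common()` on the treebank file contents gives the rows sorted by frequency, in the same shape as the list above.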
The Irish definite article encodes information about number, gender and case. Should this be normalised in the lemma column? (i.e. an, na -> an) There is no consensus among UD treebanks. From a quick scan of some of the languages that have similar morphological information in the definite article:
Breton: an, al, ar -> an
Catalan: el, la, l', els, les -> el
French: le, la, l', les -> le
Greek: ο, η, το, οι, οι, τα -> ο
Scottish Gaelic: an, na -> an
Spanish: el, la, los, las -> el
However, in the Manx and some German treebanks each type of article has a distinct lemma.