UniversalDependencies / UD_Irish-IDT

Irish data

Is Definite=Ind needed? #156

Closed by kscanne 1 year ago

kscanne commented 1 year ago

I've been working under the assumption that definite nouns should get the feature Definite=Def, and all others simply leave this feature out (see PR #88 where I added Definite=Def to 13,000+ tokens). But I see there are 29 words annotated with Definite=Ind in the treebank. Would it be simpler to just remove these features? I don't see any rhyme or reason to why these particular words were given Definite=Ind vs. all of the other indefinite nouns. Most are surface tokens "gnóthaí" or "oibre" which suggests some machine-learned weirdness.
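A count like the one above can be reproduced with a short script. This is only a minimal sketch, assuming the treebank is in standard 10-column CoNLL-U with FEATS in the sixth column; the function names and the sample rows (including their annotations) are made up for illustration and are not taken from the treebank.

```python
from collections import Counter


def parse_feats(feats_field):
    """Parse a CoNLL-U FEATS column ('_' or 'A=B|C=D') into a dict."""
    if feats_field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats_field.split("|"))


def count_feature(conllu_text, name, value):
    """Count surface forms of tokens whose FEATS contain name=value."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if parse_feats(cols[5]).get(name) == value:
            counts[cols[1]] += 1  # cols[1] is the surface FORM
    return counts


# Hypothetical rows, invented for this example only.
SAMPLE_ROWS = [
    ("1", "gnóthaí", "gnó", "NOUN", "_", "Definite=Ind|Number=Plur", "0", "root", "_", "_"),
    ("2", "oibre", "obair", "NOUN", "_", "Case=Gen|Definite=Ind|Number=Sing", "1", "nmod", "_", "_"),
    ("3", "fear", "fear", "NOUN", "_", "Definite=Def|Number=Sing", "1", "nmod", "_", "_"),
    ("4", "teach", "teach", "NOUN", "_", "Number=Sing", "1", "nmod", "_", "_"),
]
SAMPLE = "# sent_id = made-up-example\n" + "\n".join("\t".join(r) for r in SAMPLE_ROWS)

if __name__ == "__main__":
    for form, n in count_feature(SAMPLE, "Definite", "Ind").most_common():
        print(form, n)
```

Grouping by surface form is what makes a skew toward a couple of tokens easy to spot.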

There's a note in the docs about words inflecting to show indefiniteness but I'm not sure what that means: https://universaldependencies.org/ga/feat/Definite.html

tlynn747 commented 1 year ago

I've traced this back to Ui Dhonnchadha's original POS-tagged corpus that was used as a basis for the treebank.

e.g. cionn in sent_id = 805: D'éirigh thar cionn léi

D' do+Part+Vb
éirigh éirigh+Verb+VI+PastInd+Len
thar thar+Prep+Simp
cionn ceann+Noun+Fem+Dat+Sg+Idf
léi le+Pron+Prep+3P+Sg+Fem
.

The tags weren't questioned, and every Idf was mapped to Definite=Ind: https://github.com/fosterjen/Irish-Universal-Dependency-Treebank/blob/master/scripts/mapping.txt

But yes, as you've highlighted, it's not consistently labelled. For cross-lingual purposes, however, would it not be better to map all nouns that aren't definite to indefinite?

kscanne commented 1 year ago

Thanks for tracking that down — makes sense.

I looked at the Definite feature in several other treebanks, and it appears to be used for determiners ("a" vs. "the" in English) rather than nouns, so I'm not sure there's much cross-linguistic value in adding Definite=Ind everywhere.

rueter commented 1 year ago

You might want to look at how the Scandinavian languages deal with their definite and indefinite nouns. They are morphologically marked for plural and definiteness, and yet both Number=Sing and Definite=Ind are found in features.

tlynn747 commented 1 year ago

OK. I'm happy for the nouns to be labelled with morphological features according to their marking, i.e. no marking, no feature.
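For anyone wanting to apply that outcome, dropping the stray Definite=Ind values from nouns could look roughly like the following. This is a hedged sketch rather than the script actually used on the treebank: it assumes standard 10-column CoNLL-U, and the function names and the default restriction to UPOS NOUN are assumptions made for this example.

```python
def drop_feature(feats_field, name, value):
    """Return a FEATS string with name=value removed ('_' if nothing is left)."""
    if feats_field == "_":
        return "_"
    kept = [pair for pair in feats_field.split("|") if pair != f"{name}={value}"]
    return "|".join(kept) if kept else "_"


def strip_feature(conllu_text, name="Definite", value="Ind", upos="NOUN"):
    """Remove name=value from the FEATS of every token line with the given UPOS."""
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Only touch well-formed token lines; comments and blanks pass through.
        if len(cols) == 10 and not line.startswith("#") and cols[3] == upos:
            cols[5] = drop_feature(cols[5], name, value)
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    # Hypothetical token line, invented for this example only.
    row = ["1", "oibre", "obair", "NOUN", "_",
           "Case=Gen|Definite=Ind|Number=Sing", "0", "root", "_", "_"]
    print(strip_feature("\t".join(row)))
```

Leaving the remaining features in their original order keeps the FEATS column alphabetically sorted, provided it was sorted to begin with.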