UniversalDependencies / UD_Irish-IDT

Definite=Def feature missing in nouns #43

Closed. tlynn747 closed this issue 3 years ago

tlynn747 commented 3 years ago

Need to review nouns for this missing feature: Definite=Def

e.g. sentence 1332 (mbord)

1 Cuirtear cuir VERB VTI Mood=Ind|Tense=Pres|Voice=Auto 0 root
2 bánéadach bánéadach NOUN Noun Case=NomAcc|Gender=Masc|Number=Sing 1 obj
3 glan glan NOUN Noun Case=NomAcc|Gender=Masc|Number=Sing 2 amod
4 ar ar ADP Simp 6 case
5 an an DET Art Definite=Def|Number=Sing|PronType=Art 6 det _
6 mbord bord NOUN Noun Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing 1 obl
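A check for this kind of gap might look roughly like the following. This is only a sketch, not the script used on the treebank: the tokens are typed in by hand from the example above, and the names `NOMINAL`, `parse_feats` and `missing_definite` are made up for illustration. It flags any noun that heads a word with PronType=Art but does not itself carry Definite=Def.

```python
# Tokens from the example above as (id, form, upos, feats, head, deprel).
TOKENS = [
    (1, "Cuirtear", "VERB", "Mood=Ind|Tense=Pres|Voice=Auto", 0, "root"),
    (2, "bánéadach", "NOUN", "Case=NomAcc|Gender=Masc|Number=Sing", 1, "obj"),
    (3, "glan", "NOUN", "Case=NomAcc|Gender=Masc|Number=Sing", 2, "amod"),
    (4, "ar", "ADP", "", 6, "case"),
    (5, "an", "DET", "Definite=Def|Number=Sing|PronType=Art", 6, "det"),
    (6, "mbord", "NOUN", "Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing", 1, "obl"),
]

# Assumption: these UPOS tags are what counts as a "noun" for this check.
NOMINAL = {"NOUN", "PROPN"}


def parse_feats(feats):
    """Turn 'A=x|B=y' into {'A': 'x', 'B': 'y'}; an empty string gives {}."""
    return dict(kv.split("=", 1) for kv in feats.split("|") if "=" in kv)


def missing_definite(tokens):
    """Forms of nouns that head an article but lack Definite=Def."""
    by_id = {t[0]: t for t in tokens}
    flagged = []
    for _tid, _form, _upos, feats, head, _deprel in tokens:
        if parse_feats(feats).get("PronType") != "Art" or head not in by_id:
            continue
        head_tok = by_id[head]
        if head_tok[2] in NOMINAL and parse_feats(head_tok[3]).get("Definite") != "Def":
            flagged.append(head_tok[1])
    return flagged


print(missing_definite(TOKENS))  # ['mbord']
```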

kscanne commented 3 years ago

These are also missing from cases which are definite but don't have the definite article, like "Rannóg Dóiteáin Nua-Eabhrac" in sentence 934.

kscanne commented 3 years ago

Working on a (huge) patch for this. For the record, this looks like it can be 100% automated, presuming the UPOS tags and deprels are all correct. Following the definitions in the Christian Brothers' grammar (pp. 25-26), here are the rules:

(1) any nominal head of a word with PronType=Art, or any nominal following "cén", is definite
(2) proper nouns are definite
(3) any nominal head of "gach" or of a word with Poss=Yes is definite
(4) vocative nouns are definite
(5) any nominal qualified by "a dó", "a trí", etc. is definite (this is rare, and also slightly tricky to code, to avoid cases where the number is a different kind of nmod of the noun, like "cruinniú tar éis a dó")
(6) the recursive rule: any nominal which is the head of a definite noun in the genitive case is itself definite
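To make the rules concrete, here is a rough sketch of how (1)-(4) and (6) could be applied to one sentence. It is not the actual patch script. Assumptions, all of them mine: tokens are plain dicts with `id`, `lemma`, `upos`, `feats` and `head`; "nominal" means UPOS NOUN or PROPN; vocatives are marked Case=Voc; "gach" is matched by lemma. Rule (5) and the "cén" half of rule (1) are left out, and rule (6) is applied without checking the deprel.

```python
NOMINAL = {"NOUN", "PROPN"}  # assumption: which UPOS tags count as nominal


def definite_ids(tokens):
    """Return the ids of nominals that rules (1)-(4) and (6) mark as definite."""
    by_id = {t["id"]: t for t in tokens}
    definite = set()

    def is_nominal(tid):
        return tid in by_id and by_id[tid]["upos"] in NOMINAL

    for t in tokens:
        f = t["feats"]
        # (2) proper nouns and (4) vocative nouns
        if t["upos"] == "PROPN" or (t["upos"] in NOMINAL and f.get("Case") == "Voc"):
            definite.add(t["id"])
        # (1) article-bearing words, (3) "gach" and possessives: mark the head
        if f.get("PronType") == "Art" or f.get("Poss") == "Yes" or t["lemma"] == "gach":
            if is_nominal(t["head"]):
                definite.add(t["head"])

    # (6) a definite genitive nominal makes its nominal head definite;
    # repeat until nothing changes so chains of genitives are covered.
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if (t["id"] in definite and t["feats"].get("Case") == "Gen"
                    and is_nominal(t["head"]) and t["head"] not in definite):
                definite.add(t["head"])
                changed = True
    return definite


# "ar an mbord" from sentence 1332 (the head of token 6 lies outside this slice).
EXAMPLE = [
    {"id": 4, "lemma": "ar", "upos": "ADP", "feats": {}, "head": 6},
    {"id": 5, "lemma": "an", "upos": "DET",
     "feats": {"Definite": "Def", "Number": "Sing", "PronType": "Art"}, "head": 6},
    {"id": 6, "lemma": "bord", "upos": "NOUN",
     "feats": {"Case": "NomAcc", "Form": "Ecl", "Gender": "Masc", "Number": "Sing"},
     "head": 1},
]
print(definite_ids(EXAMPLE))  # {6}
```

Running the propagation to a fixpoint rather than in a single pass means the result does not depend on the order of the tokens in the sentence.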

kscanne commented 3 years ago

There's a subtlety with the recursive rule (6) since it doesn't always hold for nouns like "Gaeilge", "Béarla", or "Gaeltacht" despite these being proper nouns, and thus definite by (2):

Of course, focal/leagan/ceantair here are definite, but that comes from the definite article (which would not be permitted if the definiteness propagated from the genitive noun).

tlynn747 commented 3 years ago

Great that it can be automated. Initially I thought it was due to the buggy morphological retrieval process in v2.5, but it seems that there are tons missing in Elaine's original corpus so it became an issue of error propagation. I only noticed it during the feature comparison with the last batch of predicted/corrected trees. I see some pattern where nouns with initial mutations (in the test file at least) are more likely to be missing the Definite feature.

You can add to that list nouns following "san", "faoin", etc.

kscanne commented 3 years ago

The "san", "faoin", "ón", etc. examples fall under the first case, PronType=Art. "cén" has to be treated separately because it doesn't have that feature and the dependency goes in the other direction (it's the head).
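The difference in direction could be handled along these lines. Again just a sketch under my own assumptions: tokens are dicts with `id`, `lemma`, `upos`, `feats` and `head`, "cén" is matched by lemma, and the function name is made up.

```python
NOMINAL = {"NOUN", "PROPN"}  # assumption: which UPOS tags count as nominal


def definite_by_article_or_cen(tokens):
    """Ids of nominals made definite by an article-bearing word or by "cén"."""
    by_id = {t["id"]: t for t in tokens}
    out = set()
    for t in tokens:
        head = by_id.get(t["head"])
        if head is None:
            continue
        # "an", "san", "faoin", "ón", ...: the word with PronType=Art
        # depends on the noun, so the noun is its head.
        if t["feats"].get("PronType") == "Art" and head["upos"] in NOMINAL:
            out.add(head["id"])
        # "cén" has no PronType=Art and is itself the head, so start
        # from the nominal and look up to its head instead.
        if t["upos"] in NOMINAL and head["lemma"] == "cén":
            out.add(t["id"])
    return out
```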

Also noticing that case (5) should be expanded slightly to include things like "airteagal 5", "alt (ii)", etc. without the particle "a".