UniversalDependencies / UD_Irish-IDT


Features for surface form vs. syntactic role #63

Closed kscanne closed 3 years ago

kscanne commented 3 years ago

Generally I'm looking for guidance on when to apply various features in the treebank, and whether the features correspond to the surface form or the word's syntactic role in the sentence. Any feedback would be appreciated!

Here are some examples where this comes up:

(1) In working on the features for NOUNs I found lots of examples where the sentences stray from strict rules of Irish grammar, e.g. for genitives. For example:

ar son a ndíograis (would expect "ar son a ndíograise")
ó thaobh na gnáthóga (would expect "ó thaobh na ngnáthóg")
tar éis do bhás (would expect "tar éis do bháis")

My inclination is to label these as Case=NomAcc since that's the surface form of the nouns even though they "should" be Genitive. The syntax is still captured by the dependency relations.

(2) The same broad issue comes up with nouns that are nominative in form but are syntactically genitive, like these:

"mairtírigh eile Sheachtain na Cásca"
"croílár cháilíocht bheatha Bhaile Átha Cliath"
"trí shráideanna lárchathair Bhaile Átha Cliath"

I'd expect Case=NomAcc on these (the treebank currently labels some of this type as Case=Gen), and again would rely on their being nmod dependents of the preceding NOUN to capture the syntactic role. They also need Definite=Def.

(3) This also came up with labels for initial mutations; I think we agreed it's best to base these only on the surface form and not on the mutation one would expect in a given context (which might not be there because the mutation doesn't apply, or because of an error by the speaker/writer).
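A minimal sketch of what "surface form only" could mean for mutations: the function below looks purely at the spelling of a token relative to its lemma and never at the syntactic context. The function name `surface_mutation`, the returned labels `"Ecl"` and `"Len"`, and the simplified spelling rules are assumptions for illustration, not the treebank's actual tooling.

```python
# Eclipsis spellings for each lemma-initial consonant (simplified orthographic rule).
ECLIPSIS = {"b": "mb", "c": "gc", "d": "nd", "f": "bhf", "g": "ng", "p": "bp", "t": "dt"}
# Consonants that can show lenition in writing (followed by "h").
LENITABLE = set("bcdfgmpst")
VOWELS = set("aeiouáéíóú")


def surface_mutation(form, lemma):
    """Return "Ecl", "Len", or None, judging only by how `form` is spelled.

    Hypothetical helper: it ignores the mutation one would *expect*
    in context and reports only what is visible on the surface.
    """
    f, l = form.lower(), lemma.lower()
    if not f or not l:
        return None
    first = l[0]

    # Eclipsis of a consonant, e.g. díograis -> ndíograis.
    prefix = ECLIPSIS.get(first)
    if prefix and f.startswith(prefix) and not l.startswith(prefix):
        return "Ecl"

    # Eclipsis of a vowel: "n-" before lowercase, bare "n" before a capital.
    if first in VOWELS and (f.startswith("n-") or (form[:1] == "n" and form[1:2].isupper())):
        return "Ecl"

    # Lenition, e.g. bás -> bhás.
    if first in LENITABLE and f.startswith(first + "h") and not l.startswith(first + "h"):
        return "Len"

    # Words beginning with l, r, etc. never show a mutation, so nothing is reported.
    return None
```

For the examples in (1), this reports eclipsis on "ndíograis" and nothing on "gnáthóga", even though the context would call for "ngnáthóg".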

(4) Examples like "cúpla abairt" or "cúig abairt"... should be (I think) Case=NomAcc|Number=Sing. There's currently a mix of NomAcc/Gen after "cúpla" and Sing/Plur after numbers. I can clean those up once I'm sure we're on the same page.
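One way to find the mixed annotations described in (4) is to scan the CoNLL-U files for nouns that directly follow "cúpla" or a NUM token and do not carry Case=NomAcc|Number=Sing. The sketch below is only a heuristic under stated assumptions: it uses linear adjacency rather than the dependency tree, the function names are made up for illustration, and the annotated rows in the sample (including the UPOS given to "cúpla") are invented, not taken from the treebank. Hits are candidates for manual review, not confirmed errors.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as "Case=Gen|Number=Sing" into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def flag_counted_nouns(conllu_text):
    """Return (id, form, feats) for NOUNs right after "cúpla" or a NUM
    that lack Case=NomAcc or Number=Sing (hypothetical review helper)."""
    flagged = []
    prev = None  # (form, upos) of the previous token in the same sentence
    for line in conllu_text.splitlines():
        if not line.strip():
            prev = None  # blank line ends the sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        form, upos, feats = cols[1], cols[3], parse_feats(cols[5])
        if prev is not None and upos == "NOUN":
            prev_form, prev_upos = prev
            if prev_form.lower() == "cúpla" or prev_upos == "NUM":
                if feats.get("Case") != "NomAcc" or feats.get("Number") != "Sing":
                    flagged.append((cols[0], form, cols[5]))
        prev = (form, upos)
    return flagged


def row(idx, form, lemma, upos, feats):
    """Build one 10-column CoNLL-U line with unused columns set to "_"."""
    return "\t".join([str(idx), form, lemma, upos, "_", feats, "_", "_", "_", "_"])


# Invented sample data for illustration only.
sample = "\n".join([
    "# text = cúpla abairt",
    row(1, "cúpla", "cúpla", "DET", "_"),
    row(2, "abairt", "abairt", "NOUN", "Case=Gen|Gender=Fem|Number=Sing"),
    "",
    "# text = cúig abairt",
    row(1, "cúig", "cúig", "NUM", "NumType=Card"),
    row(2, "abairt", "abairt", "NOUN", "Case=NomAcc|Gender=Fem|Number=Sing"),
])

for hit in flag_counted_nouns(sample):
    print(hit)
```

Nouns with no Case or Number feature at all are also flagged, which may be useful while Case features are still being added.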

kscanne commented 3 years ago

Since you mention Case=Dat in the last pull request, that's relevant to this discussion also. Current behavior seems to be to only use Case=Dat when it's a spelling unique to the dative (and I manually checked all of those btw).
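The "spelling unique to the dative" rule can be sketched as a lookup against a noun's paradigm: Case=Dat is assigned only when the form matches a dative spelling that differs from the other case forms. The function `surface_case` and the paradigm dictionaries are hypothetical, and the caller is assumed to pass the form with any initial mutation already removed.

```python
def surface_case(form, paradigm):
    """Pick a Case value from the spelling alone (hypothetical helper).

    `paradigm` maps "nom" and "gen" (and optionally "dat") to singular forms.
    """
    f = form.lower()
    nom = paradigm["nom"].lower()
    gen = paradigm["gen"].lower()
    dat = paradigm.get("dat")
    # Dat only when the dative spelling exists and is distinct from nom and gen.
    if dat is not None and f == dat.lower() and dat.lower() not in (nom, gen):
        return "Dat"
    if f == nom:
        return "NomAcc"
    if f == gen:
        return "Gen"
    return None  # form not covered by this paradigm


eire = {"nom": "Éire", "gen": "Éireann", "dat": "Éirinn"}
bas = {"nom": "bás", "gen": "báis"}

print(surface_case("Éirinn", eire))  # distinct dative spelling
print(surface_case("bás", bas))      # "tar éis do bhás" with lenition removed
```

Under this rule "bás" after "tar éis" comes out as NomAcc, since the written form is the nominative one regardless of the genitive the context expects.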

Especially curious for your input on (2) above, since those examples are grammatically correct with the NomAcc spelling but are mostly labeled Case=Gen in the treebank.

tlynn747 commented 3 years ago

Following Zoom discussion:

Decision taken is to follow the surface form in terms of morphological feature choice (instead of syntactic role).

In other words, although every object of a preposition is syntactically dative, only some words in modern Irish have a distinct dative inflection (e.g. "in Éirinn"). The same goes for strings of multiple nouns, where the internal nouns clearly modify another noun in a genitive way but take the nominative form because of the grammar rules of Irish. Similarly, there is no need for mutation features (Form=Ecl, Form=Len) on words beginning with l, r, etc., which cannot be lenited or eclipsed.

Finally, nouns modified by a number greater than 1 keep the singular form under the rules of Irish, so they will have Number=Sing instead of Number=Plur.

kscanne commented 3 years ago

I'll leave this open until I have a chance to review the cúpla and NUM examples. Most of the others will be sorted when I add Case features as part of resolving issues #44 and #45.