UniversalDependencies / UD_Irish-IDT

Irish data

NounType clarifications #125

Closed kscanne closed 3 years ago

kscanne commented 3 years ago

Currently these features are used inconsistently (Slender/NotSlender, Weak/Strong).

I should be able to correct them all with a script, but I need a couple of clarifications on their intended scope (and it might be good to add notes to the Irish guidelines: https://universaldependencies.org/ga/feat/NounType.html)

  1. Is NounType=Slender/NotSlender intended to be used for adjectives modifying genitive plural nouns? e.g. "Bord na gCeantar Cúng". I'd say it's not really needed since these would all be NotSlender, and since I presume the feature is intended to flag when lenition of the adjective is needed... not relevant in the genitive.

  2. NounType=Weak/Strong is used primarily on genitive plural nouns currently, which makes sense, but I see a couple hundred places where it's used on nominative nouns, and even on some adjectives. Restricting it to the genitive plural seems reasonable, since again that's the only time it's relevant (here, for predicting the form of the adjective that follows).
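To see where each NounType value actually occurs, a quick audit along these lines might help. This is only a sketch, not the script used for the cleanup: the function names are hypothetical, and the HEAD/DEPREL values in the sample rows are placeholders. Only the FORM, LEMMA, UPOS, XPOS and FEATS columns are taken from examples in this thread.

```python
from collections import Counter


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict ('_' means no features)."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def audit_nountype(conllu_text):
    """Count (NounType, UPOS, Case, Number) combinations over all tokens."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        feats = parse_feats(cols[5])
        if "NounType" in feats:
            key = (
                feats["NounType"],
                cols[3],  # UPOS
                feats.get("Case", "-"),
                feats.get("Number", "-"),
            )
            counts[key] += 1
    return counts


# Two token rows; HEAD and DEPREL are placeholder values for illustration.
SAMPLE = "\n".join([
    "# text = Bord na gCeantar Cúng",
    "4\tCúng\tcúng\tADJ\tAdj\tCase=Gen|NounType=Weak|Number=Plur\t3\tamod\t_\t_",
    "",
    "# text = ábhair nithiúla a úsáid",
    "2\tnithiúla\tnithiúil\tADJ\tAdj\tNounType=Slender|Number=Plur\t1\tamod\t_\t_",
])

for combo, n in sorted(audit_nountype(SAMPLE).items()):
    print(n, combo)
```

Running this over all three files would show, for instance, how many Weak/Strong values sit on tokens that are not genitive plural.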

tlynn747 commented 3 years ago

A few things probably need to be noted here first. If the inconsistencies are across the board, then it's a problem. If they are only in the latter part of the training set as discussed before, then it's understandable. ALL the features have been predicted and the linguists' contracts ended before they could review them. I think it's important to differentiate between what is a general inconsistency in annotation and what is part of the work-in-progress nature of the treebank development.

The origins of the features are from Elaine's morphological analyser and we mapped them to UD.

sent 2121: ábhair nithiúla a úsáid

    nithiúla nithiúil ADJ Adj NounType=Slender|Number=Plur

originated from:

    ábhair ábhar+Noun+Masc+Com+Pl
    nithiúla nithiúil+Adj+Com+Slender+Pl

sent 791: Bord na gCeantar Cúng

    Cúng cúng ADJ Adj Case=Gen|NounType=Weak|Number=Plur

originated from:

    gCeantar ceantar+Noun+Masc+Gen+Weak+Pl+DefArt
    Cúng cúng+Adj+Gen+Weak+Pl
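For reference, the mapping implied by the two examples above could be sketched roughly as follows. This is a reconstruction from those examples only, not the project's actual conversion code: the table `TAG_TO_FEAT` and the function `analysis_to_feats` are hypothetical names, the `Strong` tag is assumed by analogy with `Weak`, and `Com` is assumed to produce no Case feature on adjectives because sentence 2121 shows none.

```python
# Hypothetical analyser-tag -> (UD feature, value) table, inferred from the
# examples in this thread. Tags not listed here (Adj, Com, Masc, ...) are
# ignored by this sketch.
TAG_TO_FEAT = {
    "Gen": ("Case", "Gen"),
    "Pl": ("Number", "Plur"),
    "Slender": ("NounType", "Slender"),
    "NotSlen": ("NounType", "NotSlender"),
    "Weak": ("NounType", "Weak"),
    "Strong": ("NounType", "Strong"),  # assumed tag name
}


def analysis_to_feats(analysis):
    """Split 'lemma+Tag+Tag...' and build a sorted UD FEATS string."""
    lemma, *tags = analysis.split("+")
    feats = {}
    for tag in tags:
        if tag in TAG_TO_FEAT:
            name, value = TAG_TO_FEAT[tag]
            feats[name] = value
    feat_string = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return lemma, feat_string


print(analysis_to_feats("nithiúil+Adj+Com+Slender+Pl"))
print(analysis_to_feats("cúng+Adj+Gen+Weak+Pl"))
```

Because each analysis carries at most one of Slender/NotSlen/Weak/Strong, the dict never has to hold two NounType values in this sketch.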

See Elaine's thesis, Appendix B, Table 1 and 3 (or attached screenshots if they display OK)

PhD_Elaine_Final_searchable.pdf

Screenshot 2021-04-12 at 15 03 13 Screenshot 2021-04-12 at 15 02 51

I can't find any discussion in the text re Weak/Strong but Chapter 5 discusses Slender/Broad in some instances.

If this info is already captured by these PAROLE tags, which we have mapped to UD, then I opt for richer data and vote to keep what we have. Cleanup is definitely needed for the predicted tags, but I'd be interested to know what errors were in the original data.

I have a list of items to add to the documentation, so I can include this stuff too.

kscanne commented 3 years ago

Ok, I'll keep NounType for genitive plurals (all NotSlender) and will add it to plural adjectives when it's missing.

Regarding the inconsistencies (unless I'm not understanding something in the guidelines) there are problems in all three files. See sentence 2 for example; in "cinn óga", the adjective is annotated NotSlender and in "scoilteanna fada" there's no NounType feature.

tlynn747 commented 3 years ago

OK, that's what I meant by "if the inconsistencies are across the board, then it's a problem".

The original version of that sentence is:

    Bíonn bí+Verb+PresImp
    cinn ceann+Noun+Masc+Com+Pl
    óga óg+Adj+Com+NotSlen+Pl
    ........
    scoilteanna scoilt+Noun+Fem+Com+Pl
    fada fada+Adj+Fem+Com+Pl

I'm sure I've already shared this original POS tagged file with you, but I'll send it on again and you'll be able to use it as a reference to see if it's a mapping problem, a new bug or an inconsistency - which can help determine the way forward.

kscanne commented 3 years ago

Thanks, looking through the PAROLE data now. This helps clarify what the features/guidelines ought to be for the treebank:

  1. Slender/NotSlender is used only on plural adjectives, and not in the genitive.
  2. Strong/Weak is used only on genitive plural nouns and the adjectives that modify them.

This has the side benefit that there's no need for two feature values "NounType=NotSlender,Weak" on genitive plural adjectives like in the "Cúng" example above.
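As a rough illustration, the two rules could be checked per token like this. It is a sketch under stated assumptions: `check_nountype` is a hypothetical helper, it looks only at a token's own UPOS and features, and it does not verify that a Weak/Strong adjective actually modifies a genitive plural noun, which would need the dependency tree.

```python
def check_nountype(upos, feats):
    """Return a list of rule violations for one token (empty if none).

    `feats` is a dict of UD features, e.g. {"Case": "Gen", "Number": "Plur"}.
    """
    value = feats.get("NounType")
    if value is None:
        return []

    plural = feats.get("Number") == "Plur"
    genitive = feats.get("Case") == "Gen"
    problems = []

    if value in ("Slender", "NotSlender"):
        # Rule 1: only on plural adjectives, and not in the genitive.
        if upos != "ADJ":
            problems.append(f"{value} on non-adjective ({upos})")
        if not plural:
            problems.append(f"{value} on non-plural token")
        if genitive:
            problems.append(f"{value} on genitive token")
    elif value in ("Weak", "Strong"):
        # Rule 2: only on genitive plural nouns and their adjectives.
        if upos not in ("NOUN", "ADJ"):
            problems.append(f"{value} on unexpected UPOS ({upos})")
        if not (plural and genitive):
            problems.append(f"{value} outside the genitive plural")
    else:
        problems.append(f"unexpected NounType value: {value}")

    return problems


# The "Cúng" token as currently annotated passes rule 2:
print(check_nountype("ADJ", {"Case": "Gen", "NounType": "Weak", "Number": "Plur"}))
# A Weak value on a nominative noun is flagged:
print(check_nountype("NOUN", {"Case": "Nom", "NounType": "Weak", "Number": "Plur"}))
```

Whether PROPN should be accepted alongside NOUN for Weak/Strong is left open here.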

If you're ok with the above, I'll make this consistent throughout.

tlynn747 commented 3 years ago

Yes Yes and Yes :)