Closed tlynn747 closed 3 years ago
Nice! This will be a big improvement for sure. There's no real difference syntactically between the examples that are currently treated as compound prepositions and others that aren't ("le haghaidh" for example). This will bring them more in line with each other.
éis, cionn, comhair are all substantives; NOUN is certainly correct for those — I can't think of an example that wouldn't be ADP+NOUN.
If I could expand the scope of this issue a tiny bit, it would be good to document exactly the list of compound prepositions that are annotated with "fixed" on the page https://universaldependencies.org/ga/dep/fixed.html (and with a cross-reference to that list from https://universaldependencies.org/ga/pos/ADP.html).
I'm not against keeping some list of them as fixed, since I think this has value cross-linguistically, mainly because it would be strange for some of them to have the NOUN in the compound be the head of the PP (like your cranberry examples above). But there's a spectrum for sure; the Christian Brothers list examples like "i lár" as a compound preposition, but I'd much prefer "lár" as the head in "i lár na páirce". I'm not sure what the right criterion is for deciding this though.
Issue #65 would fall under the scope of any wider review of these compound preps.
Only a few demonstrative examples are given of compound prepositions on those two links. The full list is under the feature description: https://universaldependencies.org/ga/feat/PrepForm.html
Working on these now - but first step is to assign PrepForm=Cmpd to both tokens before converting the 2nd ADP to NOUN. Also planning to keep Cmpd as XPOS for these nouns.
Updated POS tags for the nominal element of compound prepositions: UPOS = NOUN, XPOS = Cmpd.
This XPOS might help with differentiating the potential "substantive nouns" or cranberry words - separating them from regular nouns. Not tied to it though if you think they should be treated like all nouns.
Features of this NOUN now include PrepForm=Cmpd
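For anyone following along, here is a minimal sketch of the retagging steps described above: PrepForm=Cmpd goes on both tokens first, then the second ADP becomes NOUN with XPOS Cmpd. It is not the script actually used for the treebank. The function names `add_feature` and `retag_compound` are hypothetical, and it assumes each token is held as a list of the ten CoNLL-U columns.

```python
def add_feature(feats: str, key: str, value: str) -> str:
    """Add key=value to a CoNLL-U FEATS string, keeping any existing features."""
    pairs = {} if feats == "_" else dict(f.split("=", 1) for f in feats.split("|"))
    pairs[key] = value
    # CoNLL-U expects features sorted by name, ignoring case.
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)


def retag_compound(sentence):
    """Retag ADP + ADP(fixed) pairs as ADP + NOUN, both with PrepForm=Cmpd.

    Columns per token: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    """
    by_id = {tok[0]: tok for tok in sentence}
    for tok in sentence:
        head = by_id.get(tok[6])
        if head is None or tok[7] != "fixed":
            continue
        if tok[3] == "ADP" and head[3] == "ADP":
            # Step 1: mark both tokens as parts of a compound preposition.
            head[5] = add_feature(head[5], "PrepForm", "Cmpd")
            tok[5] = add_feature(tok[5], "PrepForm", "Cmpd")
            # Step 2: the nominal element becomes NOUN, keeping Cmpd as XPOS.
            tok[3] = "NOUN"
            tok[4] = "Cmpd"
    return sentence
```

Because `add_feature` merges into the existing FEATS value rather than overwriting it, any noun features already on the token survive the retagging.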
This is a huge improvement! A couple more things we might consider before closing this issue: (1) There are still some of these compound prepositions that aren't marked as fixed or PrepForm=Cmpd and should be retagged for consistency, e.g. "I dteannta" in sentence 985, "I rith" in sentences 1961, 4128, 4205, etc. Probably needs a full review. (2) Could we demutate the lemmas across the board? For compound prepositions like "de bharr", the lemma of the noun is currently "bharr"... "barr" would be much better.
Yes for both of these!
I know you have a nifty way of pulling out dodgy lemmas - do you have a script that can help with (1)?
I just handled (2) and those will be in the next PR I submit.
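A rough sketch of what demutating a lemma could look like, for reference. This is not the code behind the change mentioned above. `demutate` is a hypothetical helper that only strips the common initial mutations (eclipsis, lenition, h-prefix), and it will mangle words that legitimately begin with those letters (loanwords in "h", function words like "chun"), so the output would still need a manual check.

```python
# Eclipsis prefixes; the original consonant is always the last letter.
ECLIPSIS = ("bhf", "mb", "gc", "nd", "ng", "bp", "dt")
LENITABLE = "bcdfgmpst"
VOWELS = "aeiouáéíóú"


def demutate(word: str) -> str:
    """Heuristically strip an initial mutation from an Irish word form."""
    lower = word.lower()
    # Eclipsis first, so that "bhf" is not mistaken for lenited "bh".
    for prefix in ECLIPSIS:
        if lower.startswith(prefix):
            return word[len(prefix) - 1:]
    # Lenition: drop the "h" after a lenitable initial consonant.
    if len(lower) > 1 and lower[0] in LENITABLE and lower[1] == "h":
        return word[0] + word[2:]
    # h-prefix before a vowel, as in "le haghaidh".
    if len(lower) > 1 and lower[0] == "h" and lower[1] in VOWELS:
        return word[1:]
    return word
```

With this, "bharr" comes out as "barr" and "bhfianaise" as "fianaise", while an unmutated form such as "measc" is returned unchanged.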
I wrote a quick script to search for any of the fixed compound phrases that don't have the fixed deprel on the second token. Important to note that there are surely plenty of false positives here since I didn't check to see if these are playing the role of a preposition with a nominal head. For example, "ar aghaidh" is officially on the list but is almost always just an adverb ("dul ar aghaidh" vs. the relatively rare "ar aghaidh an dorais").
Results needing review: https://gist.github.com/kscanne/c36f018308b9ff77e105374f453be6f6
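The kind of search described above can be sketched roughly as follows. This is not the script that produced the gist. `COMPOUND_PREPS` is only an illustrative subset of the documented list, and `read_sentences` and `unfixed_candidates` are hypothetical names. Like the original search, it does not check whether the phrase is actually acting as a preposition, so false positives are expected.

```python
# Illustrative subset only; the full list is on the PrepForm page.
COMPOUND_PREPS = {
    ("le", "haghaidh"), ("i", "measc"), ("ar", "aghaidh"), ("de", "bharr"),
}


def read_sentences(lines):
    """Yield (sent_id, tokens) pairs from an iterable of CoNLL-U lines."""
    sent_id, tokens = "", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
        elif not line:
            if tokens:
                yield sent_id, tokens
            sent_id, tokens = "", []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():  # skip multiword ranges and empty nodes
                tokens.append(cols)
    if tokens:
        yield sent_id, tokens


def unfixed_candidates(lines):
    """List listed phrases whose second token does not carry deprel 'fixed'."""
    hits = []
    for sent_id, tokens in read_sentences(lines):
        for first, second in zip(tokens, tokens[1:]):
            pair = (first[1].lower(), second[1].lower())
            if pair in COMPOUND_PREPS and second[7] != "fixed":
                hits.append((sent_id, f"{first[1]} {second[1]}"))
    return hits
```

Passing an open treebank file (for example `open("ga_idt-ud-train.conllu", encoding="utf-8")`) as `lines` would return the sentence IDs and surface forms to review by hand.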
The other thing to work out is what to do with the noun features that will exist for the nouns that get retagged as "fixed" and PrepForm=Cmpd. Based on your earlier remarks I'm guessing you'll prefer keeping those, but then for consistency it would be good to add them globally.
That's a really useful list. GRMA. Just looking at a few of them, and you're right about the false positives: "i láthair" in the test file is part of a light verb construction, "cuirtear i láthair", so I think that's an outlier. I'm not so sure about "i bhfianaise" either; I think "bhfianaise" should be the head of the PP. Same for "in ionad", "i dteannta", etc. I'm happy to handle them, as they need a manual check for the reasons you point out and the examples here.
Re the features. PrepForm relates to prepositional form - I wanted to capture that the NOUN was part of a compound preposition. So this wouldn't apply to all fixed phrases.
Do you think those NOUNS should have regular noun features also? I'm open to that, just didn't have the capacity for it in this recent change!
That's right re: that "i láthair" example. I expect most of the "ar aghaidh" examples are similarly just adverbials. At minimum the noun should have an nmod dependent... normally I'd also say to look for one with Case=Gen, but there are lots of exceptions to that rule.
I definitely like PrepForm on both the ADP and NOUN in these fixed phrases.
And I'd be in favor of keeping the noun features... this will make it possible to do a lot of checks without needing special cases for the fixed phrases.
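The nmod heuristic mentioned above could be sketched like this, as a first-pass filter on candidates where the noun is still annotated as the head. The helper names are hypothetical, tokens are assumed to be lists of the ten CoNLL-U columns, and the Case=Gen check is kept separate since, as noted, it has lots of exceptions.

```python
def nmod_dependents(tokens, head_id):
    """Return tokens attached to head_id with deprel nmod (or a subtype)."""
    return [
        tok for tok in tokens
        if tok[6] == head_id and tok[7].split(":")[0] == "nmod"
    ]


def has_nmod_dependent(tokens, noun_id):
    """Minimum requirement: the noun governs at least one nmod."""
    return bool(nmod_dependents(tokens, noun_id))


def has_genitive_dependent(tokens, noun_id):
    """Stricter check: one of the nmod dependents is marked Case=Gen."""
    return any(
        "Case=Gen" in tok[5].split("|")
        for tok in nmod_dependents(tokens, noun_id)
    )
```

A candidate like "ar aghaidh" with no nmod under "aghaidh" would be set aside as a likely adverbial, while one with an nmod (as in "ar aghaidh an dorais") would go forward for manual review.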
I just went through all of these and added PrepForm=Cmpd where it was missing. There are a few other phrases that I would consider to be compound prepositions but which aren't in your documented list:
faoi réir, as in sentence 264: Faoi réir fho-alt (5) "Subject to subsection (5)"
i leith, as in sentence 3062: caiteachas eile i leith oiliúna "other spending on training"
And some variant spellings of compound prepositions that are currently on the documented list:
i gcomhair, in several places, e.g. sentence 1802: scríobh sé i gcomhair thréimhseachán an Bhainc Ceannais "He wrote for the Central Bank's periodical". This is spelled "i gcóir" in your list and in the Christian Brothers' grammar despite being "i gcomhair" in FGB.
fé bhráid, alt of "faoi bhráid"
ós cionn, alt of "os cionn"
in ainneoin, similar to "d'ainneoin"
These all seem like reasonable additions?
Yes all very reasonable additions, thanks for finding them.
I hadn't noticed the "i gcóir" instance - I only know the "i gcomhair" version myself!
Have done manual review based on list provided above. Results below as an FYI.
Also note that I added PrepForm=Cmpd to the existing NOUN features (instead of replacing them). Not sure if this is a help or a hindrance for a global update of noun features in compound prepositions!
ga_idt-ud-dev.conllu (fixed?)
Line 664, i gcuideachta: NO
Line 1781, D' ainneoin: YES
Line 3957, Le linn: YES
Line 6511, i láthair: NO
Line 7657, In aice: YES
Line 7692, i gcaitheamh: YES
Line 11020, i measc: YES
Line 11295, i measc: YES

ga_idt-ud-test.conllu
Line 2218, i gCeann: NO
Line 2338, i láthair: NO
Line 2855, i measc: YES
Line 3229, i gcaitheamh: YES
Line 3587, le haghaidh: YES
Line 3859, ar Aghaidh: NO
Line 4166, le haghaidh: YES
Line 4913, De réir: YES
Line 4937, De réir: YES
Line 6057, le haghaidh: YES
Line 7658, I measc: YES
Line 7885, ar aghaidh: YES
Line 8735, i measc: YES
Line 9981, i measc: YES

ga_idt-ud-train.conllu
Line 447, ar aghaidh: NO
Line 1337, le haghaidh: YES
Line 2725, ar aghaidh: NO
Line 3250, I dteannta: YES
Line 5582, ar aghaidh: NO
Line 7158, ar aghaidh: NO
Line 8665, ar aghaidh: NO
Line 9744, in ionad: NO (cuir in ionad)
Line 10188, I measc: YES
Line 10884, ar aghaidh: NO
Line 11755, I measc: YES
Line 12261, i measc: YES
Line 14042, I ndiaidh: YES
Line 14287, Ar nós: YES
Line 14513, i measc: YES
Line 15499, i gcaitheamh: YES
Line 15590, i measc: YES
Line 15605, ar chúl: YES
Line 17387, in ionad: YES
Line 17595, i bhfianaise: YES
Line 22365, ar aghaidh: NO
Line 23159, le haghaidh: YES
Line 24673, ar aghaidh: NO
Line 24716, ar aghaidh: NO
Line 26868, i bhfianaise: NO
Line 27625, i Rith: YES
Line 31248, i gcaitheamh: YES
Line 31634, De réir: YES
Line 34333, i measc: YES
Line 36595, I gceann: YES
Line 36759, faoi cheann: NO (...de na)
Line 37224, in ionad: NO (cuir in ionad)
Line 37381, i measc: YES
Line 37542, i measc: YES
Line 38393, le haghaidh: YES
Line 41793, I gceann: YES
Line 42433, i Lár: YES
Line 43943, Ar nós: YES
Line 44057, D' ainneoin: YES
Line 44771, ar aghaidh: YES
Line 44874, le haghaidh: YES
Line 45804, ar aghaidh: NO
Line 46264, ar aghaidh: NO
Line 46330, go Ceann: NO
Line 46526, I bhfianaise: YES
Line 49823, go Ceann: NO
Line 50041, le haghaidh: NO
Line 50806, ar aghaidh: NO
Line 51812, ar aghaidh: NO
Line 52715, in ionad: YES
Line 53233, faoi bhráid: YES
Line 53402, le haghaidh: NO
Line 53926, I measc: YES
Line 54061, in ionad: YES
Line 54625, ar nós: YES
Line 55105, in Ionad: NO
Line 56369, i láthair: NO (cuir..)
Line 56391, in Ionad: NO
Line 56432, Le linn: YES
Line 56691, ar aghaidh: NO
Line 56865, ar fud: YES
Line 57982, I dteannta: YES (fixed)
Line 58203, i bhfianaise: YES
Line 58618, I rith: YES
Line 58630, i measc: YES
Line 58948, I measc: YES
Line 59024, ar aghaidh: NO
Line 59525, ar aghaidh: NO
Line 60250, le haghaidh: YES
Line 60654, in ionad: YES
Line 61518, le haghaidh: YES
Line 61797, le haghaidh: YES
Line 62009, le haghaidh: YES
Line 62037, faoi bhráid: YES
Line 62242, in ionad: YES
Line 62551, D' ainneoin: YES
Line 63374, le haghaidh: YES
Line 63706, i gcaitheamh: YES
Line 64155, ar aghaidh: NO
Line 64823, i measc: YES
Line 65424, i measc: YES
Line 65779, i láthair: NO
Line 66583, ar aghaidh: NO
Line 66930, i measc: YES
Line 67031, i measc: YES
Line 68528, faoi bhráid: YES
Line 68740, le haghaidh: YES
Line 68743, ar nós: YES
Line 70276, le haghaidh: YES
Line 70476, le haghaidh: YES
Line 70916, le haghaidh: YES
Line 71564, le haghaidh: YES
Line 72611, ar aghaidh: NO
Line 72693, le haghaidh: YES
Line 72930, le haghaidh: YES
Line 73227, d' ainneoin: YES
Line 73261, in ionad: NO (cuir)
Line 73566, ar aghaidh: NO
Line 74368, i dteannta: YES (fixed, not CmpdPrep)
Line 74531, le haghaidh: YES
Line 74959, i láthair: NO
Line 76093, faoi bhun: YES
Line 76435, i measc: YES
Line 77193, le haghaidh: YES
Line 77740, le haghaidh: YES
Line 77751, ar nós: YES
Line 78383, ar aghaidh: NO
Line 78493, i dteannta: YES (fixed, not CmpdPrep)
Line 79133, le haghaidh: YES
Line 79688, le haghaidh: YES
Line 79729, le haghaidh: YES
Line 79927, le haghaidh: YES
Line 80967, ar aghaidh: NO
Line 82439, ar aghaidh: NO
Line 82793, i measc: YES
Line 83262, Le haghaidh: YES
Line 83746, le haghaidh: YES
Line 83967, Le linn: YES
Line 84570, i measc: YES
Line 85046, I rith: YES
Line 86024, ar aghaidh: NO
Line 86912, le haghaidh: YES
Line 87346, I rith: YES
Line 87387, le haghaidh: YES
Line 87585, I measc: YES
Line 87598, le haghaidh: YES
Line 88218, le haghaidh: YES
Line 88656, le haghaidh: YES
Line 88962, le haghaidh: YES
Line 89181, le haghaidh: YES
Line 89928, ar aghaidh: NO
Line 90836, le haghaidh: YES
Line 90850, faoi bhráid: YES
Line 91694, in ionad: YES
Line 91731, i dteannta: YES (i dteannta leis - fixed, not CmpdPrep)
Line 92601, le haghaidh: YES
Line 92683, ar aghaidh: NO
Line 93000, De réir: YES
Line 93148, in ionad: YES
Line 93428, le haghaidh: YES
Line 94048, ar nós: NO
Line 94089, Le linn: YES
Line 95077, le haghaidh: YES
Line 95520, le haghaidh: YES
Line 96784, i measc: YES
Line 96831, faoi bhráid: YES
Line 99423, in ionad: YES
Line 99506, ar aghaidh: NO
Line 99646, le haghaidh: YES
Line 100895, le haghaidh: YES
Line 101243, Le linn: YES
Line 101357, i measc: YES
Line 101987, i bhfianaise: YES
Line 104733, le haghaidh: YES
Line 106680, ar aghaidh: NO
Line 106874, i measc: YES
Line 107612, le haghaidh: YES
Thanks for going through these. All that's left here is to add the correct NOUN features where they're needed, and I just wrote a script to do that. Will submit a PR once you've merged #134.
Compound prepositions (majority of which are clear prepositions + nouns) are currently tagged as ADP + ADP.
This was because in the original Irish Dependency Treebank the compound prepositions were multiword tokens (in_aice, os_cionn, etc.). UD v2 didn't allow for multiword tokens, so they were subsequently split, using ADP for both tokens because the unit as a whole plays the role of a preposition and not all 2nd tokens are clearly nouns. https://github.com/fosterjen/Irish-Universal-Dependency-Treebank/issues/85
However, since then, the enhanced UD guidelines stipulate that "adpositions can take the form of fixed multiword expressions, such as in spite of, because of, thanks to. The component words are then still tagged according to their basic use" https://universaldependencies.org/u/pos/all.html#adp-adposition
The compound prepositions trigger the genitive case in the subsequent noun, due to the presence of a noun in the fixed expression. While the parser seems to be able to predict Case=Gen for these nouns based on the fixed dep label, changing the second token's POS to NOUN would reflect this more directly.
However, some of these 2nd tokens are cranberry words (e.g. éis, cionn, comhair) and are not necessarily nouns.
Any thoughts on this? Is it an artefact from Old Irish? I want to be consistent in how they're labelled, so it's not ideal to label some as NOUN and others as ADP, nor is it ideal to just assign NOUN to a word because the others follow the same pattern.