Closed kscanne closed 3 years ago
Interesting Q.... I had understood that "a" is direct (nsubj/obj trace) and "ina" and "inar" are indirect (obl). In which cases would "a" represent an indirect marker?
WRT features, I tried to see whether indirect/direct features are captured in other treebanks, but it's not so easy to find. It seems that where PronType=Ind is used, it's an indefinite marker.
The list of language specific features doesn't indicate any deviation from this either: https://universaldependencies.org/ext-feat-index.html
So it's a question of whether we want to introduce a new language specific feature Value. Often the argument against this is whether or not the information is retrievable through the combination of POS tag and dep label. In this case you could argue that they could be
PART Vb PartType=Vb|PronType=Rel N nsubj
PART Vb PartType=Vb|PronType=Rel N obj
being direct, and
PART Vb PartType=Vb|PronType=Rel N obl
being indirect.
Which relies on correct labelling of the dep rels....
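To make the "retrievable from POS tag plus dep label" argument concrete, here is a minimal sketch of the lookup it implies. The function name and the two deprel sets are assumptions for illustration only; they simply encode the nsubj/obj versus obl split described above and are not an agreed rule.

```python
# Sketch only: derive direct/indirect from UPOS + FEATS + DEPREL,
# assuming nsubj/obj = direct and obl = indirect as suggested above.
DIRECT_DEPRELS = {"nsubj", "obj"}   # assumption
INDIRECT_DEPRELS = {"obl"}          # assumption


def relative_type(upos, feats, deprel):
    """Return "direct", "indirect", or None if the token is not decidable."""
    feat_set = set(feats.split("|"))
    # Only the verbal relative particle is of interest here.
    if upos != "PART" or not {"PartType=Vb", "PronType=Rel"} <= feat_set:
        return None
    if deprel in DIRECT_DEPRELS:
        return "direct"
    if deprel in INDIRECT_DEPRELS:
        return "indirect"
    # Any other label (e.g. mark:prt) gives no answer.
    return None
```

As noted, this is only as good as the deprels themselves: a particle mislabelled nsubj would come out as "direct".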
Leading to the second point, I'm hoping those incorrect ones just belong to the predicted labels? Happy to help review them.
Examples of "a" as indirect relative would be things like "teach a bhfuil ceithre sheomra ann" (sentence 921), "aon achtachán a ndéantar leasú air" (sentence 927). There are also the cases more like your "ina" example, but with the preposition written separately "an ráta ar a gcuirtear obair i gcrích" (sentence 951). In these three cases the particle has deprel "obl".
I'd be fine with the info being retrievable from the dependency relation if that's preferred in UD.
Here are a couple that look mislabeled in training file: "ag ceann an bhóithrín a mbínn ag déanamh gairdeasa faoi" (Sentence 912)... annotated nsubj but there's a first person subject as part of the verb in this case. "...Ghaelscoil úr ... a bhfuil naonúr páistí ar a rollaí" (Sentence 1675). Subject is "naonúr páistí". "bhfuil" should probably also be conj here as well(?), or if it's acl:relcl it should have "Ghaelscoil" as the head.
The frequency of these does pick up later in the file.
Here are the deprels currently assigned to "a" when it has the PronType=Rel feature:
741 nsubj
673 mark:prt
383 obj
235 obl
18 obl:tmod
17 mark
1 fixed
Is there a rule for deciding which are mark:prt? I see these after "nuair" for example, but there are also indirect relatives with this label.
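For anyone who wants to reproduce a tally like the one above, this is a rough sketch of how it could be computed from a CoNLL-U file. The function name and the three-token sample are hypothetical (the sample annotation is illustrative, not copied from the treebank), and the code assumes standard tab-separated 10-column token lines.

```python
from collections import Counter


def count_rel_a_deprels(conllu_text):
    """Count deprels of tokens with form "a" and the feature PronType=Rel."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        form, feats, deprel = cols[1], cols[5], cols[7]
        if form.lower() == "a" and "PronType=Rel" in feats.split("|"):
            counts[deprel] += 1
    return counts


# Hypothetical three-token sample, for illustration only.
SAMPLE = "\n".join([
    "# text = teach a bhfuil",
    "\t".join(["1", "teach", "teach", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "a", "a", "PART", "Vb", "PartType=Vb|PronType=Rel", "3", "obl", "_", "_"]),
    "\t".join(["3", "bhfuil", "bí", "VERB", "_", "_", "1", "acl:relcl", "_", "_"]),
])

for deprel, n in count_rel_a_deprels(SAMPLE).most_common():
    print(n, deprel)
```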
My (undocumented) decision for gd was that they were mark:prt and that nsubj, obj, obl should go into the enhanced dependencies.
Have I grasped the wrong end of the stick, though?
This seems like a perfectly good approach as well (for my own reference, documented here: https://universaldependencies.org/u/overview/enhanced-syntax.html#relative-clauses).
I'd like to check out what you've done, but I don't see the enhanced dependencies on the dev branch of the gd treebank... is this work in progress?
Yes 912 and 1675 are bugs for sure. But interestingly the obl could be seen as the attachment of déanamh and not mbínn!
I don't believe indirect relatives should be mark:prt - if they are, they should have been corrected by the reviewers.
mark:prt is used for relative clause markers: "Nuair a", "a deir", "Sa bhaile chomh maith a bhí Máire Ní Choilm", etc.
But if it's a relative clause pronoun (who, to-whom, which etc) then it takes its grammatical argument role (nsubj, obj, obl) as per the UD guidelines. This change was introduced instead of a "rel" label with the v2 guidelines. I discussed the Irish cases with Joakim Nivre at the time. https://universaldependencies.org/u/dep/all.html#mark-marker
Some treebanks may choose to only display this in enhanced dependencies (ED), but ED is not a priority right now as there are too many bugs to clean up in the existing data.
@kscanne I'm happy to share the cleanup/ review of these "a"s
Lots going on here. Thanks for the offer to review — I'm happy to split the work once we settle on the right annotation!
After pondering this for a while, I don't think relying on the deprels is going to suffice for a few reasons. First, because of examples like:
an scríbhneoir a molann na mic léinn é vs. an scríbhneoir a mholann na mic léinn
(straight from p. 6 of McCloskey's "Transformational Syntax..." book). The first case is the indirect relative because of the resumptive pronoun "é", but the second is direct because it's omitted. Both are perfectly grammatical, and in both cases the current annotation scheme would assign "obj" to the relativizer (if I understand correctly).
Second, in the indirect case, the relativized noun can play roles other than obj or obl. For example, it can correspond to a possessive like in "An fear a raibh a mhac san ospidéal". There are examples like this in the treebank, and the ones I've found so far mark the relativizer as obl — that seems wrong. Sentence 1164 is of this type: "Sin an fear a bhfuil a mhac ag imeacht", as is sentence 4013: "... i gcoigeartú struchtúrach na réigiún a bhfuil a bhforbairt tite ar gcúl..." ("in structural adjustment of the regions whose development has fallen behind")
Third, there are alternations like "mar atá" vs. "mar a bhfuil" or "mar a bhíonn" vs. "mar a mbíonn" (see sentences 3674 and 3621 for examples of the latter two) and nothing in the annotation now to distinguish the two different relativizers at play.
I agree with mark:prt in cases like "Nuair a..." or, say, "Fad a bhí..." but cleft examples like "Sa bhaile chomh maith a bhí Máire Ní Choilm" could be analyzed as nsubj, obj, or obl (obl in this case I'd say by looking at the declefted version).
Final complication is that there's also a third, different "a"; this is the one meaning "all that" (cf. Christian Brothers p.145 under "Compound Relative" and sense 2 of the entry for the relative particle "a" in FGB: https://www.teanglann.ie/ga/fgb/a). This is usually analyzed as a pronoun whereas the other two are particles. In the treebank they're not distinguished from the relative particles even though they are very different syntactically; there's a good example in sentence 1896: "...ionsaí an Tribune ar a bhfuil fágtha de shaoirsí sibhialta agam sa stát lofa seo..." ("the Tribune's attack on all that is left of my civil liberties in this rotten state").
Upshot of all of this is that I think we need a new feature value to distinguish the two relativizing particles; I don't see another way to get this information out of the current scheme. McCloskey gives them the names "aL" and "aN" (leniting/nasalizing). I'm for anything as long as the info is there!
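Since the aL/aN names refer to the mutation each particle triggers, one way to pre-sort candidates for manual review would be to look at the initial mutation of the word following "a". The sketch below is a deliberately conservative heuristic, not a proposed annotation rule: the function name and prefix lists are assumptions, and it returns None whenever the spelling shows no mutation (for example "a molann", "a raibh", "a deir"), since those cases need a human or a lexicon.

```python
# Orthographic prefixes that signal eclipsis (nasalization) and lenition.
# "bhf" must be tested before "bh", so eclipsis is checked first.
ECLIPSIS_PREFIXES = ("mb", "gc", "nd", "bhf", "ng", "bp", "dt", "n-")
LENITION_PREFIXES = ("bh", "ch", "dh", "fh", "gh", "mh", "ph", "sh", "th")


def guess_relativizer(next_word):
    """Guess "aN" (eclipsing) or "aL" (leniting) from the following word.

    Returns None when the word shows no visible mutation.
    """
    word = next_word.lower()
    if word.startswith(ECLIPSIS_PREFIXES):
        return "aN"
    if word.startswith(LENITION_PREFIXES):
        return "aL"
    return None


for w in ["bhfuil", "ndéantar", "mholann", "molann", "raibh"]:
    print(w, guess_relativizer(w))
```

The undecidable cases are exactly why an explicit feature on the particle would be useful: the information cannot always be read off the surface form.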
I'd like to check out what you've done, but I don't see the enhanced dependencies on the dev branch of the gd treebank... is this work in progress?
Not yet, no. Ordinary dependencies not finished yet and quite a lot to fix still!
@kscanne Just jotting down notes from our video call:
There are 6 "a"s at play here. The decision to be made is around the introduction of a new feature that captures whether a relative pronoun is direct or indirect.
I'll try to summarise what they are and give the proposed dealings for each in terms of dependency labels and features.
Propose to keep these as mark:prt and add the feature Form=Direct
cé mhéad fuinnimh a bheidh le fáil "how much energy would be available"
In this case a is marked nsubj - (Beidh fuinnimh le fáil)
rudaí a sheoltar ar ais "things that were sent back" In this case a is labelled obj (Seoltar rudaí ar ais)
Both of these would have the feature Form=Direct
This will take the feature Form=Indirect
We can't attach two objects to the verb, so the a is resumptive. Propose to label this as mark:prt, but still give it the feature Form=Indirect
Propose to label these as mark:prt and also to assign the feature Form=Indirect
In this case, a is nsubj. The dependency labels are assigned as per the relative particles in (1)
Propose to change the UPOS from PART to PRON. Currently of the position that these additional Form features are not required.
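If the Form=Direct / Form=Indirect proposal goes ahead, adding the value to existing FEATS strings could look roughly like the sketch below. The helper name is hypothetical, and the sort order reflects my understanding that UD expects features in case-insensitive alphabetical order by name; please check against the validator before relying on it.

```python
def add_feature(feats, name, value):
    """Return a FEATS string with name=value added (or overwritten)."""
    # "_" is the CoNLL-U placeholder for an empty feature set.
    if feats in ("", "_"):
        pairs = {}
    else:
        pairs = dict(item.split("=", 1) for item in feats.split("|"))
    pairs[name] = value
    # Assumption: features are sorted by lowercased name.
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)


print(add_feature("PartType=Vb|PronType=Rel", "Form", "Direct"))
```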
One question here is whether other languages capture this distinction between Direct and Indirect? Attempts to find out through sifting through UD documentation and Grew-match searches have failed! @ftyers oh speaker of many UD languages! Do you know of other languages that would have needed to make this distinction? (Scroll up to the start of this thread - the need in Irish is based on morphological-syntactic triggers of eclipsis/lenition etc)
Hi there! Sorry for the delay in replying. :) So, starting with Breton, Press (1986) in §3.2.2, there seems to be something similar:
There are also some notes from Stephens (1993),
The few examples that I have in the treebank have a completely different structure,
# sent_id = grammar.vislcg.txt:60:1174
# text = Setu an den a gontas an istor dimp.
# text[eng] = Here is the man who told us the story.
# labels = stephens_1993 to_check
1 Setu setu ADV adv _ 0 root _ _
2 an an DET det _ 3 det _ _
3 den den NOUN n Gender=Masc|Number=Sing 1 nsubj _ _
4 a a AUX vpart _ 5 aux _ _
5 gontas kontañ VERB vblex Number=Sing|Person=3|Tense=Past|VerbForm=Fin 3 acl _ _
6 an an DET det _ 7 det _ _
7 istor istor NOUN n Gender=Masc|Number=Sing 5 obj _ _
8-9 dimp _ _ _ _ _ _ _ SpaceAfter=No
8 d da ADP pr _ 9 case _ _
9 imp indirect PRON prn Case=Acc|Number=Plur|Person=1|PronType=Prs 5 obl _ _
10 . . PUNCT sent _ 1 punct _ _
But I'm not set on it. And this would seem like something useful to standardise on inter-Celtically! Perhaps we could move the issue to the main issues discussion and call in @jheinecke too?
Hi all,
Welsh is very similar to Breton in this respect. a is a relative pronoun which can be either subject or object (context dependent). In the Welsh treebank it is annotated as such, and the head of a (always a verb) is an acl:relcl:
# text = Dyma restr o wledydd a ddaeth yn annibynnol oddi wrth Sbaen
# text[en] = Here is a list of countries who became independent from Spain
1 Dyma dyma ADV adv _ 2 advmod _ _
2 restr rhestr NOUN noun Gender=Fem|Mutation=SM|Number=Sing 0 root _ _
3 o o ADP prep _ 4 case _ _
4 wledydd gwlad NOUN noun Gender=Fem|Mutation=SM|Number=Plur 2 nmod _ _
5 a a PRON rel PronType=Rel 6 nsubj _ _
6 ddaeth dod VERB verb Mood=Ind|Mutation=SM|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 4 acl:relcl _ _
7 yn yn PART pred _ 8 case:pred _ _
8 annibynnol annibynnol ADJ pos Degree=Pos 6 advmod _ _
9 oddi oddi ADP prep _ 11 case _ _
10 wrth wrth ADP prep _ 9 fixed _ _
11 Sbaen Sbaen PROPN place Gender=Masc|Number=Sing 8 obl _ SpaceAfter=No
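The convention just described (relative pronoun a attached to a verb that is itself acl:relcl) is easy to check mechanically. Below is a small sketch using two token lines from the sentence above; the function name is hypothetical, and splitting on whitespace is an assumption that only works because no field in this excerpt contains a space (real CoNLL-U is tab-separated).

```python
def rel_pronoun_head_deprels(token_lines):
    """For each PRON with PronType=Rel, return (form, deprel of its head)."""
    rows = [line.split() for line in token_lines]
    by_id = {row[0]: row for row in rows}
    found = []
    for row in rows:
        if row[3] == "PRON" and "PronType=Rel" in row[5].split("|"):
            head = by_id.get(row[6])  # None if the head is outside the excerpt
            found.append((row[1], head[7] if head else None))
    return found


# Tokens 5 and 6 of the Welsh sentence above.
EXCERPT = [
    "5 a a PRON rel PronType=Rel 6 nsubj _ _",
    "6 ddaeth dod VERB verb Mood=Ind|Mutation=SM|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 4 acl:relcl _ _",
]

print(rel_pronoun_head_deprels(EXCERPT))
```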
This Welsh relative pronoun a is the same as in Breton. This is why I disagree with the Breton sentence shown by @ftyers:
4 a a AUX vpart _ 5 aux _ _
should be
4 a a PRON vpart _ 5 nsubj _ _
In Breton the "relative construction" has nearly become the standard construction: me a zo brezhoneg "I am Breton" (lit. "It [is] me who is Breton").
In the sentence from Press (1986) (Ar c'haz a gavas va c'hi anezhañ) a is subject of kavas and the presence of anezhañ makes this unambiguous. I would annotate anezhañ as obl to kavas:
1 Ar an DET 2 det
2 c'haz kaz NOUN 0 root
3 a a PRON 4 obj
4 gavas kavout VERB 2 acl:relcl
5 va va PRON 6 nmod:poss
6 c'hi ki NOUN 4 nsubj
7-8 anezhañ _ _ _ _
7 a a ADP 8 case
8 hi hi PRON 4 obl
lit. "The cat which my dog found of him"
Constructions like Breton ar poatr a oan a kaozeal gantañ "the boy I was chatting with" exist in Welsh as well: y rhaglen a gwrandais i arno "the programme I was listening to".
I agree that a common annotation, at least for the Celtic languages, would be great. I'd go for PRON for a and acl:relcl for its head.
Hi @ftyers and @jheinecke - if I understand correctly, you're saying that the instances of the relativiser "a" that you list above are more akin to examples 2 and 3 above?
Do you have similar uses to 1, 4, 5 and 6 at all?
And @ftyers, back to the other q - do you know of other languages that need to distinguish between Direct/Indirect in their features?
Hi @tlynn747, at least for Welsh, examples like 2 and 3 occur frequently.
Examples like 1) Nuair a chonaic mé "when I saw" would not have an a, only a soft mutation: Pan welais i "when I saw".
4) "the writer who the students praise" would use y as «relator»: yr ysgrifennwr y mae'r myfyrwyr yn ei ganmol (it's more a coordination than a relative pronoun: "The writer [,] the students praise him"). But "the writer who I saw" is yr ysgrifennwr a welais i.
NB: these are just two quick examples; I have not yet had the time to search more thoroughly, cf. Gareth King: Modern Welsh, Routledge 1993, §485.
I'll check for 5 and 6.
I'm closing this now since the Irish features have been added. Happy to discuss harmonization in an issue for the main project if anyone's interested.
Yep, it would be good to continue in the main docs issues :)
Both varieties of the relative marker "a" are annotated as PART with features PartType=Vb|PronType=Rel, and so can't currently be distinguished without reference to the wider context. Adding this info could potentially improve parser accuracy — I've noticed that some of the indirect relatives that should be obl are annotated as nsubj or obj. Having this would also make it easier to do grammar checking and QA of the treebank.
Any thoughts on the best way to include this extra information? I'm happy to do the work to add whatever annotation is deemed best to capture this distinction.