UniversalDependencies / UD_Irish-IDT

Irish data

Distinguish direct and indirect relative markers in annotation #110

Closed kscanne closed 3 years ago

kscanne commented 3 years ago

Both varieties of the relative marker "a" are annotated as PART with features PartType=Vb|PronType=Rel, and so can't currently be distinguished without reference to the wider context. Adding this info could potentially improve parser accuracy — I've noticed that some of the indirect relatives that should be obl are annotated as nsubj or obj. Having this would also make it easier to do grammar checking and QA of the treebank.

Any thoughts on the best way to include this extra information? I'm happy to do the work to add whatever annotation is deemed best to capture this distinction.

tlynn747 commented 3 years ago

Interesting Q.... I had understood that "a" is direct (nsubj/obj trace) and "ina" and "inar" are indirect (obl). In which cases would "a" represent an indirect marker?

WRT features, I tried to see whether indirect/direct features are captured in other treebanks, but it's not so easy to find. It seems that where PronType=Ind is used, it marks an indefinite.

The list of language specific features doesn't indicate any deviation from this either: https://universaldependencies.org/ext-feat-index.html

So it's a question of whether we want to introduce a new language specific feature Value. Often the argument against this is whether or not the information is retrievable through the combination of POS tag and dep label. In this case you could argue that they could be

PART Vb PartType=Vb|PronType=Rel N nsubj
PART Vb PartType=Vb|PronType=Rel N obj

being direct, and

PART Vb PartType=Vb|PronType=Rel N obl

being indirect

Which relies on correct labelling of the dep rels....
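The deprel-based reading described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of any treebank tooling; the function name `relative_type_from_deprel` and the returned strings are hypothetical.

```python
from typing import Optional

# Mapping suggested in the comment above: nsubj/obj -> direct, obl -> indirect.
DIRECT_DEPRELS = {"nsubj", "obj"}
INDIRECT_DEPRELS = {"obl"}


def relative_type_from_deprel(upos: str, feats: str, deprel: str) -> Optional[str]:
    """Guess direct/indirect for a relative particle from its deprel alone.

    Returns None when the token is not a relative particle, or when the
    deprel (for example mark:prt) does not settle the question.
    """
    feat_set = set(feats.split("|")) if feats != "_" else set()
    if upos != "PART" or "PronType=Rel" not in feat_set:
        return None
    if deprel in DIRECT_DEPRELS:
        return "direct"
    if deprel in INDIRECT_DEPRELS:
        return "indirect"
    return None


print(relative_type_from_deprel("PART", "PartType=Vb|PronType=Rel", "obl"))
```

As noted, a lookup like this is only as good as the dependency labels it reads.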

Leading to the second point, I'm hoping those incorrect ones just belong to the predicted labels? Happy to help review them.

kscanne commented 3 years ago

Examples of "a" as indirect relative would be things like "teach a bhfuil ceithre sheomra ann" (sentence 921), "aon achtachán a ndéantar leasú air" (sentence 927). There are also the cases more like your "ina" example, but with the preposition written separately "an ráta ar a gcuirtear obair i gcrích" (sentence 951). In these three cases the particle has deprel "obl".

I'd be fine with the info being retrievable from the dependency relation if that's preferred in UD.

Here are a couple that look mislabeled in training file: "ag ceann an bhóithrín a mbínn ag déanamh gairdeasa faoi" (Sentence 912)... annotated nsubj but there's a first person subject as part of the verb in this case. "...Ghaelscoil úr ... a bhfuil naonúr páistí ar a rollaí" (Sentence 1675). Subject is "naonúr páistí". "bhfuil" should probably also be conj here as well(?), or if it's acl:relcl it should have "Ghaelscoil" as the head.

The frequency of these does pick up later in the file.

Here are the deprels currently assigned to "a" when it has the PronType=Rel feature:

741 nsubj
673 mark:prt
383 obj
235 obl
 18 obl:tmod
 17 mark
  1 fixed
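A tally like the one above could be reproduced with a short script along these lines. It is a minimal sketch assuming standard tab-separated CoNLL-U input; the function name `count_relative_a_deprels` is hypothetical and the sample lines are made up for illustration, not copied from the treebank.

```python
from collections import Counter
from typing import Iterable


def count_relative_a_deprels(lines: Iterable[str]) -> Counter:
    """Count deprels of tokens with form 'a' that carry PronType=Rel."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        form, feats, deprel = cols[1], cols[5], cols[7]
        if form.lower() == "a" and "PronType=Rel" in feats.split("|"):
            counts[deprel] += 1
    return counts


# Tiny made-up sample (illustrative only), tab-separated.
sample = [
    "# text = (illustrative only)",
    "1\tteach\tteach\tNOUN\t_\t_\t0\troot\t_\t_",
    "2\ta\ta\tPART\tVb\tPartType=Vb|PronType=Rel\t3\tobl\t_\t_",
    "3\tbhfuil\tbí\tVERB\t_\t_\t1\tacl:relcl\t_\t_",
]
for deprel, n in count_relative_a_deprels(sample).most_common():
    print(f"{n:4d} {deprel}")
```

To run it on a real file, pass an open file handle (opened with `encoding="utf-8"`) instead of the sample list.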

Is there a rule for deciding which are mark:prt? I see these after "nuair" for example, but there are also indirect relatives with this label.

colinbatchelor commented 3 years ago

My (undocumented) decision for gd was that they were mark:prt and that nsubj, obj, obl should go into the enhanced dependencies.

Have I grasped the wrong end of the stick, though?

kscanne commented 3 years ago

> My (undocumented) decision for gd was that they were mark:prt and that nsubj, obj, obl should go into the enhanced dependencies.
>
> Have I grasped the wrong end of the stick, though?

This seems like a perfectly good approach as well (for my own reference, documented here: https://universaldependencies.org/u/overview/enhanced-syntax.html#relative-clauses).

I'd like to check out what you've done, but I don't see the enhanced dependencies on the dev branch of the gd treebank... is this work in progress?

tlynn747 commented 3 years ago

> Examples of "a" as indirect relative would be things like "teach a bhfuil ceithre sheomra ann" (sentence 921), "aon achtachán a ndéantar leasú air" (sentence 927). There are also the cases more like your "ina" example, but with the preposition written separately "an ráta ar a gcuirtear obair i gcrích" (sentence 951). In these three cases the particle has deprel "obl".
>
> I'd be fine with the info being retrievable from the dependency relation if that's preferred in UD.
>
> Here are a couple that look mislabeled in training file: "ag ceann an bhóithrín a mbínn ag déanamh gairdeasa faoi" (Sentence 912)... annotated nsubj but there's a first person subject as part of the verb in this case. "...Ghaelscoil úr ... a bhfuil naonúr páistí ar a rollaí" (Sentence 1675). Subject is "naonúr páistí". "bhfuil" should probably also be conj here as well(?), or if it's acl:relcl it should have "Ghaelscoil" as the head.
>
> The frequency of these does pick up later in the file.
>
> Here are the deprels currently assigned to "a" when it has the PronType=Rel feature:
>
> 741 nsubj
> 673 mark:prt
> 383 obj
> 235 obl
>  18 obl:tmod
>  17 mark
>   1 fixed
>
> Is there a rule for deciding which are mark:prt? I see these after "nuair" for example, but there are also indirect relatives with this label.

Yes 912 and 1675 are bugs for sure. But interestingly the obl could be seen as the attachment of déanamh and not mbínn!

I don't believe indirect relatives should be mark:prt - if they are, they should have been corrected by the reviewers.

mark:prt is used for relative clause markers: Nuair a, a deir, Sa bhaile chomh maith a bhí Máire Ní Choilm, etc.

But if it's a relative clause pronoun (who, to-whom, which etc) then it takes its grammatical argument role (nsubj, obj, obl) as per the UD guidelines. This change was introduced instead of a "rel" label with the v2 guidelines. I discussed the Irish cases with Joakim Nivre at the time. https://universaldependencies.org/u/dep/all.html#mark-marker

Some treebanks may choose to only display this in enhanced dependencies (ED), but ED is not a priority right now as there are too many bugs to clean up in the existing data.

@kscanne I'm happy to share the cleanup/ review of these "a"s

kscanne commented 3 years ago

Lots going on here. Thanks for the offer to review — I'm happy to split the work once we settle on the right annotation!

After pondering this for a while, I don't think relying on the deprels is going to suffice for a few reasons. First, because of examples like:

an scríbhneoir a molann na mic léinn é vs. an scríbhneoir a mholann na mic léinn

(straight from p. 6 of McCloskey's "Transformational Syntax..." book). The first case is the indirect relative because of the resumptive pronoun "é", but the second is direct because it's omitted. Both are perfectly grammatical, and in both cases the current annotation scheme would assign "obj" to the relativizer (if I understand correctly).

Second, in the indirect case, the relativized noun can play roles other than obj or obl. For example, it can correspond to a possessive like in "An fear a raibh a mhac san ospidéal". There are examples like this in the treebank, and the ones I've found so far mark the relativizer as obl — that seems wrong. Sentence 1164 is of this type: "Sin an fear a bhfuil a mhac ag imeacht", as is sentence 4013: "... i gcoigeartú struchtúrach na réigiún a bhfuil a bhforbairt tite ar gcúl..." ("in structural adjustment of the regions whose development has fallen behind")

Third, there are alternations like "mar atá" vs. "mar a bhfuil" or "mar a bhíonn" vs. "mar a mbíonn" (see sentences 3674 and 3621 for examples of the latter two) and nothing in the annotation now to distinguish the two different relativizers at play.

I agree with mark:prt in cases like "Nuair a..." or, say, "Fad a bhí..." but cleft examples like "Sa bhaile chomh maith a bhí Máire Ní Choilm" could be analyzed as nsubj, obj, or obl (obl in this case I'd say by looking at the declefted version).

Final complication is that there's also a third, different "a"; this is the one meaning "all that" (cf. Christian Brothers p.145 under "Compound Relative" and sense 2 of the entry for the relative particle "a" in FGB: https://www.teanglann.ie/ga/fgb/a). This is usually analyzed as a pronoun whereas the other two are particles. In the treebank they're not distinguished from the relative particles even though they are very different syntactically; there's a good example in sentence 1896: "...ionsaí an Tribune ar a bhfuil fágtha de shaoirsí sibhialta agam sa stát lofa seo..." ("the Tribune's attack on all that is left of my civil liberties in this rotten state").

Upshot of all of this is that I think we need a new feature value to distinguish the two relativizing particles; I don't see another way to get this information out of the current scheme. McCloskey gives them the names "aL" and "aN" (leniting/nasalizing). I'm for anything as long as the info is there!
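As a rough illustration of the aL/aN distinction, the mutation on the verb that follows the particle can be read off the spelling. The sketch below is a heuristic under strong assumptions, not a description of how the treebank was annotated: the function names are hypothetical, it looks only at orthography, and it returns no guess for forms where spelling shows no mutation (for example "molann" or the dependent form "raibh").

```python
from typing import Optional

ECLIPSIS_PREFIXES = ("bhf", "mb", "gc", "nd", "ng", "bp", "dt")
LENITABLE = set("bcdfgmpst")
VOWELS = set("aeiouáéíóú")


def initial_mutation(word: str) -> str:
    """Return 'eclipsis', 'lenition', or 'none' based on spelling alone."""
    w = word.lower()
    # Eclipsis is checked first because 'bhf' also starts with 'bh'.
    if w.startswith(ECLIPSIS_PREFIXES):
        return "eclipsis"
    # Eclipsis of a vowel-initial word is written with 'n-'.
    if w.startswith("n-") and len(w) > 2 and w[2] in VOWELS:
        return "eclipsis"
    if len(w) > 1 and w[0] in LENITABLE and w[1] == "h":
        return "lenition"
    return "none"


def guess_relativizer(verb_form: str) -> Optional[str]:
    """Map the mutation to McCloskey's labels; None means undecided."""
    mutation = initial_mutation(verb_form)
    if mutation == "eclipsis":
        return "aN"  # nasalizing particle, indirect relative
    if mutation == "lenition":
        return "aL"  # leniting particle, direct relative
    return None


for form in ["bhfuil", "ndéantar", "mholann", "molann", "raibh"]:
    print(form, guess_relativizer(form))
```

The undecided cases are exactly where an explicit feature on the particle carries information that the surface form of the verb does not.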

colinbatchelor commented 3 years ago

> I'd like to check out what you've done, but I don't see the enhanced dependencies on the dev branch of the gd treebank... is this work in progress?

Not yet, no. Ordinary dependencies not finished yet and quite a lot to fix still!

tlynn747 commented 3 years ago

@kscanne Just jotting down notes from our video call:

There are 6 "a"s at play here. The decision to be made is around the introduction of a new feature that captures whether a relative pronoun is direct or indirect.

I'll try to summarise what they are and give the proposed dealings for each in terms of dependency labels and features.

  1. Relative markers such as those found in:
     Nuair a chonaic mé "when I saw"
     Fad a bhí sé ann "while he was there"
     XYZ a dúirt sé "XYZ he said"

Propose to keep these as mark:prt and add the feature Form=Direct

  2. Relative pronouns that represent a missing subject or object in the relative clause

cé mhéad fuinnimh a bheidh le fáil "how much energy would be available"
In this case a is marked nsubj - (Beidh fuinnimh le fáil)

rudaí a sheoltar ar ais "things that were sent back"
In this case a is labelled obj (Seoltar rudaí ar ais)

Both of these would have the feature Form=Direct

  3. Relative pronouns that represent a missing oblique (PP head)
     an teach inar thug sé an chuid ba mhó dá óige "the house in which he spent most of his youth"
     In this case inar is labelled obl

This will take the feature Form=Indirect

  4. Resumptive pronouns, whereby the element which is represented is actually still present in the relativised clause
     an scríbhneoir a molann na mic léinn é "the writer who the students praise"

We can't attach two objects to the verb, so the a is resumptive. Propose to label this as mark:prt, but still give it the feature Form=Indirect

  5. Possessive relativisers - where the a represents the missing possessor in the relativised clause
     An fear a raibh a mhac san ospidéal "the man whose son was in hospital"

Propose to label these as mark:prt and also to assign the feature Form=Indirect

  6. Cases of a that are regarded as pronouns instead of particles (according to FGB https://www.teanglann.ie/ga/fgb/a)
     An bhfuair tú a bhfuil uait? "Did you get what you wanted?"

In this case, a is nsubj. The dependency labels are assigned as per the relative particles in (1)

Propose to change the UPOS from PART to PRON. Our current position is that the additional Form features are not required for these.
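For reference, the six cases summarised above can be written down as a small lookup table. This is a sketch of the proposal as described in these notes only; the class name `Proposal`, the field names, and the helper `form_feature` are hypothetical, and the table does not claim to reflect what was finally committed to the treebank.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Proposal:
    description: str
    deprels: Tuple[str, ...]    # deprels named for this case in the notes
    form: Optional[str]         # proposed Form value, or None if not required
    upos_change: Optional[str]  # new UPOS if a change is proposed


PROPOSALS: Dict[int, Proposal] = {
    1: Proposal("relative marker (Nuair a, Fad a)", ("mark:prt",), "Direct", None),
    2: Proposal("missing subject or object", ("nsubj", "obj"), "Direct", None),
    3: Proposal("missing oblique (inar)", ("obl",), "Indirect", None),
    4: Proposal("resumptive pronoun present", ("mark:prt",), "Indirect", None),
    5: Proposal("possessive relativiser", ("mark:prt",), "Indirect", None),
    # nsubj is the label in the example given; other roles may apply.
    6: Proposal("pronoun 'all that'", ("nsubj",), None, "PRON"),
}


def form_feature(case: int) -> Optional[str]:
    """Return the proposed feature string for a case, e.g. 'Form=Direct'."""
    form = PROPOSALS[case].form
    return f"Form={form}" if form else None


for number, proposal in PROPOSALS.items():
    print(number, proposal.description, proposal.deprels, form_feature(number))
```

Laid out this way, it is easy to see that mark:prt alone does not separate case 1 from cases 4 and 5, which is what the Form feature is meant to do.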

One question here is whether other languages capture this distinction between Direct and Indirect? Attempts to find out through sifting through UD documentation and Grew-match searches have failed! @ftyers oh speaker of many UD languages! Do you know of other languages that would have needed to make this distinction? (Scroll up to the start of this thread - the need in Irish is based on morphological-syntactic triggers of eclipsis/lenition etc)

ftyers commented 3 years ago

Hi there! Sorry for the delay in replying. :) So, starting with Breton, Press (1986) in §3.2.2, there seems to be something similar:

[image]

There are also some notes from Stephens (1993),

[image]

The few examples that I have in the treebank have a completely different structure:

[image]

# sent_id = grammar.vislcg.txt:60:1174
# text = Setu an den a gontas an istor dimp.
# text[eng] = Here is the man who told us the story.
# labels = stephens_1993 to_check
1   Setu    setu    ADV adv _   0   root    _   _
2   an  an  DET det _   3   det _   _
3   den den NOUN    n   Gender=Masc|Number=Sing 1   nsubj   _   _
4   a   a   AUX vpart   _   5   aux _   _
5   gontas  kontañ  VERB    vblex   Number=Sing|Person=3|Tense=Past|VerbForm=Fin    3   acl _   _
6   an  an  DET det _   7   det _   _
7   istor   istor   NOUN    n   Gender=Masc|Number=Sing 5   obj _   _
8-9 dimp    _   _   _   _   _   _   _   SpaceAfter=No
8   d   da  ADP pr  _   9   case    _   _
9   imp indirect    PRON    prn Case=Acc|Number=Plur|Person=1|PronType=Prs  5   obl _   _
10  .   .   PUNCT   sent    _   1   punct   _   _

But I'm not set on it. And this would seem like something useful to standardise on inter-Celtically! Perhaps we could move the issue to the main issues discussion and call in @jheinecke too?

jheinecke commented 3 years ago

Hi all, Welsh is very similar to Breton in this respect. a is a relative pronoun which can be either subject or object (context dependent). In the Welsh treebank it is annotated as such and the head of a (always a verb) is an acl:relcl:

# text = Dyma restr o wledydd a ddaeth yn annibynnol oddi wrth Sbaen
# text[en] = Here is a list of countries who became independent from Spain
1   Dyma    dyma    ADV adv _   2   advmod  _   _
2   restr   rhestr  NOUN    noun    Gender=Fem|Mutation=SM|Number=Sing  0   root    _   _
3   o   o   ADP prep    _   4   case    _   _
4   wledydd gwlad   NOUN    noun    Gender=Fem|Mutation=SM|Number=Plur  2   nmod    _   _
5   a   a   PRON    rel PronType=Rel    6   nsubj   _   _
6   ddaeth  dod VERB    verb    Mood=Ind|Mutation=SM|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   4   acl:relcl   _   _
7   yn  yn  PART    pred    _   8   case:pred   _   _
8   annibynnol  annibynnol  ADJ pos Degree=Pos  6   advmod  _   _
9   oddi    oddi    ADP prep    _   11  case    _   _
10  wrth    wrth    ADP prep    _   9   fixed   _   _
11  Sbaen   Sbaen   PROPN   place   Gender=Masc|Number=Sing 8   obl _   SpaceAfter=No
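The pattern in this sentence (a relative pronoun whose head verb is attached as acl:relcl) can be checked mechanically. The sketch below assumes standard tab-separated CoNLL-U; the function names are hypothetical, and the sample is an excerpt of tokens 4 to 6 from the sentence above, rebuilt with explicit tabs.

```python
from typing import Dict, Iterator, Tuple


def parse_tokens(conllu: str) -> Dict[int, dict]:
    """Parse token lines into a dict keyed by integer token ID."""
    tokens: Dict[int, dict] = {}
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 8-9) and malformed lines
        tokens[int(cols[0])] = {
            "form": cols[1],
            "upos": cols[3],
            "feats": cols[5],
            "head": int(cols[6]),
            "deprel": cols[7],
        }
    return tokens


def relative_pronoun_heads(tokens: Dict[int, dict]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (pronoun form, pronoun deprel, head form, head deprel)."""
    for tok in tokens.values():
        if "PronType=Rel" in tok["feats"].split("|"):
            head = tokens.get(tok["head"])
            if head is not None:
                yield tok["form"], tok["deprel"], head["form"], head["deprel"]


rows = [
    ["4", "wledydd", "gwlad", "NOUN", "noun", "Gender=Fem|Mutation=SM|Number=Plur", "2", "nmod", "_", "_"],
    ["5", "a", "a", "PRON", "rel", "PronType=Rel", "6", "nsubj", "_", "_"],
    ["6", "ddaeth", "dod", "VERB", "verb",
     "Mood=Ind|Mutation=SM|Number=Sing|Person=3|Tense=Past|VerbForm=Fin", "4", "acl:relcl", "_", "_"],
]
sample = "\n".join("\t".join(r) for r in rows)

for item in relative_pronoun_heads(parse_tokens(sample)):
    print(item)
```

Any relative pronoun whose head carries a deprel other than acl:relcl would show up in the output as a candidate for review.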

This Welsh relative pronoun a is the same as in Breton. This is why I disagree with the Breton sentence shown by @ftyers:

4   a   a   AUX vpart   _   5   aux _   _

should be

4   a   a   PRON    vpart   _   5   nsubj   _   _

In Breton the "relative construction" has nearly become the standard construction. me a zo brezhoneg "I am Breton" (lit. "It [is] me who is Breton").

In the sentence from Press (1986) (Ar c'haz a gavas va c'hi anezhañ) a is subject of kavas and the presence of anezhañ makes this non-ambiguous. I would annotate anezhañ as obl of kavas:

1   Ar  an  DET 2   det
2   c'haz   kaz NOUN    0   root
3   a   a   PRON    4   obj
4   gavas   kavout  VERB    2   acl:relcl
5   va  va  PRON    6   nmod:poss
6   c'hi    ki  NOUN    4   nsubj
7-8 anezhañ _   _   _   _
7   a   a   ADP 8   case
8   hi  hi  PRON    4   obl

lit. "The cat which my dog found of him"

Constructions like Breton ar poatr a oan a kaozeal gantañ "the boy I was chatting with" do exist in Welsh as well: y rhaglen a gwrandais i arno "the programme I was listening to".

I agree a common annotation, at least for the Celtic languages, would be great. I'll go for PRON for a and acl:relcl for its head.

tlynn747 commented 3 years ago

Hi @ftyers and @jheinecke - if I understand correctly, you're saying that the instances of the relativiser "a" that you list above are more akin to examples 2 and 3 above?

Do you have similar uses to 1, 4, 5 and 6 at all?

And @ftyers, back to the other q - do you know of other languages that need to distinguish between Direct/Indirect in their features?

jheinecke commented 3 years ago

Hi @tlynn747, at least for Welsh, examples like 2 and 3 occur frequently.

Examples like 1) Nuair a chonaic mé "when I saw" would not have an a, only a soft mutation: Pan welais i "when I saw".

4) "the writer who the students praise" would use y as «relator»: yr ysgrifennwr y mae'r myfyrwyr yn ei ganmol (it's more a coordination than a relative pronoun: "The writer [,] the students praise him"). But "the writer who I saw": yr ysgrifennwr a welais i.

NB: these are just two quick examples; I have not yet had time to search more thoroughly, cf. Gareth King: Modern Welsh, Routledge 1993, §485.

I'll check for 5 and 6.

kscanne commented 3 years ago

I'm closing this now since the Irish features have been added. Happy to discuss harmonization in an issue for the main project if anyone's interested.

ftyers commented 3 years ago

Yep, it would be good to continue in the main docs issues :)