UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Constructions around colons #933

Closed amir-zeldes closed 7 months ago

amir-zeldes commented 1 year ago

There are a lot of possible configurations in which a colon separates two parts of an orthographic sentence. Currently there is a wide range of analyses for these, and I'm not sure how many distinct kinds should be recognized or how they should be annotated. Here is a brief overview of a few recurring types and how they are currently annotated in UD_English-GUM: (I have simplified some of the examples and reduced the head to 'root' even when embedded in something else, so please understand root to mean 'local root')

  1. Heading: subheading (mostly appos; should maybe be parataxis?)
    • a dissertation entitled "Continuity/xcomp: Study/appos in Methodology"
    • "Grit : Perseverance/appos and Passion for Long-term Goals"
  2. Category: members (should be appos?), or just a single appositive set apart by colon
    • Two languages: Gur/appos and Kwa
    • One word: possibilities/appos
    • 3 local nightclubs/root : L'instant/appos , Caraibes and Latin Club
  3. Key-value tables: (unsure about these, many examples currently behave like these have regular syntax)
    • Bats: Left/advmod; Throws: Right/advmod (of a baseball player who bats with the left hand but throws with the right)
    • tel/nsubj : 1/root 201/flat 944-3737/flat
    • fax/nsubj : 0590854959/root
    • e-mail/nsubj : turismo@merida.gob.mx/root
    • pronounced: Wootch/xcomp (of the name of the Polish city Łódź)
  4. Explicitation, or translation language (in these the corpus currently has the token after the colon as the head)
    • Greek/dep: Αθήνα/root, Athína
    • IATA/dep : IFN/root (specifying an airport's code in the IATA system)
    • Latin/nmod: Eusebius/root Sophronius Hieronymus (nmod is probably wrong here, other cases all seem to be dep)
  5. Attribution and sources (image credits and others; currently consistently left-to-right dep)
  6. Scores: (not sure if this is just a subtype of one of the above?)
    • Internet/root : 1/dep ; Scientology/parataxis : 0/dep

Are there any thoughts on which of these might be distinct/the same, and what the correct analysis should be? They are not 100% consistent in GUM, but most examples of each of these types do follow the pattern above (probably because annotators searched for existing cases and followed majority behaviors). I feel especially torn about ones like 3. above, where the content relation clearly saturates some valency relation (incl. manner adjuncts like 'throws right'), but the marking is unusual (esp. for subject-predicates).
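To make comparisons like the ones above concrete, here is a minimal Python sketch that reads a CoNLL-U fragment and reports which dependency links a token before the colon to a token after it. Only the `root` and `appos` labels for the type 2 example come from the overview above; the remaining columns (lemmas, UPOS tags, the attachment of the colon, `nummod`, `cc`, `conj`) and the helper names `parse_conllu` and `edges_across_colon` are assumptions made for illustration.

```python
# Hypothetical CoNLL-U encoding of "Two languages: Gur and Kwa".
# Only root(languages) and appos(languages, Gur) are taken from the thread;
# every other value is an illustrative assumption.
SAMPLE = """\
1\tTwo\ttwo\tNUM\t_\t_\t2\tnummod\t_\t_
2\tlanguages\tlanguage\tNOUN\t_\t_\t0\troot\t_\t_
3\t:\t:\tPUNCT\t_\t_\t4\tpunct\t_\t_
4\tGur\tGur\tPROPN\t_\t_\t2\tappos\t_\t_
5\tand\tand\tCCONJ\t_\t_\t6\tcc\t_\t_
6\tKwa\tKwa\tPROPN\t_\t_\t4\tconj\t_\t_
"""


def parse_conllu(text):
    """Return one dict per basic token line (id, form, upos, head, deprel)."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword ranges and empty nodes
            continue
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return tokens


def edges_across_colon(tokens):
    """List (head_form, deprel, dependent_form) for edges that span a colon."""
    by_id = {t["id"]: t for t in tokens}
    edges = []
    for colon in (t for t in tokens if t["form"] == ":"):
        for tok in tokens:
            if tok is colon or tok["head"] == 0:
                continue
            lo, hi = sorted((tok["id"], tok["head"]))
            if lo < colon["id"] < hi:
                edges.append((by_id[tok["head"]]["form"], tok["deprel"], tok["form"]))
    return edges


print(edges_across_colon(parse_conllu(SAMPLE)))
```

Running the same function over a whole treebank would give a rough inventory of the relations currently used across colons, which is one way to check how consistent each of the six types is in practice.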

nschneid commented 1 year ago

https://universaldependencies.org/u/dep/appos.html explicitly covers key-value pairs, where there is basically structured data rather than linguistic structure. Sports scores are arguably an instance of key-value pairs (like they would be displayed on a scoreboard). I'd probably handle (3)–(6) the same.

nschneid commented 1 year ago

Agree that (1) headings should be parataxis

amir-zeldes commented 1 year ago

appos explicitly covers key-value pairs

I think the cases described there are of the more straightforward kind, where both the key and the value are nominals. This works well for cases like 2., but for things like "throws: right", we have several rather counter-intuitive consequences of using appos: one is that we attach a finite verb to something like an adverbial modifier of that verb with appos (so an unlikely POS combination for an apposition), and another is that the construction is not reversible (?right: throws). Additionally, apposition suggests that the resulting whole is an NP, but I think this is more like a pro-dropped version of "(he) throws right(handed)", as evidenced by the finite VBZ form.

I can certainly apply an appos guideline robotically to these cases just because the colon is there, but is that really what we want?

I'd probably handle (3)–(6) the same

At least for 4. I have a problem with this, since I think the first term is easily omissible and therefore should not be the head. It's really just a modifier explaining the notation system from which the head comes:

  • Isfahan airport (IATA: IFN)
  • Isfahan airport (IFN)
  • ?? Isfahan airport (IATA)

So I think it should be ?deprel?(IFN, IATA)

Agree that (1) headings should be parataxis

This was also my gut reaction, but I noticed that these particular cases actually do fulfill the normal appos criteria (reversible, result in a single NP). I just feel that for headings which are not NPs we will have to have parataxis so I'm a little torn about it (e.g. "Star Wars: The Empire Strikes/parataxis back" seems the best option).

nschneid commented 1 year ago

appos explicitly covers key-value pairs

I think the cases described there are of the more straightforward kind, where both the key and the value are nominals.

I think the rule of key-value pairs is a special case that overlaps with but extends beyond the substitutable nominal cases. For example, if a list of business hours had:

  Monday-Friday: 9am-7pm
  Saturday: 9am-noon
  Sunday: closed

That is not really a standard apposition construction, and "closed" is in no way equivalent to "Sunday", but it is a list of key-value pairs connected by list, and therefore should be appos as I read the guidelines.

This works well for cases like 2., but for things like "throws: right", we have several rather counter-intuitive consequences of using appos: one is that we attach a finite verb to something like an adverbial modifier of that verb with appos (so an unlikely POS combination for an apposition), and another is that the construction is not reversible (?right: throws). Additionally, apposition suggests that the resulting whole is an NP, but I think this is more like a pro-dropped version of "(he) throws right(handed)", as evidenced by the finite VBZ form.

I think there are two readings. One is that it is a key-value pair occurring in a list, in which case see above. Another is as an elliptical sentence, in which case perhaps the colon could be ignored in favor of advmod(bats, left) or similar.

I'd probably handle (3)–(6) the same

At least for 4. I have a problem with this, since I think the first term is easily omissible and therefore should not be the head. It's really just a modifier explaining the notation system from which the head comes:

  • Isfahan airport (IATA: IFN)
  • Isfahan airport (IFN)
  • ?? Isfahan airport (IATA)

So I think it should be ?deprel?(IFN, IATA)

I suppose there is a pattern of the form CODE: EXPRESSION, where by "CODE" I mean the name of a language or other notational system, and the use of the colon is motivated by the key-value nature of this construction. I would say "EXPRESSION" functions as a metalinguistic mention in this pattern. In the standard apposition construction with substitutability it would be a referring expression rather than a metalinguistic mention.

Agree that (1) headings should be parataxis

This was also my gut reaction, but I noticed that these particular cases actually do fulfill the normal appos criteria (reversible, result in a single NP). I just feel that for headings which are not NPs we will have to have parataxis so I'm a little torn about it (e.g. "Star Wars: The Empire Strikes/parataxis back" seems the best option).

Even if in some cases there is semantic equivalence between the two parts, I think this is a broader heading-subheading construction, so parataxis seems like the better fit.

amir-zeldes commented 1 year ago

OK, I don't find any of that unreasonable per se, but I do want to hear what others think about this. I feel somewhat strongly about the 'code' cases having the content as a head, since it really seems like the language/code name is an optional modifier there. For headings I could be persuaded to do parataxis across the board (meaning even nominal cases are really parataxis of two independent fragments, I guess?). Although if we do that, I would think that should apply to key-value pairs too, since they are even less appos-like (less equivalent/reversible than parts of a title). I don't think that being connected by a list relation should figure into this though, since we need to be able to rule on cases that appear in isolation.

Another is as an elliptical sentence, in which case perhaps the colon could be ignored in favor of advmod(bats, left) or similar.

Yes, this is basically what I'm struggling with in those cases - but then how do we know we are looking at such a case? Some seem more blatant than others. Especially when the items are not tagged as nominals, I find appos really jarring, and an explicitly finite verb like 'throws' is probably the worst kind of example.

perrier54 commented 1 year ago

I studied this phenomenon on the French-GSD corpus annotated in SUD but the results are easily transposable to UD. In constructions in the form (TERM1:TERM2), the two terms TERM1 and TERM2 are linked by a dependency. The direction of the dependency depends on its type. I distinguished 7 cases that I transpose below to UD with examples extracted from the French-GSD corpus. These 7 cases partially but not totally overlap with the 6 cases highlighted by @amir-zeldes.

  1. TERM2 is a predicative complement of TERM1
    • Ex: La question que nous devons nous poser ici est : comment se fait-il que les Sephiroth soient androgynes ? (The question we must ask ourselves here is: how is it that the Sephiroth are androgynous?)
    • fait -[cop]-> est; fait -[nsubj:outer]-> question
    • @amir-zeldes case 3 is similar, except that the copula is elided. @amir-zeldes case 5 can also be filed with it, and even case 6. For the "Throws: right" example, I agree that Throws -[advmod]-> right is also eligible.

  2. TERM1 introduces a direct discourse expressed by TERM2
    • Ex: Domitien répliqua : « Si ce que je dis est vrai, que les temples s’écroulent (Domitian replied: "If what I say is true, let the temples fall down)
    • répliqua -[ccomp]-> écroulent

  3. TERM1 and TERM2 are nominals referring to the same entity
    • Ex: Boutique bobo chic mais il manque l’essentiel : le professionnalisme (Boutique bobo chic but lacks the essential: professionalism)
    • essentiel -[appos]-> professionnalisme
    • We can bring @amir-zeldes case 2 closer to this case.

  4. TERM2 is a phrase expressing a development of TERM1
    • Ex: Elle n’est pas surjective : son image est l’ensemble de Cantor (It is not surjective: its image is the Cantor set)
    • surjective -[parataxis]-> ensemble
    • We can bring @amir-zeldes case 1 closer to this case.

  5. TERM1 is an adverbial or prepositional modifier of TERM2
    • Ex: L’Oberliga Süd 1953-1954 (en français : Ligue supérieure de football d’Allemagne du Sud) (The Oberliga Süd 1953-1954 (in French: Ligue supérieure de football d'Allemagne du Sud))
    • Ligue -[nmod]-> français
    • Ex: CZW’s (littéralement : L’évènement du Pistolet Agrafeur) (CZW's (literally: The Staple Gun Event))
    • évènement -[advmod]-> littéralement
    • We can bring @amir-zeldes case 4 closer to this case, even if the first term is not introduced by a preposition. In my opinion, this is the same for the example “Monday-Friday: 9am-7pm”

  6. TERM1 is a discourse marker of TERM2
    • Ex: Attention : il vaut mieux réserver le soir ! (Attention: it is better to reserve in the evening!)
    • vaut -[discourse]-> attention

  7. TERM1: TERM2 is inside a foreign expression
    • Ex: “X-Men:First Class”, réalisé par Matthew Vaughn ("X-Men:First Class", directed by Matthew Vaughn)
    • X-Men -[flat:foreign]-> First

nschneid commented 1 year ago

Thanks Guy, those are good examples.

5. We can bring @amir-zeldes case 4 closer to this case, even if the first term is not introduced by a preposition. In my opinion, this is the same for the example “Monday-Friday: 9am-7pm”

I would not expect this type of colon to be standardly used within an (unquoted) subject, so I don't think it's same thing as ordinary syntactic adnominal modification.

In my view, one function of the colon is for key-value pairs that are external to a grammatical sentence, typically in a list (or metadata section of a document). Sometimes the colon can be interpreted in multiple ways, but if it's a standalone fragment with no predication (and not a title or foreign phrase), I would go with the key-value interpretation.

amir-zeldes commented 1 year ago

Thank you both for the examples and discussion - do I read the sentiment correctly that there is support for:

  1. generally using appos between nominals when the first is the head (key-value guideline)
  2. making the content the head for language/code specification, dominating the 'code specification' before the colon
  3. using transparent compositional syntax based on valency when the head is a verb

?

If these positions are supported, this leaves open:

  1. What is the deprel for the language/code situation? @perrier54 has suggested advmod (so I guess "Latin" is interpreted like "in Latin"?). I could also imagine using obl:npmod, and currently dep is used in English GUM. The main issue I have with advmod is that the validator would then force upos=ADV on nominals like "IATA" or "Latin"
  2. What to do about the time examples - @perrier54 seems to suggest advmod, resulting in the same POS problem, and I think @nschneid is saying appos by referring to the key-value guideline; I'm maybe more inclined towards parataxis, since I don't think it's an apposition, but I'm not confident about it
  3. Do we accept nominal predication using nsubj for nominals where a predication is intended (e.g. "Spouse: Kim Smith", meaning the spouse of someone is Kim Smith), or does this fall under appos too? If so, this would apply to a substantial subset of 'key-value', but if not, we are treating some predications as fragment NPs.
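The POS concern raised in open question 1 can be illustrated with a small check. The sketch below mirrors the constraint as it is described in this thread (an `advmod` dependent is expected to be tagged `ADV`); it is not the validator's actual code, and the function name `advmod_upos_problems`, the simplified rule, and the `PROPN` tags on the toy tokens are assumptions for illustration.

```python
def advmod_upos_problems(tokens):
    """Return forms attached as advmod whose UPOS is not ADV.

    tokens: iterable of (form, upos, deprel) triples.
    Simplified assumption: only ADV is accepted for advmod.
    """
    return [form for form, upos, deprel in tokens
            if deprel.split(":")[0] == "advmod" and upos != "ADV"]


# "IATA: IFN" with the content as head, trying three labels for "IATA".
# The PROPN tags are assumed, not taken from a treebank.
for label in ("advmod", "dep", "obl:npmod"):
    analysis = [("IATA", "PROPN", label), ("IFN", "PROPN", "root")]
    print(label, advmod_upos_problems(analysis))
```

Under this simplified rule only the `advmod` variant is flagged, which is the trade-off named above: `advmod` would require retagging nominals such as "IATA" or "Latin", while `dep` or `obl:npmod` would not.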

perrier54 commented 1 year ago

Regarding the last proposals of @amir-zeldes, I agree with the first three points, although for pairs (key, value), the alternative in terms of predicative complement is admissible. For example, as @amir-zeldes initially proposed, we can have e-mail/nsubj: turismo@merida.gob.mx/root because we can say that e-mail is turismo@merida.gob.mx.

Regarding the last three open questions, I take them point by point:

  1. I did not suggest advmod; I only suggested that the first term is a modifier of the second. The precise label depends on the POS of the heads. For instance, for Latin: Eusebius Sophronius Hieronymus, the best choice is Eusebius -[nmod]-> Latin.
  2. The case of time examples is similar to the previous one. For instance, for Monday-Friday: 9am-7pm a reasonable choice is 9am -[nmod]-> Monday, because it is implied: 9am-7pm from Monday to Friday, considering that 9am-7pm behaves as a nominal, which is modified by Monday-Friday.
  3. I agree with accepting nominal predication and not apposition for the cases mentioned, because the colon plays the same role as the copula. Spouse: Kim Smith is equivalent to the spouse is Kim Smith
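One possible reading of "the precise label depends on the POS" can be sketched as a small lookup. This is only an illustration under stated assumptions: the `NOMINAL` tag set, the function name `modifier_label`, the `obl` branch (the general UD convention for a nominal modifying a non-nominal head), and the `dep` fallback are not proposals made in this thread, and the UPOS values in the examples are assumed rather than taken from a treebank.

```python
# UPOS tags treated as nominal here (an assumption for this sketch).
NOMINAL = {"NOUN", "PROPN", "PRON"}


def modifier_label(head_upos, dep_upos):
    """Pick a label for TERM1 when it modifies TERM2 (the head)."""
    if dep_upos == "ADV":
        return "advmod"
    if dep_upos in NOMINAL:
        # A nominal modifying a nominal is nmod; with any other head, obl.
        return "nmod" if head_upos in NOMINAL else "obl"
    return "dep"  # fallback when nothing more specific applies


# Head Eusebius, dependent Latin (both assumed PROPN).
print(modifier_label("PROPN", "PROPN"))
# Head évènement (assumed NOUN), dependent littéralement (assumed ADV).
print(modifier_label("NOUN", "ADV"))
```

A rule of this shape keeps the label consistent with the dependent's tag, so it avoids forcing `ADV` onto words like "Latin".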

nschneid commented 1 year ago

2. The case of time examples is similar to the previous one. For instance, for Monday-Friday: 9am-7pm a reasonable choice is 9am -[nmod]-> Monday, because it is implied: 9am-7pm from Monday to Friday, considering that 9am-7pm behaves as a nominal, which is modified by Monday-Friday.

Maybe my example of "day: hours" combinations is confusing because temporal modifiers can often be expressed without a preposition (or there is an obvious preposition that can be inferred).

Suppose instead I am giving a list of team assignments for students:

  John: Group 1
  Kim: Group 2
  Ann: Group 3

There could be many possible ways to rephrase these as sentences ("John is in Group 1" or "John belongs to Group 1" or "Group 1 includes John" or "The assignments include John in Group 1" etc.). But those paraphrases are not due to the colon notation—they're from world knowledge. The colon notation merely maps keys to values.

Does it make sense to treat these key-value pairs as headed nominal constituents? If so, would the name or the group be the head? It's not obvious to me.

These are definitely not appos by traditional apposition criteria like substitutability. If you force me to choose a real syntactic description I'd say it's paratactic rather than a regular head-dependent construction. However, since UD appos has a special use for key-value pairs, it seems to fit. (I.e., appos as a technical term is slightly broader than apposition per se.)

sylvainkahane commented 1 year ago

In case of lists, such as:

  John: Group 1
  Kim: Group 2
  Ann: Group 3

I think we should avoid purely semantic arguments that try to recover a possible paraphrase with words. If we only look at what we have, we have pairs of NPs in a list. This is very similar to what we have in gapping coordination: John is assigned to Group 1, Kim Group 2, and Ann Group 3. Could orphan be a possible solution that avoids any over-interpretation?

nschneid commented 1 year ago

I see where you're going here, but I always thought orphan was restricted to certain predicate ellipsis constructions. Here it is just a syntactically loose connection between two elements.

There's a discussion of the scope of orphan in #635—the validator currently expects it to be headed by a conj dependent.
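The restriction mentioned above can be sketched as a check that flags `orphan` dependents whose head is not itself attached as `conj`. This follows the description in this thread, not the validator's source; the function name `orphans_without_conj_head` and both toy analyses (including the invented gapping sentence and its tokenization) are assumptions for illustration.

```python
def orphans_without_conj_head(tokens):
    """Return forms of orphan dependents whose head is not a conj dependent.

    tokens: list of (id, form, head, deprel) tuples for one sentence.
    """
    deprel_of = {tid: deprel for tid, _, _, deprel in tokens}
    return [form for _, form, head, deprel in tokens
            if deprel == "orphan" and deprel_of.get(head) != "conj"]


# Gapping: "John joined Group1 , Kim Group2" (toy sentence).
gapping = [
    (1, "John", 2, "nsubj"),
    (2, "joined", 0, "root"),
    (3, "Group1", 2, "obj"),
    (4, ",", 5, "punct"),
    (5, "Kim", 2, "conj"),
    (6, "Group2", 5, "orphan"),
]

# Standalone key-value fragment "John : Group 1" analysed with orphan.
key_value = [
    (1, "John", 0, "root"),
    (2, ":", 3, "punct"),
    (3, "Group", 1, "orphan"),
    (4, "1", 3, "nummod"),
]

print(orphans_without_conj_head(gapping))
print(orphans_without_conj_head(key_value))
```

In the gapping analysis the orphan hangs off a `conj` dependent and passes; in the standalone fragment its head is the root, so a check of this kind would flag it unless the scope of orphan were widened.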

amir-zeldes commented 1 year ago

I think orphan does not solve the issues, since generally we would like to have empty nodes completing or paraphrasing the missing information in edeps, so at least for a dataset using edeps, if we cannot agree on the reconstruction or paraphrase, then we can't use orphan either.

For the 'loose connection' interpretation, I think I would also prefer parataxis, since that is the normal relation for things that are just standing next to each other with no other connection. Although the cases in the documentation such as e-mail addresses could be seen as appositions (if we don't think they are predications), IMO it is really not an apposition in cases where the key and value do not corefer, or worse, are not even nominals.

sylvainkahane commented 1 year ago

Sorry if I insist, but it is our choice to decide what the extension of orphan is. We have different constructions where two phrases form a clause and there is no predicate linking the two phrases:

I think we have the same construction in these 4 cases and would like to have the same relation. In particular the order can be changed without a big change in the meaning (I suppose as a non native speaker; at least it is the case in French):

nschneid commented 1 year ago

There is a broader disagreement here: I am of the opinion that there is a divide between syntactic relations and other textual relations that might be signaled with punctuation etc., but are not really grammatical structures as they would be in spoken language. Not sure if the UD guidelines address this directly. I will open up a separate issue to try to articulate this and see what people think.

amir-zeldes commented 1 year ago

@sylvainkahane I'm not sure I agree with equating all of those examples since, as Nathan pointed out, we can have a variety of possible paraphrases:

("John is in Group 1" or "John belongs to Group 1" or "Group 1 includes John" or "The assignments include John in Group 1" etc.)

However I do agree that we should ideally distinguish predicative cases from nominals, which we should(?) be able to tell apart. For predications, no matter how they are expressed, I think appos is wrong.

sylvainkahane commented 1 year ago

@amir-zeldes It is exactly because many paraphrases are possible, and thus we cannot even say which of the two phrases is the predicate and which is the argument, that we can only say that the two phrases form a clause together: John and Group 1 are associated in a particular way. We need a relation for that. For me, orphan was the best candidate because it is already used in gapping for that (even if in the case of gapping a "source" can be reconstructed). Maybe this issue is not the place for it, but I would like to know what you (@nschneid and @amir-zeldes) propose for my other examples (comparison and partial answer with two phrases) and whether you want a different analysis or not.

nschneid commented 1 year ago

Maybe French is more permissive with these constructions than English. The cleanest case of gapping is like:

If the first element of the list is outside of the coordination, it is more marginal (leaning on focus) and might call for commas:

While (1) is a clear case of orphan, I am not sure where (2) falls between orphan and parataxis.

The answer-to-question example and the comparison example remind me of (2). I would want to transcribe these with a comma, and regard the speaker as being extremely terse—whereas (1) is a well-established construction licensing the predicate omission.

All of those are arguably distinct from a (non-coordination) list of colon-separated pairs, which is a convention of writing or calling out items in sequence as opposed to constructing a grammatical sentence. It would be slightly unconventional to use colons in place of the commas in (2), and definitely unconventional in (1).

leky40 commented 1 year ago

I was reading all of this and got curious. Is punctuation used the same way in every language, or differently? This thread has English and French examples with a colon. Is the colon used the same way in both languages?

I have been annotating several Thai texts for the Thai treebank, and I see more and more texts and headlines using a colon and/or other punctuation. They are quite challenging to annotate with UD because the way punctuation is used in the Thai texts might follow English usage rather than Thai usage. This has made it difficult for me to choose the relation, and even to understand what the punctuation means as used.

I checked the principles of using punctuation set by the Thai language authority. A colon in Thai is used in 3 ways:

1) It is used to mean or replace the words คือ (meaning "namely") or หมายถึง (meaning "to mean").

2) It is used after the phrases "ดังนี้" and "ดังต่อไปนี้" to list the things following. These two phrases mean "the following".

3) It is used with time, e.g. 12:30 .

I am not sure whether a colon is used the same way in other languages as in Thai. I mean, as a non-native speaker writing English, when I use a colon in an English text, I need to check how it is used in English and what it means as used.

So what I am curious about is this: when we annotate a text with a colon and/or other punctuation, should we also consider how it is used in that language and what it means in context? And if so, would this affect the choice of UD relation between the two parts separated by a colon and/or other punctuation?

FYI, punctuation is rarely used in written Thai. Whitespace is not used between words as it is in English and French. Plus, there are no capital letters.

My point here is that when the use of a colon and/or other punctuation in one language is influenced by another language, the text can be difficult to read and the annotator may hesitate over how to annotate it, as I am experiencing with the Thai texts I have been annotating.

And I was wondering which part around a colon should be the head. This is from the statement that @nschneid mentioned above:

John: Group 1 Kim: Group 2 Ann: Group 3 There could be many possible ways to rephrase these as sentences ("John is in Group 1" or "John belongs to Group 1" or "Group 1 includes John" or "The assignments include John in Group 1" etc.). But those paraphrases are not due to the colon notation—they're from world knowledge. The colon notation merely maps keys to values. Does it make sense to treat these key-value pairs as headed nominal constituents? If so, would the name or the group be the head? It's not obvious to me

His examples are nominal constituents for which, I agree, the head is not obvious. What if the two parts around a colon are not nominal constituents, but a verb phrase and a sentence? Should the part before or the part after the colon be the head? I am having this difficulty deciding the head in my annotated Thai texts too. I cannot decide which one should be the head and which word should be the root.

Stormur commented 12 months ago

Just coming extremely late, but (mostly agreeing with the remarks/proposals by @sylvainkahane) I would like to add:

amir-zeldes commented 12 months ago

presence or absence of a colon should not change the analysis of some constructions

I don't think that's true without exception - sometimes a string is syntactically ambiguous and the colon can point to one analysis (I intuitively parse a "court martial" as an NP, but if it's "court: martial" it becomes something else)

conj:expl

Using a special subtype is indeed an option, but it would be a rather rare subtype so I'm not enthusiastic about doing it for English (and the string expl maybe makes people think of the expletive label, which is separate)

Stormur commented 12 months ago

presence or absence of a colon should not change the analysis of some constructions

I don't think that's true without exception - sometimes a string is syntactically ambiguous and the colon can point to one analysis (I intuitively parse a "court martial" as an NP, but if it's "court: martial" it becomes something else)

Yes, it points to something, and it can be a clue for some of those things that are very difficult to represent in writing (e.g. prosody), but I meant that it cannot be a decisive factor, nor be mechanically applied!

conj:expl

Using a special subtype is indeed an option, but it would be a rather rare subtype so I'm not enthusiastic about doing it for English (and the string expl maybe makes people think of the expletive label, which is separate)

From our data it is not rare at all :-)