UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Apostroph Tokens #765

Closed jheinecke closed 3 years ago

jheinecke commented 3 years ago

Welsh, like the other Celtic languages, has initial consonant mutations, which are triggered either by a preceding preposition, negation or pronoun, or by syntactic function (e.g. temporal adverbials). Sometimes the triggering word is shortened to "'" or disappears, but the mutation is still there, which indicates the inherent presence of the absent word. How should we annotate this, e.g. achos 'mod i'n ddi-waith ers ychydig ("because I have been unemployed for a short time", lit. "because my being unemployed ...")? The token which disappears is fy ("my") in fy mod, but the mutated form mod instead of bod is still there and indicates the subject (1st person singular). I guess it could be annotated like dropped pronouns in languages like Spanish (canto "I sing"). But how do we tokenize and annotate the apostrophe?

1   achos   achos   SCONJ              (because)
2   '   fy  PRON               (my)
3   mod bod NOUN    verbnoun   (being)

colinbatchelor commented 3 years ago

There is an old convention in Scottish Gaelic where if the particle before the verbal noun is dropped, this is indicated with an apostrophe at the beginning of the verbal noun, like this (from sentence ns03_004 in the Scottish Gaelic treebank)

12  bha bi  VERB    V-s Tense=Past  9   acl:relcl   _   _
13  '   ag  PART    Sa  _   14  case    _   SpaceAfter=No
14  tadhal  tadhail NOUN    Nv  VerbForm=Vnoun  12  xcomp:pred  _   _
15  air air ADP Sp  _   17  case    _   _
16  an  an  DET Tdsm    Gender=Masc|Number=Sing 17  det _   _
17  ospadal ospadal NOUN    Ncsmd   Case=Dat|Gender=Masc|Number=Sing    14  obl _   _

written bha 'tadhal air an ospadal and meaning "were visiting the hospital".

If it's absent in the text then I don't mark the absence, just as the relative particle a is sometimes dropped, especially in speech.
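
To make the role of SpaceAfter=No in the ns03_004 excerpt above concrete, here is a minimal Python sketch that rebuilds the written form from the FORM and MISC columns. The row list and the `detokenize` helper are illustrative assumptions, not part of any treebank tooling.

```python
# (FORM, MISC) pairs copied from tokens 12-17 of the excerpt above.
ROWS = [
    ("bha", "_"),
    ("'", "SpaceAfter=No"),
    ("tadhal", "_"),
    ("air", "_"),
    ("an", "_"),
    ("ospadal", "_"),
]


def detokenize(rows):
    """Rebuild surface text from (form, misc) pairs, honouring SpaceAfter=No."""
    pieces = []
    for form, misc in rows:
        pieces.append(form)
        # MISC is a '|'-separated attribute list, or '_' when empty.
        attrs = [] if misc == "_" else misc.split("|")
        if "SpaceAfter=No" not in attrs:
            pieces.append(" ")
    return "".join(pieces).rstrip()


print(detokenize(ROWS))
```

The apostrophe is its own token with lemma ag, yet the text still round-trips to the written form because the missing space is recorded in MISC.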

manning commented 3 years ago

Tokenization is an area where UD doesn't have a strong standard, only vague guidelines – maybe that should be changed! – so it's sort of up to you what seems most sensible. But what you suggest looks completely fine.

My Celtic knowledge is limited to reading a few linguistics papers 25 years ago, but I think the two viable choices in UD would be to have it just as you do or to regard 'mod as a single word/token, which would then have the possessive expressed using morphological features. However, since in other cases you do get fy or 'y, going with the analysis you suggest seems the most sensible option to me.

Stormur commented 3 years ago

From the description of it in Welsh, it seems that the apostrophe is used conventionally to mark an effect on the following word: so I understand that a mod from bod would not be justified alone in that context, and so 'mod signals that something else has to be implied.

I feel uncomfortable with the idea of detaching ' from a word and treating it as a placeholder for something that actually is not there... If this kind of ellipsis does not create a gap in the sentence structure but is just peripheral, I would just keep it as a single token and adjust its features accordingly, as @manning says, and consider that the implied element simply is not present. I mean, I am not sure that a graphical convention warrants making a token out of it: I would leave what pertains to the token together with the token.

jheinecke commented 3 years ago

'mod is not standard orthography. It should be fy mod but in speech or in a more colloquial style the fy is often omitted, but the mutation is kept. So it is not the apostrophe which triggers the mutation, but the word for which it stands

ftyers commented 3 years ago

Same in Breton :)

# sent_id = war-bont.vislcg.txt:2:28
# text = Me 'welas ur verjelenn war ar pont o ouelañ.
# text[fra] = Je vis une bergère qui pleurait sur le pont
# labels = to_check song
1   Me  prpers  PRON    prn Case=Nom|Number=Sing|Person=1|PronType=Prs  3   nsubj   _   _
2   '   a   AUX vpart   _   3   aux _   SpaceAfter=No
3   welas   gwelout VERB    vblex   Number=Sing|Person=3|Tense=Past|VerbForm=Fin    0   root    _   _
4   ur  un  DET det _   5   det _   _
5   verjelenn   berjelenn   NOUN    n   Gender=Fem|Number=Sing  3   obj _   _
6   war war ADP pr  _   8   case    _   _
7   ar  an  DET det _   8   det _   _
8   pont    pont    NOUN    n   Gender=Masc|Number=Sing 5   nmod    _   _
9   o   o   AUX vpart   _   10  aux _   _
10  ouelañ  gouelañ VERB    vblex   VerbForm=Inf    3   advcl   _   SpaceAfter=No
11  .   .   PUNCT   sent    _   3   punct   _   _

Stormur commented 3 years ago

> 'mod is not standard orthography. It should be fy mod but in speech or in a more colloquial style the fy is often omitted, but the mutation is kept. So it is not the apostrophe which triggers the mutation, but the word for which it stands

Yes, this is clear! I was not meaning that the apostrophe is causing the mutation, but that it just (maybe uncanonically) marks this otherwise unjustified mutation on bod<>mod, rather than replacing the omitted fy.

I see many similarities in the way that Italian, French, English, and many other languages, use ' to mark some variations in pronunciation/spelling, or also deviations from the standard in colloquial/dialectal variants; e.g. canonical it. l'albero instead of lo albero 'the tree', or it. 'ngiorno or even 'giorno instead of buongiorno 'good morning'. That is, it is only an orthographic convention which might or might not be there: we might just write l albero, or even lalbero, or l-albero Maltese style, or ngiorno or giorno and so probably one could just write achos mod... , right (maybe in a hasty, sloppy style or else)? From what I have understood, the fact that ' is not really standing for any word, but only for the effect of a dropped one on another word, would make me lean towards not "restoring" the missing word and keeping the token together. Maybe I am missing something, but I hope my point is clearer now! :slightly_smiling_face:

sylvainkahane commented 3 years ago

I agree with @Stormur. Working mainly on spoken languages, I don't really like orthographic tricks and I think that what must be analyzed is the language itself, which is spoken, and not its orthographic transcription. Here we have one word, /mod/. It doesn't matter if we write it mod or 'mod, there is one word. Now there are two ways to analyze this word, as a form of bod, or as an amalgam of fy and bod. In the latter case it looks quite similar to your first solution, where fy is the lemma of ', but it is not exactly the same thing, because we don't introduce a zero word.

colinbatchelor commented 3 years ago

I have thought about this and I've realised I've been inconsistent. ARCOSG tags the prefixes h- and t- and n- and dh' as separate tokens, but I ignored this and combined them with the words they prefix as per the Gaelic Orthographic Conventions.

I think the correct thing to do is to treat 'dol as a single word. I will sort this out next.

jheinecke commented 3 years ago

Thanks for all your comments. So I think the most consistent solution is then to tokenize ' as a form of fy if present. If absent, no empty word whatsoever. @colinbatchelor what does the apostrophe in 'dol stand for?

colinbatchelor commented 3 years ago

> Thanks for all your comments. So I think the most consistent solution is then to tokenize ' as a form of fy if present. If absent, no empty word whatsoever. @colinbatchelor what does the apostrophe in 'dol stand for?

It indicates that someone would have said a' or ag had they been speaking more slowly or more carefully or wanted the poem to scan differently.

ClaudiaCorbe commented 1 year ago

Hi,

while annotating a literary text in Old Italian, I came across some problems with the tokenization of some prepositions with apostrophe, where the apostrophe has the function of signaling the absence of another element, namely the article, as in:

a' nemici = ai nemici (to the enemies)

I was wondering how to deal with similar cases: should I tokenize and analyze ' as the article (see below case 1) or should I consider a' as a simple preposition "a" (see below case 2)?

a' nemici
1-2 a'  _   _              
1   a   a   ADP             
2   '              

ai nemici (case 1)
1-2 ai  _   _              (to+the)
1   a   a   ADP               (to)
2   i   il  DET    (the)

a' nemici (case 2)
1   a   a   ADP               (to)

Since this is an ancient language (Old Italian), the punctuation has been added by the editor: the manuscripts of the text are written in scriptio continua.

dan-zeman commented 1 year ago

I suppose that the first question is whether you want to go by the original manuscript, or trust the editor and keep the apostrophe (which I understand not so much as a question of the punctuation but rather of the existence of the unexpressed article as such).

Then if the apostrophe is there and if a' is to be understood as ai, the following annotation is expected:

1-2 a'  _   _   _   _   _   _   _   _
1   a   a   ADP _   _   3   case    _   _
2   i   il  DET _   Definite=Def|Gender=Masc|Number=Plur|PronType=Art   3   det _   _

Note that the form on the "1-2" line must correspond to the surface form. Hence, your "case 1" can only be applied if the underlying text has the full ai, not just a'. Even then I believe that you swapped form and lemma of the article, so it should actually be

1-2 ai  _   _   _   _   _   _   _   _
1   a   a   ADP _   _   3   case    _   _
2   i   il  DET _   Definite=Def|Gender=Masc|Number=Plur|PronType=Art   3   det _   _

Also note that the multi-word token mechanism does not require that the forms of the syntactic words within the MWT are proper substrings of the surface form of the MWT. Therefore in both the annotations above the actual underlying forms a and i are shown, despite the surface form, which is a' in the first case and ai in the second case.
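
As a rough illustration of the two levels described above, the following Python sketch reads the a' annotation and separates surface tokens from syntactic words. The third row (nemici with HEAD 0 and deprel root) is a placeholder added only so that the fragment is self-contained, and `split_surface_and_words` is a hypothetical helper, not part of any UD tool.

```python
CONLLU = """\
1-2\ta'\t_\t_\t_\t_\t_\t_\t_\t_
1\ta\ta\tADP\t_\t_\t3\tcase\t_\t_
2\ti\til\tDET\t_\tDefinite=Def|Gender=Masc|Number=Plur|PronType=Art\t3\tdet\t_\t_
3\tnemici\tnemico\tNOUN\t_\tGender=Masc|Number=Plur\t0\troot\t_\t_
"""


def split_surface_and_words(conllu):
    """Return (surface_tokens, syntactic_words) for one sentence."""
    surface, words = [], []
    covered_until = 0  # highest word ID already covered by a multiword token
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Range line: FORM is the string as it appears in the text.
            covered_until = int(tok_id.split("-")[1])
            surface.append(form)
        elif "." in tok_id:
            continue  # empty node, only relevant for enhanced dependencies
        else:
            words.append(form)
            if int(tok_id) > covered_until:
                surface.append(form)
    return surface, words


surface, words = split_surface_and_words(CONLLU)
print(surface)  # ["a'", 'nemici']
print(words)    # ['a', 'i', 'nemici']
```

The word form i never occurs in the surface string a', which is exactly the situation the multiword token mechanism allows.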

Stormur commented 1 year ago

I think that this is again a case in which punctuation is misleading, and an instance of what I would describe as "prescriptive zeal" on the part of an editor.

That is, we observe that, in this variety of Italian, due to phonological processes the word ai /ai̯/ has been reduced to a /a/, and so the definite article is simply not present anymore; but in a more "standard Italian" we expect it to be there, and so the editor uses a mark like ' to show that "there should be an i there". But there isn't anything! This apostrophe is simply marking an absence. So I would just propose an annotation such as

1   a'  a   ADP _   _   2   case    _   _
2   nemici  nemico  NOUN    _   Gender=Masc|Number=Plur n   obl _   _

I would even go so far as to detach the apostrophe from the adposition in tokenisation. Anyway, in general I would not annotate what is not there, in a totally similar way to the discussion about Celtic languages at the beginning of this issue.


Now, an interesting issue that we were reasoning about is whether in this variety of Italian sequences like a nemici (indefinite) and a' nemici (definite) are indeed distinguished somehow. My suspicion was that raddoppiamento sintattico ('syntactic gemination') might be involved here. Briefly, some words in some varieties of Italian trigger a geminated pronunciation of the initial consonant of the following word: the preposition a 'to' is one of those (owing to its origin from Latin ad). So I wonder if we have a "minimal pair"

I am not sure where this could be annotated, maybe under MISC as SyntacticGemination=Yes? It is interesting in general because syntactic gemination makes a difference, but it probably has to be annotated at a different level than morpholexical features (and it is quite language-specific). From the little I have seen going through Italian treebanks, syntactic gemination is not annotated in any of them.

In sum, it might be that there is still a difference between a definite and an indefinite construction, but it has shifted from a morphological level (presence/absence of DET i) to a purely lexical + syntactic level, distinguishing a geminating a from a non-geminating a. In this light, we could actually consider associating Definite=Def with the latter.

francescomambrini commented 1 year ago

@ClaudiaCorbe I would definitely recommend Dan's solution:

1-2 a'  _   _   _   _   _   _   _   _
1   a   a   ADP _   _   3   case    _   _
2   i   il  DET _   Definite=Def|Gender=Masc|Number=Plur|PronType=Art   3   det _   _

This a' is certainly = ai, 'a' + 'i', as is still quite normal, by the way, in modern-day Tuscan varieties of Italian. The definite article is elided for phonological reasons, but is clearly there from the standpoint of syntax! This is no different from the regular ending -ai of the 1st person past ("passato remoto") in:

Tacette allora, e poi comincia' io (Dante, Inferno 2, 75-6)

I would certainly want your passage to show up in the results if I queried for all the noun phrases with definite articles!
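
To show what is at stake for such a query, here is a small Python sketch that searches two candidate annotations of a' nemici for definite articles. The row tuples are reduced to ID, FORM, UPOS and FEATS, the NOUN features are borrowed from the single-token proposal earlier in the thread, and `parse_feats` and `definite_articles` are hypothetical helpers written for this comparison only.

```python
# Multiword-token analysis: a' is split into ADP a + DET i.
MWT_ANALYSIS = [
    ("1-2", "a'", "_", "_"),
    ("1", "a", "ADP", "_"),
    ("2", "i", "DET", "Definite=Def|Gender=Masc|Number=Plur|PronType=Art"),
    ("3", "nemici", "NOUN", "Gender=Masc|Number=Plur"),
]

# Single-token analysis: a' is one ADP and no article is annotated.
SINGLE_TOKEN_ANALYSIS = [
    ("1", "a'", "ADP", "_"),
    ("2", "nemici", "NOUN", "Gender=Masc|Number=Plur"),
]


def parse_feats(feats):
    """Turn 'A=x|B=y' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def definite_articles(rows):
    """Return (ID, FORM) for every syntactic word that is a definite article."""
    hits = []
    for tok_id, form, upos, feats in rows:
        if "-" in tok_id:
            continue  # range lines carry no morphology
        f = parse_feats(feats)
        if upos == "DET" and f.get("Definite") == "Def" and f.get("PronType") == "Art":
            hits.append((tok_id, form))
    return hits


print(definite_articles(MWT_ANALYSIS))           # [('2', 'i')]
print(definite_articles(SINGLE_TOKEN_ANALYSIS))  # []
```

Under the single-token analysis the passage would only be found if the query also looked for definiteness somewhere other than on a DET.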

Stormur commented 1 year ago

> but is clearly there from the standpoint of syntax!

Are you also thinking about syntactic gemination or is there something else?

francescomambrini commented 1 year ago

I'm not sure about the gemination, because I am not a specialist in that field or language variety. But I'd say that it is "there" exactly like the ending -ai is "there" in the comincia' of the passage quoted: a' (= a+i, preposition + article) and a (simple preposition) are two different words that just happen to be homographs in that specific passage on account of a "phonotactic accident". The proof is that the editors feel the need to use two different spellings.

Anyway, that's also the point. I don't know if the modern editors are being "overzealous", as I am not an expert of Old Italian philology. But you have to respect the standard modern editorial norms: apostrophe means elision. Either you read a' and that means that you take it as an elided ai and annotate accordingly, or you don't and just read a. Reading a' (with mark of elision) and annotating as simple ADP doesn't make much sense, IMO.

Stormur commented 1 year ago

One problem with this elision/apocope is that it does not leave anything behind, and I find this to be the trickiest point, as it was for the Celtic examples (which I have to go through again).

The case of ai/a' is different from other cases like (I am making this specific example up, but @ClaudiaCorbe will surely suggest an actual one from the Commedia) andaron 'they went' instead of a "more standard" andarono: here we can see just a variation between the suffixes -arono and -aron, but the suffix is still there. The case of comincia' io is indeed more interesting.

A point which I think should be investigated to "solve" the case of ai/a' is the distribution of these different forms: if we see that this elision/apocope always happens in the same contexts, then I would indeed also be inclined towards the multiword analysis of a'. But if it has other, maybe less predictable patterns, or happens in all contexts, I would be less sure of how to annotate it. The verbal form comincia' io is probably more straightforward because it looks like a contextual alternation (triggered by -i i-), as you note.

PS: OK, let's not talk of "overzeal", but maybe of "overinterpretation" :slightly_smiling_face:

ClaudiaCorbe commented 1 year ago

Hi,

Thank you very much for your comments and suggestions. I need to carefully examine whether the cases without apostrophes are contextually motivated or simply "graphic fluctuations," as the following examples seem to suggest (or are they due to the metrical pattern?):

faceva ai piè (Inf. XVI, v. 27) vs il volto a’ piè (Inf. XXXIV, v. 15)
dinanzi ai tre (Inf. IV, v. 87) vs quando vengono a’ due punti (Inf. VII, v. 44)

I will definitely go with the solution suggested by @dan-zeman, namely

1-2 a'  _   _   _   _   _   _   _   _
1   a   a   ADP _   _   3   case    _   _
2   i   il  DET _   Definite=Def|Gender=Masc|Number=Plur|PronType=Art   3   det _   _

Also, I have come across cases of apocope that seem to be more tricky. I'm referring to instances where the final vowel, which may be considered to indicate the plural/singular form, is deleted. What do you suggest I do in these cases? Should I avoid introducing the feature "Number," even if it can be inferred from the context (such as agreement with the article/adjective/verb), or should I include the "Number" feature?

"ma i demon che del ponte avean coperchio" (Inf. XXI, v. 47)

francescomambrini commented 1 year ago

Hi Claudia, I'd say you don't need to review whether the cases are contextually motivated, for a few good reasons (it doesn't hurt, obviously, but I'd start elsewhere):

  1. This type of apocope is (as I said) typical of modern Tuscan dialects as well: in my uncle's village people would normally say "a' bimbi" when it is absolutely patent that they mean "to the children" (definiteness added); yes, I'm sure of it, I'm a native speaker :-) I am also sure (but, again, I'm not an expert) that tons of dialectologists have already studied this before us; I'd start from there.

  2. The idea that a poet is bound to consistency (phonetic, graphical, linguistic, whatever) is very dangerous and should be handled with great care. This has led to disasters in the critical editions of Greek and Latin authors. Of that I am an expert.

  3. But more importantly (and to me, this is a real killer argument): the principle that we must respect the text that we are annotating is non-negotiable. And in this case it means: either you edit your own text and leave the apostrophe out, or you annotate it for what it is supposed to mean. There is no third option here.

Stormur commented 1 year ago

I will try to make some points I mentioned earlier clearer, after some discussion I had about the topic with other persons.

The main one is: I think that the possible annotation of the definite article form i in the apocopated a' should be left for enhanced dependencies as an "elided token". Putting it in the "basic dependencies" would mean giving substance to a sign which denotes the absence of something, and so, against UD principles, "annotating what is not there". Also, this is not a case of contextual elision.

The second is: the absence is intended to be of an article form, not of a mark of definiteness. But this marking apparently takes place through other means (the first suspect is syntactic gemination), which deserve to be annotated at some level somehow. The attested Definite=Def also needs to be accommodated somewhere, and the most reasonable place (in the basic annotation) should be the preposition itself (so we have something like a "definite preposition" against an "indefinite", or maybe "neutral", one).
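
For readers less familiar with how CoNLL-U keeps such an "elided token" apart from the basic tree, here is a minimal Python sketch of the three ID shapes the format uses: plain integers for syntactic words, ranges for multiword tokens, and decimals for empty nodes, which belong to the enhanced representation only. The ID sequence 1, 1.1, 2 is a hypothetical layout for a' plus an elided article plus nemici; whether such an empty node fits the UD guidelines is not settled in this thread, and `classify_id` and `basic_tree_ids` are illustrative helpers.

```python
import re

# ID shapes as described in the CoNLL-U format documentation.
WORD_ID = re.compile(r"[1-9][0-9]*")
RANGE_ID = re.compile(r"[1-9][0-9]*-[1-9][0-9]*")
EMPTY_ID = re.compile(r"[0-9]+\.[1-9][0-9]*")


def classify_id(tok_id):
    """Name the kind of line a CoNLL-U ID belongs to."""
    if WORD_ID.fullmatch(tok_id):
        return "word"
    if RANGE_ID.fullmatch(tok_id):
        return "multiword token"
    if EMPTY_ID.fullmatch(tok_id):
        return "empty node"
    raise ValueError(f"not a CoNLL-U ID: {tok_id!r}")


def basic_tree_ids(ids):
    """Keep only the IDs that take part in the basic dependency tree."""
    return [i for i in ids if classify_id(i) == "word"]


# Hypothetical layout: a' (1), elided article as empty node (1.1), nemici (2).
HYPOTHETICAL_IDS = ["1", "1.1", "2"]
print([classify_id(i) for i in HYPOTHETICAL_IDS])
print(basic_tree_ids(HYPOTHETICAL_IDS))
```

With this layout the basic tree would contain only a' and nemici, while the article would be visible only to tools that read the enhanced graph.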


Now, this comes from a "paradigmatic point of view" which I would like to put forward. I suspect it is misleading to consider definiteness necessarily tied to the presence/absence of an article. What we are considering here is rather the construction of functional elements, in this case ADP (with case), plus Definiteness. So, while we see that in the cells of this "paradigm" most of the time we have it expressed as an article (al, alla, a un, maybe a degli and so on... but also not explicitly expressed, as in a possible a persona 'to [any] person'), it happens that under (apparently phonologically motivated) circumstances, in this variety, definiteness can be conveyed through other means. In sum: the definite article is not necessary if the definite construction can work in other (possibly alternative) ways, too.

I think that this approach does respect the text. Tokenisation and everything else is left unaltered; further, it allows us to represent something which this orthographic notation is precisely an indicator of: the apostrophe is telling us that while we should think of this case as analogous to an occurrence of ai, it still is not the same: something else is happening here (and in other cases like ne', co', e' ...). We are not annotating different things in the same way (referring to another principle of UD)!