Closed SocratesVak closed 6 months ago
This sounds reasonable to me. Features can be defined as language-specific for Greek.
These were data from two dialects and the Standard. More dialects are coming in, and we will probably need to add features over time. Thanks indeed for the guidance.
I believe there are similar phenomena in other languages, which could perhaps justify a more universal addition.
Sometimes this means the addition of extra characters (the lemma is the "full" form), as was described above for Greek dialects.
English: "an" is a euphonic form of "a" used before a vowel. There is a closed box. There is an open box. In Middle English there is a similar phenomenon with regard to possessive pronouns. This is mi closed box. This is min opene box.
German: Similarly with the indefinite article in various German dialects. Bavarian: A Må An Åmd. Yiddish: אַ מאַן אַן אָוונט
Romanian: A final "u" can be added to gerunds in certain cases. Oi fi mâncând mâncări. Oi fi mâncându-l.
Turkish: There is often an epenthetic "y" when adding a suffix starting with a vowel to a word also ending with a vowel: Otobüsü sürüyorum. Arabayı sürüyorum. There is an epenthetic "n" appearing between a 2nd or 3rd person singular possessive suffix and the definite article: Arabasını sürüyorum.
Other times this means the removal of part of a word (the lemma is the "reduced" form), which is also observed in Greek dialects.
Cretan Greek: Ιντά 'γινε. Δεν έγινε.
Italian: There are often shorter forms of words in the general vocabulary depending on the phonological environment (along with other considerations). Così fan tutti. Così fanno spesso. Va be' ragazzi. Li conoscevo bene. And in specific words: Parlo italiano e tedesco. Parlo italiano ed esperanto. Dallo ad Adamo. Dallo a Giacomo.
Romanian: Many clitics can drop vowels, mainly according to their phonological environment. Îl ridic. L-am ridicat. Initial "î" can also be dropped from words in the general vocabulary in many cases: Nu-nțeleg nimic. Înțeleg totul.
Spanish: The indefinite article as well as some adjectives only have an ending when not directly preceding the words they determine: El hombre es uno. Es un hombre. El hombre es bueno. Es un buen hombre.
There are also some other cases where these phenomena are only oral and aren't generally represented in text, but may come to be once dialectal material starts being worked on (as happened in Greek). These would include the pronunciation or not of syllable-final "s" in Spanish dialects, various final consonants and schwas in French (liaison), the final "n" after a schwa in Dutch, final consonants including "t" and "g" in Swedish, the schwa in Albanian (written in the standard) and Armenian (not written in the standard), the epenthetic "i" in consonant clusters in Brazilian Portuguese, and the tāʼ marbūṭa in Arabic.
Thanks for the overview. I guess I do not accept all the examples (for example, in Turkish, you don't really need a feature to distinguish sürüyorum from sürüorum because the latter does not exist, right?) but I agree that similar phenomena occasionally occur in various languages (FWIW, in Czech we use AdpType=Voc to distinguish the vocalized (syllabified) forms of prepositions ve, se, ze... "in, with, from..." from their asyllabic forms v, s, z...).
What is unclear to me is 1. whether all such phenomena are "the same thing that should get the same annotation across UD", and 2. whether there would be sufficient demand to use them so that we should standardize them in the universal guidelines (for example, I don't think that English a is distinguished by a feature from an).
UD takes the bottom-up approach. Languages define their own features but occasionally, if a feature is used in several languages (or worse, if it is found that several languages have defined different labels for it), it may be promoted to the universal guidelines.
@andhmak those are some nice examples, thanks for listing them! I think there are definitely some commonalities, and many languages have environments in which sound adaptations may happen, but they can have rather different stories, and a lot of it is a matter of interpretation.
To some extent I think lemmatization already groups many of the forms you're proposing as euphonic variants, in that the lemma of say "a" and "an" is the same (and I assume the same is true for Czech "v" and "ve"). There are many reasons why lemmas have several forms, sometimes it's inflection, sometimes it's purely euphonic reasons, and sometimes it's something else, but I don't think those can all be lumped under one annotation.
For example, diachronically, "an" is not a euphonic version of "a" - if anything it's the opposite, the old form is "an", and the "n" is dropped unless the next word starts with a vowel. So we can decide that synchronically it is a euphonic variant, but that is not correct from a historic point of view. Similarly Slavic vocalized preposition forms are more or less the older form from when they were syllabic, and not an innovative variant. In some cases they even restore and re-analyze dropped consonants just like English "an": Polish "w nim" = "in it" comes from wn + im, and other prepositions do this analogically, e.g. "za nim" = "behind it", where the "n" has no etymological source. So here too, we could say that "za nim" is a case of euphonic change, but really it's about paradigm leveling and analogy.
I agree with @dan-zeman that these kinds of analyses need to be done on a language-by-language basis, because treebank designers have more context to make appropriate decisions that correspond to what users expect - for example I'm sure Polish users would be very surprised if the lemma of w/we "in" were the long form, although it is older. Maybe the best solution is to document proposals in misc.html, and as Dan said, if many examples come together, we can argue to name their annotation key the same in all cases.
When working with dialects where a treebank of the "standard variety" is available, as is the case for Modern Greek, an easy way to answer the lemma question (and one that is probably desirable for transfer) is to say that a certain form is a lemma in Standard Greek and that the forms we find in the dialects are somehow affected by euphonics, phonological procedures, etc. The use of MGloss is a symptom of this strategy. Of course, if none of these solutions accounts for the forms we find, then a new lemma is defined that might be very similar to the lemma of the Standard, but we want to minimise the number of these cases. I am just waiting for more dialects and data to see how far this approach takes us. All advice and experience is welcome.
For an annotation project of Italian following a "UDoid" style I have been using the (not officially existing) feature Variant=Apoc (for "apocopated") in MISC. This is what happens in cases like:
Così fan tutti. Così fanno spesso. Va be' ragazzi. Li conoscevo bene.
(Even if, to be nitpicking va be' is not really the same as va bene '(all) is well', and it is often univerbated as vabbe', but the syntactic words are those for sure.)
I think that probably this deserves to be annotated, alongside epentheses like ed instead of e in front of a vowel ("euphonic d"), especially since they are very often optional.
But I am still not totally convinced, because we might regard cases like fan and fanno, or more precisely the suffixes -(a)n and -(an)no, as variants in the same way that I would regard ü and yı in otobüsü and arabayı (of a morph "i⁴"). In one case they are in rather free variation, while in the other they are not, so this might be a discriminating factor... but for all of these respective forms there is no viable ambiguity or alternative interpretation. And in the end, the same could be said for other epentheses. So I tend to agree with @dan-zeman and @amir-zeldes that these cases are in a sense already treated by lemmatisation and other features.
More generally, it seems to me that morphological features in UD aim to annotate (most of the time) morphosemantic features rather than purely formal ones. So we are interested in annotating the variation of Case, Number etc. if it also takes place in the form, but not purely formal variation. Referring also to the first post, we probably do not want them to get mixed in the same field: so all similar features, including Voiced and VoicedLemma as proposed in the first post, together with the already existing Style, Variant, etc., should be moved into MISC, unless voicing etc. also implies the value of a grammatical category, in which case that value is marked as such. I would not annotate phenomena which do not occur in writing and/or do not have a "grammatical impact".
As for universalising similar features, I am pretty convinced this would be straightforward. These are occurring in every language and they can be described in rather simple terms (left- or right-extended [~euphonic], truncated, voiced...), interacting with canonical forms, synchronically. So this means that if a is the lemma in English, an is a "suffixally extended" form; and if w is the lemma of the preposition in Polish, then the pronominal form nim will be "prefixally extended" (if I understood it well). I doubt we can put too much diachrony in what I think is meant to be a synchronic annotation.
This issue might be related to #765
Certainly, the features we have defined so far (Voiced, VoicedLemma) have no semantic bearing. One main reason for using these features rather than using completely independent lemmatisation for each dialect is that we want to maintain a strong reference to the Standard without missing the dialectal information. Taking this argument to the extreme, the voiced and the unvoiced types could be assigned unrelated voiced and unvoiced lemmas respectively. Would that be wise given the small amounts of data available for these varieties? Alternatively, we could ignore these differences and assign identical annotations to voiced and unvoiced (and truncated and...) types. But then the annotation of the dialects would not encode many important differences between them and the Standard. Wouldn't those models be biased and return funny results? All advice is welcome. Lastly, it seems to me that these features form a family of their own. As regards the documentation of Greek dialects, we can make a special section in [greek-misc], describe the features there and, consequently, encode them in the 10th column of CoNLL-U. In practice, they will turn up as features of the lemma and/or as morphological features.
I am not sure I am grasping the problem here.
I am in favour of using a standard lemma, if this can be identified, as you suggested in your post. Variation in form would stay there; I do not think information would get lost if it is present in the spelling. In a sense, it could be better to have the exact same annotation Case=Nom|Gender=Neut|Number=Plur for μπράγματα and πράγματα and then discover this variation across or within treebank(s)... no? I do not think that a model would learn differently if we put Voiced etc. notations under morphology, as these seem more directed towards human queries to me.
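The idea of "discovering this variation" from identically annotated tokens can be sketched in a few lines. This is only an illustration, assuming tokens have already been read into (form, lemma, FEATS) triples; the function name `variant_groups` and the sample triples are hypothetical and not taken from any treebank.

```python
from collections import defaultdict

def variant_groups(tokens):
    """Group forms by (lemma, feats) and keep only keys with several distinct forms."""
    forms = defaultdict(set)
    for form, lemma, feats in tokens:
        forms[(lemma, feats)].add(form)
    # A key with more than one form signals purely formal variation.
    return {key: sorted(fs) for key, fs in forms.items() if len(fs) > 1}

# Hypothetical sample: both plural forms share lemma and features.
tokens = [
    ("πράγματα", "πράγμα", "Case=Nom|Gender=Neut|Number=Plur"),
    ("μπράγματα", "πράγμα", "Case=Nom|Gender=Neut|Number=Plur"),
    ("πράγμα", "πράγμα", "Case=Nom|Gender=Neut|Number=Sing"),
]
groups = variant_groups(tokens)
```

With this kind of query the voiced/unvoiced pair surfaces without any dedicated feature, which is the trade-off being discussed: the information is recoverable from the spelling, but only by comparing forms after the fact.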
One main reason for using these features rather than using completely independent lemmatisation for each dialect is that we want to maintain a strong reference to the Standard without missing the dialectal information
We are facing a similar situation with a treebank of Bohairic Coptic that we're working on, and our approach currently is to use the lemma field to refer to the Bohairic dialect lemma (so, the uninflected base form, but as it is used in the dialect), and a separate MISC annotation called HyperLemma to store the corresponding form in the "standard" classical dialect, which is Sahidic Coptic.
I think always using the standard dialect form as the lemma is problematic, first because then you don't have the dialect lemma anywhere, and second because some dialectal words don't actually have a corresponding lemma in the classical dialect, so we don't want to invent a Sahidic lemma for a Bohairic word that is unattested in Sahidic. I would guess the same would apply to Greek - for example for Aeolic:
Not sure if that's helpful, but at least for Coptic where there are quite a few distinct words across the dialects, this seems necessary for our use case.
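As a rough sketch of how the two lemmas could be read back from a token line: the code below assumes the standard ten tab-separated CoNLL-U columns, with LEMMA in column 3 and MISC in column 10, and a `HyperLemma=...` entry in MISC as described above. The helper name `lemma_pair` and the placeholder strings are illustrative only, not real Coptic data.

```python
def lemma_pair(conllu_line):
    """Return (dialect_lemma, hyper_lemma_or_None) for one CoNLL-U token line."""
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    lemma, misc = cols[2], cols[9]
    # MISC is "_" when empty, otherwise pipe-separated Key=Value items.
    misc_map = {}
    if misc != "_":
        misc_map = dict(kv.split("=", 1) for kv in misc.split("|") if "=" in kv)
    return lemma, misc_map.get("HyperLemma")

# Placeholder token: the strings stand in for real dialect and standard lemmas.
with_hyper = "\t".join(
    ["1", "FORM", "DIALECT_LEMMA", "NOUN", "_", "_", "0", "root", "_",
     "HyperLemma=STANDARD_LEMMA"]
)
# A dialect word with no counterpart in the standard simply has no HyperLemma.
without_hyper = "\t".join(
    ["1", "FORM", "DIALECT_LEMMA", "NOUN", "_", "_", "0", "root", "_", "_"]
)
```

Returning `None` when the key is absent mirrors the point made above: no standard lemma has to be invented for a word that is unattested in the standard dialect.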
By now it seems to me that there is enough necessity, or usefulness, for multiple lemmas (hypo-, hyper-, standard...) in enough projects that this would need to be made standardised/implemented somewhere in UD's format and guidelines.
There is some demand and potential future work on lemma-related issues in the UniDive WG2 Task 2.1 (@osenova). Sounds like an excellent opportunity to look at this, too.
First of all, thanks for the discussion and your time. Second, please, let me clarify a couple of points that may not be obvious.
Voicing in Modern Greek (the Standard and many dialects) is a regular phenomenon in certain environments and there are orthographic conventions regarding its representation; we adopt these conventions in the dialects that have the phenomenon and do not add any annotation. Below, we call this type of voicing "expected voicing". If unexpected voicing occurs, then different orthographic conventions are used to represent it because we think that there is a different phenomenon there that characterises the dialect.
Certain dialects use only voiced types of unvoiced lemmas of the Standard. For instance, speakers of the Lesbian dialect say η μπατρίδα (batriδa), singular nominative, and the lemma of the Standard is η πατρίδα (patriδa). We use for the Lesbian dialect the lemma μπατρίδα with no special annotation that would link it to the unvoiced lemma πατρίδα of the Standard.
Problems occur with dialects that use both unvoiced and unexpectedly voiced types. So in East Cretan we have:
A. Unvoiced and voiced lemma forms, e.g., πράγμα (pragma) and μπράγμα (bragma) in the nominative singular after the article, which is the typical lemma test for nouns and adjectives; the lemma of the Standard is unvoiced (πράγμα). We consider that the voiced lemma and the unvoiced lemma co-exist in the dialect and use the VoicedLemma feature (as a morphological feature) to mark the instances of the voiced one. Both types are assigned the unvoiced lemma. Of course, we could have chosen to assign the voiced lemma instead and even use an "Unvoiced" feature, but this solution seemed less good as far as model development was concerned. Lastly, we could use no features at all and one lemma, and assume that this dimension of the dialect is adequately captured at text level. Again, we thought that model-wise one lemma plus features would be a more informative solution.
B. For postnominal pronouns the VoicedLemma feature does not make sense because they form a highly irregular morphological paradigm with lemma εγώ. We find expectedly voiced, expectedly unvoiced and unexpectedly voiced pronouns, and we use the Voiced feature (as a morphological feature) to mark the unexpectedly voiced pronouns. Voiced and unvoiced pronouns are assigned the lemma of the Standard. Again, we could avoid using a feature for voicing.
If we assume that these unexpectedly voiced lemmas and pronouns are the result of some ongoing lexicalisation procedure then unexpected voicing can be considered part of the morphological identity of the word and not the result of a regular procedure (that would be captured by the orthographic conventions). So, we went for morphological features.
Euphonics, on the other hand, occur as expected and are all annotated in the 10th column.
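The decisions described in this comment can be summarised as a small lookup, offered here only as a reading aid. The function name, its parameters and the boolean flags are hypothetical; the rules are the ones stated above (expected voicing gets no annotation, unexpectedly voiced pronouns get Voiced=Yes, unexpectedly voiced lemmas that coexist with the unvoiced one get VoicedLemma=Yes, and dialects with only the voiced type keep the voiced lemma).

```python
def voicing_annotation(standard_lemma, voiced_lemma=None, *,
                       unexpected, is_pronoun=False, unvoiced_also_used=True):
    """Return (lemma, features) for a voiced token, following the scheme above."""
    if not unexpected:
        # Expected voicing is covered by the orthographic conventions alone.
        return standard_lemma, {}
    if is_pronoun:
        # Irregular paradigm: keep the Standard lemma, mark the token itself.
        return standard_lemma, {"Voiced": "Yes"}
    if unvoiced_also_used:
        # Voiced and unvoiced lemmas coexist: one (unvoiced) lemma plus a feature.
        return standard_lemma, {"VoicedLemma": "Yes"}
    # The dialect only has the voiced type: it is simply the dialect's lemma.
    return voiced_lemma, {}

east_cretan = voicing_annotation("πράγμα", "μπράγμα", unexpected=True)
lesbian = voicing_annotation("πατρίδα", "μπατρίδα", unexpected=True,
                             unvoiced_also_used=False)
pronoun = voicing_annotation("εγώ", unexpected=True, is_pronoun=True)
```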
Definition of voicing and euphonics
Below we distinguish between voicing and euphonics and annotate them as phenomena of a different type. Both voicing and euphonics are due to the influence of the phonetic environment. They change the utterance without affecting its syntactic structure and meaning.
Standard Modern Greek uses voicing extensively. Colloquial Standard Greek uses euphonics as well, as the main tendency is to use open syllables of the form consonant+vowel. Both voicing and euphonics are intensely used in Greek dialects.
Voicing in Greek is a phonological phenomenon where given the sequence of two words, the initial unvoiced consonant (/ts/, /t/, /p/, /k/) of the second word is voiced, e.g., /tsi/→/dzi/, /t/→/d/, /p/→/b/, /k/→/g/. There are several descriptions of the phenomenon with different consequences regarding the linguistic analysis of the first and the second word in the sequence but they do not affect the discussion here which is only about the representation of the contrast voiced/unvoiced in a UD treebank.
In contrast, euphonics are sounds that are added following the phonological procedure of epenthesis, in order to avoid both hiatus (produced by vowel sequences), e.g., /ˈu.te ˈom.bja.se/ → /ˈu.te ˈn om.bja.se/, and sequences of consonants, e.g., /ˈan ˈθe.ʎi/ → /ˈan e ˈθe.ʎi/. In all cases, the result of the epenthesis is two open syllables of the type consonant+vowel.
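The two environments named here (hiatus and consonant sequence) can be told apart mechanically from a transcription. The sketch below is a simplification under stated assumptions: it works on IPA strings like the ones above, uses the five-vowel inventory /a e i o u/, and ignores everything else about Greek phonology; the function names are hypothetical.

```python
VOWELS = set("aeiou")  # simplified five-vowel inventory

def strip_marks(ipa):
    """Drop stress marks and syllable dots from an IPA word."""
    return ipa.replace("ˈ", "").replace(".", "")

def boundary_type(word1, word2):
    """Classify the junction between two transcribed words."""
    left, right = strip_marks(word1), strip_marks(word2)
    last_is_vowel = left[-1] in VOWELS
    first_is_vowel = right[0] in VOWELS
    if last_is_vowel and first_is_vowel:
        return "hiatus"   # candidate site for a euphonic consonant
    if not last_is_vowel and not first_is_vowel:
        return "cluster"  # candidate site for a euphonic vowel
    return "open"         # vowel-consonant or consonant-vowel junction

hiatus_case = boundary_type("ˈu.te", "ˈom.bja.se")
cluster_case = boundary_type("ˈan", "ˈθe.ʎi")
```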
The representation of voicing and euphonics in the treebank is the topic of this issue.
Voicing in the treebank
The orthographic representation of voicing in Modern Greek follows these widely used conventions:
1. If an article with the features Case=Accusative and Gender=Masculine or Feminine is followed by a word whose first consonant is voiced (although its lemma form does not begin with a voiced consonant), then a “-ν” is added to the article, e.g.
2. In all other [word1 word2] sequences where word2 appears with a voiced first consonant (while allomorphs with a non-voiced first consonant are attested) and word1 is not independently found with a final “-ν”, voicing is represented on word2, e.g.
Naturally, the encoding of voicing as text is subject to the orthographic conventions of Modern Greek, which represents each of the sounds /dz/, /d/ and /b/ with two characters because it has no single character for them.
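A minimal sketch of what convention 2 amounts to at the level of spelling: the initial unvoiced consonant of word2 is replaced by the digraph that writes its voiced counterpart. The mapping uses the standard Greek digraphs τζ, ντ and μπ; γκ for /g/ is ordinary Greek orthography but is not spelled out in the text above, and the helper name `voice_initial` is hypothetical. It does not decide when voicing applies, only how the result is written.

```python
# Unvoiced initial spelling -> spelling of its voiced counterpart.
VOICED_SPELLING = {
    "τσ": "τζ",  # /ts/ -> /dz/
    "τ": "ντ",   # /t/  -> /d/
    "π": "μπ",   # /p/  -> /b/
    "κ": "γκ",   # /k/  -> /g/
}

def voice_initial(word):
    """Rewrite the first consonant of `word` as its voiced digraph, if any."""
    # Try longer keys first so that "τσ" is matched before "τ".
    for unvoiced in sorted(VOICED_SPELLING, key=len, reverse=True):
        if word.startswith(unvoiced):
            return VOICED_SPELLING[unvoiced] + word[len(unvoiced):]
    return word  # words not starting with one of these consonants are unchanged

examples = [voice_initial(w) for w in ("του", "τση", "πράματα", "άλλος")]
```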
Often, word2 is a weak personal pronoun used to express possession or it functions as a clitic related to the object and indirect object syntactic dependencies.
These voiced weak pronoun forms coexist with their unvoiced counterparts. Both voiced and unvoiced forms receive the same lemma, namely that of the personal pronoun. Appropriate annotation is necessary to distinguish between them on the grounds of the property “voiced”.
We cannot resort to the MSeg representation in order to encode voicing in a UD treebank. This is because voicing is a phonological procedure that only changes the first sound of a word without adding any new sounds or changing its morphosyntactic features. The results of voicing cannot be separated from the rest of the word in the form of some type of affix. For instance, “τζη” (/dzi/) cannot be divided as “τζ-η” because “-η” (/i/) is not a word with the same morphosyntactic features as “τζη” (recall that voicing has no morphosyntactic effect). Similarly, in the case of the /tu/→/du/ voicing, whose results are encoded as “του”→“ντου” in the established writing system, we cannot write “ν-του” because that would split the representation of a single sound. The splitting “ντ-ου” is also wrong because it produces the non-existent word “ου” (/u/).
In the light of these facts, we seem to need a diacritic lexical feature that differentiates the unvoiced variant from the voiced one. We propose the definition of a new morphological feature called “Voiced” with values “Yes” and “No”. While “No” can be considered the default value and can be omitted, “Yes” should not be omitted. We have opted for a morphological feature rather than a MISC one because voicing affects the form of a word.
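For illustration, adding the proposed feature to an existing FEATS cell could look like the sketch below. It relies on the CoNLL-U conventions that FEATS is a pipe-separated list of Name=Value pairs, "_" when empty, sorted alphabetically by feature name without regard to case. The helper name `add_feature` and the sample feature bundle are hypothetical.

```python
def add_feature(feats, name, value):
    """Return a FEATS cell with name=value added, keeping CoNLL-U ordering."""
    pairs = {}
    if feats != "_":
        pairs = dict(item.split("=", 1) for item in feats.split("|"))
    pairs[name] = value
    # CoNLL-U expects features sorted by name, case-insensitively.
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)

# Hypothetical feature bundle for a weak pronoun such as "ντου".
voiced_pronoun = add_feature(
    "Case=Gen|Gender=Masc|Number=Sing|Person=3|PronType=Prs", "Voiced", "Yes"
)
```

Since "No" is the default and may be omitted, unvoiced tokens would keep their FEATS cell unchanged under this proposal.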
Some challenging cases of voicing
dialect of East Crete:
“τα μπράματα” (/ta ˈbra.ma.ta/) coexisting with “τα πράματα” (/ta ˈpra.ma.ta/)
dialect of Lemnos:
“η μπατρίδα” (/i ba.ˈtri.ða/) coexisting with “η πατρίδα” (/i pa.ˈtri.ða/)
In both these examples, in the same dialect both the voiced and the unvoiced version of a word are used in sequences where no voicing is expected. This means that the voiced version is lexicalised and competes with the unvoiced version which is considered the “original” one, especially if it appears in the Standard. The problem here is which lemma is assigned to the two versions. We propose that they are both assigned the unvoiced version of the lemma, since this is likely to be the lemma in the Standard, and the voiced form is assigned the feature-value pair “VoicedLemma=Yes”. Again the value “No” is the default one and may not be declared.
Euphonics in the treebank
Euphonics, on the other hand, clearly are vowels or consonants that occur between words (example 5a,b) or at the end of a word (example 5c).
Their textual encoding is an issue. They cannot be encoded as orthographic words because they have no morphosyntactic properties. Traditionally, in Standard Modern Greek, some euphonics are attached to some words (example 5c). We had to define additional guidelines in order to encode dialectal Greek. In all cases, we have attached euphonics to the word that precedes or follows them according to specific conventions. Because of the way they are encoded, euphonics look like affixes, but, as it has already been made clear, they affect neither the morphosyntactic status of a word nor the meaning of the construct where the word occurs. We propose the MSeg representation and the label “euphonic” for encoding euphonics in the Greek treebanks.
Notice that example 6 from the dialect of Eastern Crete contains a euphonic that is encoded with two characters “γι” because the Greek alphabet does not have a dedicated character for the sound /ʝ/. Several such euphonics are found in the many Greek dialects. We could probably use non-Greek characters for these euphonics, for instance in example 6 we could use “j”. At the moment, since dialectal treebanks are in close connection with GUD, the treebank of Standard Modern Greek, we prefer to use the same alphabet.