cldf-clts / clts

Cross-Linguistic Transcription Systems
https://clts.clld.org

Dummy Marker Ø #100

Closed LinguList closed 3 months ago

LinguList commented 3 years ago

My future code will require the extensive use of dummy markers that extend the source/target construct. This can also be used for allophonic variation, e.g., e a being phonetically e j a: if I see this in the data but want to mark in an alignment that the j should be ignored, I can write e j/Ø a. Or if I want to add it, I can write e Ø/j a. Having the Ø as a symbol seems useful, as it would otherwise resolve as a wrong sound in an evaluation, although it is intended.

xrotwang commented 3 years ago

See this related CLDF issue: https://github.com/cldf/cldf/issues/93#issuecomment-713341579

LinguList commented 3 years ago

Thanks, I knew we had this discussion somewhere. Is it okay if I go ahead and test this in the near future?

xrotwang commented 3 years ago

Yes!

xrotwang commented 3 years ago

I'd lean towards having this null phoneme live only in pyclts - but I'm not sure. Should it exist in BIPA? I.e. have a row in the relevant tables? And if so - what are the features?

xrotwang commented 3 years ago

It might fit in here: https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/markers.tsv

LinguList commented 3 years ago

Yes, I was thinking of a marker, and just give it zero features.

LinguList commented 3 years ago

What I am experimenting with now is the following: for the moment I need the zero-marker not for CLDF representation but for experiments on improved alignment and sound correspondence approaches. The source/target construction so far allows me to work with the data and the Ø strings in EDICTOR and the like, but for CLDF conversion I want to share the data from the book, not my interpretation of it, so I can use a simple conversion routine that reverts everything I added myself:

                Segments=[y for y in [x.split('/')[0] for x in wl[idx,
                    'tokens']] if y != "Ø"],

This means, I can render a string that was given as ŋ (indicating syllabic nasal) as ŋ Ø/ə, for alignment purposes, and later revert to ŋ when sharing CLDF data (which is following the source).

In this sense, one could even live without the marker, but a function (in lingpy or linse) that makes this operation transparent may be useful. So far, I only use it in liusinitic, where I have also cases of allophony, which prevent me from finding cognates, so I re-write the allophones to be identical with their other parts (consider h a n t vs. h a n t/d in German).
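
A minimal sketch of such a helper, assuming tokens are plain strings in the source/target notation used above. The names revert_to_source and apply_target are hypothetical and not part of lingpy or linse; the first one mirrors the list comprehension shown earlier, the second is the assumed counterpart that keeps the added analysis instead.

```python
NULL = "Ø"


def revert_to_source(tokens, null=NULL):
    """Keep the part before "/" in each token and drop null markers."""
    source = [token.split("/")[0] for token in tokens]
    return [segment for segment in source if segment != null]


def apply_target(tokens, null=NULL):
    """Keep the part after "/" (if there is one) and drop null markers."""
    target = [token.split("/")[-1] for token in tokens]
    return [segment for segment in target if segment != null]


print(revert_to_source("ŋ Ø/ə".split()))    # ['ŋ']
print(apply_target("ŋ Ø/ə".split()))        # ['ŋ', 'ə']
print(revert_to_source("h a n t/d".split()))  # ['h', 'a', 'n', 't']
```

With both directions available, the annotated form can be stored once and either reading derived from it on demand.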

As this kind of analysis is not yet really explored, I am merely testing its suitability for now, and I think it should not be placed into the CLDF datasets yet.

xrotwang commented 3 years ago

After a bit of thinking, I'd say a marker in BIPA would make most sense. After all, that's what this is: a marker with context- or processor-specific interpretation and the only fixed semantics of "it's a phoneme, we think". So it only needs to be distinguished from the other markers.

cormacanderson commented 3 years ago

I think it is useful that Ø is not specified as being either consonant or vowel, so putting it in as a marker makes sense. However, what would be useful is if it could host diacritics. Is that possible on a marker? I'm thinking of a variety of use cases here that fall under the general rubric of what are often called floating features.

LinguList commented 3 years ago

That would require it to be a vowel or a consonant. Since floating features are PROBABLY things that would not be shared across consonants and vowels, I'd add two more symbols, namely a C for any consonant, or a V for any vowel, which could take on features. Furthermore, one could define them in such a way that they resolve as being EQUAL to a vowel or a consonant.

cormacanderson commented 3 years ago

Generally not, although I can think of the odd instance where this is the case – mostly with length, but other features such as palatality or labiality are common to both vowels and consonants. So, you are proposing then that the symbol C would be effectively an empty consonant node, defined as a consonant, but not for any other features, that could take diacritics? And similarly V defined as a vowel? That would be useful. The only reservation I have is that these are sometimes used to define any consonant or vowel, rather than an inherently empty one, an unfilled position, for which we usually use Ø (or similar).

LinguList commented 3 years ago

Well, we will use the Ø mostly in alignments, or do you think of another concrete situation in which you'd use it? For an alignment, it doesn't need features. If you want to have a sound that could be any sound, but restricted by features, like a wildcard, I suggest switching to C and V, which would semantically not be an empty sound. I'd also recommend first seeing how far we can get with the empty sound for our phonological-phonetic or morphonological coding, such as h a m p ə l + Ø/ə n in German, where the phonetics are hampeln, but I'd like to say that the final is still in some sense a normal infinitive -en, so I can align them and annotate them. If we realize that this is not enough, then we come up with a more complex solution?

cormacanderson commented 3 years ago

The concrete situation where I most often see Ø in the literature is where there is evidence for a consonantal (or less commonly vocalic) position ("node") that influences phonological behaviour but is not pronounced on the surface. This type of representation is all over the phonological literature. This is a nice recent example https://langsci-press.org/catalog/book/228. My own PhD dissertation uses this an awful lot. I can come up with lots of other examples too if you like. I suppose we could use C and V for these cases, but I see them as having a somewhat different function, i.e. any consonant or vowel, it not mattering what that is. The Ø is used specifically when there is no surface realisation. That's not to say, of course, that we have to follow the conventional usage, but C and V are likely to be misinterpreted, it seems to me. For the German example, out of curiosity, would + -/ə n not work? I thought we used - for the lack of any sound.

LinguList commented 3 years ago

- is the gap character in alignments. For an alignment interpretation, you are probably right, since we align one sequence against the other in the / construct, so we can say: this is an inline alignment of two sequences, which we use as a shortcut. The Ø means "any sound" inside a correspondence pattern, where I have no data (no cognate word) for a given language. So you are right, using the - is probably more consistent in my use cases. I should reconsider those. And defining this as an inline alignment is also nice, since it is a rather powerful construct that is still simple. C and V would be classical cases for sound classes in sound change, where I have a first test case going to do sound change with tiers. But here, I define sound classes on the fly, so I can use any symbol. This means: we should probably not rush at all and be careful before we assign the Ø now.
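
To make the "inline alignment" reading concrete, here is a small sketch that expands a token list into two aligned rows of equal length. The function name expand_inline_alignment is hypothetical, and the sketch assumes that a token without "/" is shared by both rows.

```python
def expand_inline_alignment(tokens):
    """Split tokens such as "-/ə" into two aligned rows of equal length."""
    left_row, right_row = [], []
    for token in tokens:
        left, _, right = token.partition("/")
        left_row.append(left)
        # A token without "/" belongs to both rows unchanged.
        right_row.append(right if right else left)
    return left_row, right_row


left, right = expand_inline_alignment("h a m p ə l + -/ə n".split())
print(" ".join(left))   # h a m p ə l + - n
print(" ".join(right))  # h a m p ə l + ə n
```

The gap character - then plays exactly the role it has in ordinary alignments: it marks a column that is empty in one of the two rows.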

LinguList commented 3 years ago

But btw: if something is not pronounced on the surface, but behaves like a sound triggering something in alternations, what kind of analysis are we discussing then? It would not feature in sound inventories, but would it feature in transcriptions? Could you provide a concrete example, so I can understand this better?

cormacanderson commented 3 years ago

I'll give you a few use cases for Ø, that come to mind from the literature.

French

Most French words that begin with a vowel on the surface select a vowelless allomorph of the singular article and of the preposition de, and take e.g. -z- liaison after the plural article, e.g. l'heure, l'humanité, le[z] histoire, etc. These can be taken to be truly vowel-initial (h-muet). However, a smaller group takes the vowel-final allomorphs instead, and is often analysed as having an initial silent consonant /Ø/, e.g. la haine, la honte etc.

Old English

All vowel-initial words can alliterate, irrespective of the value of the vowel. Some people (e.g. Russom 2017) thus take them as containing an initial silent consonant. This would also be quite useful I think in aligned poetry samples to pick up patterns.

Amarasi (data from Edwards O. (2017) The Phonology and Morphology of Metathesis. In Morphology, DOI 10.1007/s11525-017-9314-y)

Amarasi nouns have two forms, used in different syntactic environments. There is an unmetathesised U-form and a metathesised M-form, where the final C and V are metathesised, e.g. fatu~faut 'stone', besi~beis 'knife'. A final consonant is dropped, suggesting we are dealing with the final CV pair, e.g. muʔit~muiʔ 'animal'. Some nouns appear not to metathesise, but these can be accounted for under the same rule if you consider them to have a Ø between two vowels, e.g. kaØut~kauØ 'papaya', heØum~heuØ 'mango'.
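
As a rough illustration of how a single rule covers both groups once Ø is treated as a consonant, here is a sketch that derives the M-form from the U-form. It is a simplification under stated assumptions: one character per segment, a five-vowel set, and hypothetical function names m_form and surface. It is not an implementation of the analysis in the cited paper.

```python
VOWELS = set("aeiou")  # assumed vowel set, sufficient for the examples above
NULL = "Ø"


def m_form(u_form):
    """Drop a final consonant, then swap the final consonant-vowel pair."""
    segments = list(u_form)
    if segments and segments[-1] not in VOWELS:
        segments = segments[:-1]
    if len(segments) >= 2 and segments[-1] in VOWELS and segments[-2] not in VOWELS:
        segments[-2], segments[-1] = segments[-1], segments[-2]
    return "".join(segments)


def surface(form):
    """Remove the silent consonant to obtain the pronounced form."""
    return form.replace(NULL, "")


for u_form in ["fatu", "besi", "muʔit", "kaØut", "heØum"]:
    print(u_form, m_form(u_form), surface(m_form(u_form)))
```

On the surface, kaut~kau looks like plain consonant deletion; with the Ø in place it is the same metathesis as in fatu~faut.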

Southern Paiute (and other Numic)

Simplifying considerably, in morpheme concatenation, some morphemes cause the following stop to appear as a spirant or flap, some as a geminate, and some as a nasal (or nasal-stop cluster). There's been a lot of debate about how to deal with this from Sapir onwards, but the main analysis sees the morphemes that select a spirant or flap as being vowel-final (so you can say there is intervocalic lenition of stops), the ones that select a nasal as being nasal-final and the ones that select the geminate as having an abstract final feature ", that causes gemination. I would represent this " with a final Ø, so:

ma- 'hand' + -tɨkka 'eat' > maɾɨkka
taØ- 'foot' + -tɨkka 'eat' > tattɨkka
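
The two examples can be reproduced with a very small sketch, assuming one character per segment and covering only the t~ɾ pair attested above. The function name attach, the vowel set, and the fallback for other prefixes (plain concatenation) are assumptions for illustration, not a description of Numic phonology.

```python
VOWELS = set("aeiouɨ")  # assumed vowel set
NULL = "Ø"
LENITED = {"t": "ɾ"}    # only the pair attested in the examples above


def attach(prefix, stem):
    """Join prefix and stem, letting the prefix-final segment shape the stem onset."""
    onset, rest = stem[0], stem[1:]
    if prefix.endswith(NULL):
        # Final Ø: the following stop surfaces as a geminate.
        return prefix[:-1] + onset * 2 + rest
    if prefix[-1] in VOWELS:
        # Vowel-final prefix: intervocalic lenition of the stop.
        return prefix + LENITED.get(onset, onset) + rest
    # Any other prefix (e.g. nasal-final): plain concatenation.
    return prefix + stem


print(attach("ma", "tɨkka"))   # maɾɨkka
print(attach("taØ", "tɨkka"))  # tattɨkka
```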

In other potential use cases, Ø can take features:

Irish

There is a palatalised-velarised contrast for every consonant, e.g. /tˠ/ and /tʲ/ are separate phonemes. There is also an extensive system of prothetic consonants, e.g. the masc. sg. nom. article prefixes a coronal to vowel-initial words, the 3rd plural possessive ə prefixes a nasal. The palatalised-velarised quality of the prothetic consonants is not predictable from the surface vowel quality, so we have surface forms such as the following:

axt 'decree', ax 'horse', ɪsʲkʲə 'water', ɪsʲpʲiːnʲ 'sausage'

However, with prothetic consonants (masc. sg. nom. article):

ə tˠaxt, ə tʲax, ə tɪsʲkʲə, ə tʲɪsʲpʲiːnʲ

We could deal with this by lexical selection, but a better analysis to my mind is the following:

Øaxt, Øʲax, Øɪsʲkʲə, Øɪsʲpʲiːnʲ

There is further support for this that I won't go into (though I obviously can if you want me to): 1) this analysis allows me to account for vowel allophony far better – in fact I can halve the size of the phonemic vowel inventory with this one move, as otherwise the front-back quality of short vowels is always predictable from surrounding consonants, 2) there is evidence that Ø can occur also before sonorants, i.e. r/l, allowing for a considerably streamlined account of some otherwise very irregular allomorphy in the verbal system.

Sierra Miwok

Slightly different, but Broadbent (1964) analyses length as a consonantal phoneme, for the same type of distributional reasons that I give for Irish above: "this procedure makes it possible to simplify many statements, especially those concerning canonical forms and rules of stress".

LinguList commented 3 years ago

Nice, thanks, I'll answer to all examples I understand.

French

The liaison is a good example for inline alignments (l Ø/a + i s t w a r and l a + E n). The analysis as a silent consonant does not have direct implications for a sound inventory, though, right? You could say there is something like a glottal stop (or am I mistaken?) as in the German cases, and add this to the inventory, but you would not say French has Ø in its inventory. The alternation needing an explanation will show up in inline alignments. The idea that there is an imaginary consonant is theoretical, but has no direct consequences on coding data, in my opinion, as the alternation can be retrieved without adding this dummy consonant.

cormacanderson commented 3 years ago

Those are just the ones that come most quickly to mind. It wouldn't be difficult to find others. It's striking to me here that all of the examples I can think of are consonantal. The generativists would describe this as an "empty root node". There are a lot of people that would consider long vowels to have this structure in many languages, i.e. [aː] as /aØ/ and you can see the advantage of this in cases like compensatory lengthening, where it would allow us to align length very easily with a lost consonant, e.g. skedl- > skeØl- (i.e. skeːl-) to take an example from the history of Irish.

From this, I think that I would be in favour of adding Ø as a consonantal "sound", without any further features (beyond consonantal).

LinguList commented 3 years ago

Old English

Again a case where you would not add a consonant, but rather find a pattern, that vowels count as one sound in alliteration. Adding a glottal stop as a real interpretation seems again to be a good choice, as the same pattern can still be observed in German, where glottal stop plays a similar role.

cormacanderson commented 3 years ago

Yes, Ø as we are discussing it here often behaves similarly to a glottal stop, or indeed /h/, but I wouldn't be inclined to equate it with either, as these can also occur separately. I see Ø as a largely theoretical construct, yes, but one well motivated by looking for greater generalisation in our analysis, usually insofar as our account of the morphology or morphonological alternations are concerned.

LinguList commented 3 years ago

Yes, all cases seem to be consonantal, and in most cases, you can even give a concrete example, or you could use this to show patterns of inheritance, as in your Amarasi example, so it is kind of a historical analysis. And in this sense, it is similar to the inline alignments, which I use to make relations between allomorphs transparent, or between different stems. If you make a database of cognate words, you could really model the lengthening process, if it exists, in this way. And you could use the element to distinguish set and anit roots in Sanskrit.

LinguList commented 3 years ago

Since we do not use the construct so far (and I'd switch to inline alignments with alignment markers in my own use cases, as you convinced me here), I wonder if we should not wait until we have a concrete dataset where this would be needed, before we start to add it without having explored it on concrete data?

cormacanderson commented 3 years ago

I didn't know about the set and anit roots, but this is a nice case. Here, and as you say also elsewhere, this Ø is often a historical relic, where a sound has been lost or changed but leaves an effect in the morpho(phon)ology.

I wouldn't like to add ʔ in the French case, because we don't have a glottal on the surface – h would be better historically, but I think Ø is more principled. You're right though that in Old English we quite likely would have had ʔ, although it's hard to know for sure.

I'm happy to not add it for now, although I do have two datasets (one for Numic, one for Irish) that are in preparation, where this would be very useful to have. I remember a couple of years ago when I was dealing with the Numic data in Edictor that not having a symbol such as this made it much more difficult to decide how to do the morpheme segmentation (in my example from above, I couldn't use tat- 'foot' because sometimes it is tat-, sometimes tap-, sometimes tak-, etc., so I would like to use taØ-).

Should we just close this issue for now then, and I will reopen it (or open a new one) when we have data where this would be useful? Alternatively, we could just add it now and we have it there for when it is needed – it's only one consonant after all, no extra features, and unlikely to create too many problems, is it?

LinguList commented 3 years ago

I'd suggest: let us have a look at the database you want to prepare (if you want to do that via CLDF), and leave this issue open. If we have these as an example, we can add the sound already in the development version and see how well it fits there? Maybe a good time to relaunch our efforts, now that edictor has the new feature for morpheme annotation, which is a game changer in my impression?

cormacanderson commented 3 years ago

Okay, sounds good, let's do it like this.

cormacanderson commented 3 years ago

Reflecting on this, two things came to me.

One is that although it's true that one could call this a theoretical construct, to a certain extent one could say the same about other segments too, which are already abstractions from speech data. Ø is different, in that we can't see it directly, but we can see its effect. Much the same could be said for gravity.

Two, I think back to what we discussed before about Scheer's CVCV phonology and similar representational frameworks. In these frameworks, length is usually represented as being an extra CV. This is essentially similar to the Miwok example above, of length being a consonant and is how I deal with it in my PhD as well. For example, in Celtic we have the following (where 1, 2, 3 represent the beginning of a CV constituent):

1       2 3
s  kʷ e t l o      (Proto-Celtic)
x  w  e d l -      (Welsh chwedl)
sʲ kʲ e Ø l -      (Old Irish scél)

Yes, /eØ/ is phonetically [eː], but there is good reason to analyse it like this. It's exactly what I do in my PhD dissertation. Consider also the Bloch/Trager analysis of English vowels as vowel + diphthong, e.g. /ij/ etc., which has a long tradition (Jakobson does it this way, so does Labov). Just a little bit of abstraction, but useful abstraction.

LinguList commented 3 years ago

We have the ~ as a nasal extension marker, which you can use to represent a nasalised vowel such as ã as a ~. We have experimented with this to some degree, but I haven't overused it recently. As an analysis, it is something valid, I think, and useful for some alignments. If you have a consistent dataset in which : is better split off the vowel and analyzed separately, why not use Ø for it? But since it surfaces as length, it is different from cases which surface as a segment, so I'd suggest a representation as :/Ø for this, so you can design an algorithm that later attaches it to a preceding vowel and also makes sure this only occurs after vowels.

If you use this as a tool for everything, but forget about how to create the surface form when deleting it, this is a bit dangerous: it will make you drift away from analyzing the data while giving you the false impression that you are analyzing it.

So the rule for introducing elements, like the ~ is that they have a clear semantics, that I can test in the code: ~ occurs only after clts.Vowel, ~ behaves like a consonant, etc.
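
A testable rule of this kind could look like the following sketch. The function name misplaced_nasal_markers is hypothetical, and the small vowel set is only a stand-in for a real check against the vowel class mentioned above.

```python
VOWELS = set("aeiouəɨ")  # stand-in for a proper vowel check


def misplaced_nasal_markers(segments, marker="~"):
    """Return the positions of markers that do not directly follow a vowel."""
    return [
        index
        for index, segment in enumerate(segments)
        if segment == marker and (index == 0 or segments[index - 1] not in VOWELS)
    ]


print(misplaced_nasal_markers("a ~ t a".split()))  # []
print(misplaced_nasal_markers("~ a t ~".split()))  # [0, 3]
```

An empty result means the sequence respects the rule; any returned position points at a marker that needs attention.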

LinguList commented 3 years ago

So if we discuss alternatives to segmental representation with two different forms, like ã t a vs. a ~ t a, or e: t a vs. e Ø t a, it would be useful to think about the semantics. If it is about compensatory lengthening, I suggest to use a more concrete symbol, and not add it to CLTS, but rather test it first.

LinguList commented 3 years ago

We should btw. also add tests or conversion routines for ~.

LinguList commented 3 years ago

So you should be able to convert a ~ t a by a function to ã t a. Similar with your : case.

LinguList commented 3 years ago

Any idea for a symbol that would be clearer than "Ø" here?

cormacanderson commented 3 years ago

Yes, I like this usage of ~ and didn't know that that was implemented in that way. This is the type of thing I have in mind.

I think that it probably depends on the language which analysis works best. In the Miwok case I mentioned above (Broadbent 1964) she analyses length consonantally like this, with the symbol /ˑ/, and it has a consonant-like distribution.

In English, the standard analysis would be with /j w h/ and that works quite well there, I think.

In Irish, the phonemes I use Ø for have a clear distribution: initially and after a vowel. They have the same sonority profile as fricatives, in that they can occur before a sonorant initially. For that reason alone, it would be a bit weird to use glide symbols. Also, every consonant in Irish has a secondary articulation. Seeing as they are consonants, they should too. That being the case, the most consistent choice is /Øʲ/ and /Øˠ/ in Modern Irish, and /Øʷ/, /Øʲ/ and /Ø/ in Old Irish.

cormacanderson commented 3 years ago

I agree fully, by the way, on the need to always be able to automatically get to a narrower transcription. We should always be able to do this, whatever degree of abstraction we are working with. However, once we can do that, I don't see any reason to limit our abstraction: if the spell-out is clear and unambiguous then I don't see a problem with more abstract representations.

In the Old Irish case, Ø allows me to more easily spell out the surface realisations. Before a consonant, Ø spells out as length [ː], while between vowels it spells out as some form of hiatus [.] (possibly a glottal stop or a glide). So /aØ/ is [aː], while /aØa/ is [a.a]. Initially, and before a consonant /Ø/ is silent, but assimilates a preceding consonant to its secondary localisation, so C + Øʲ > Cʲ, as in the examples I gave earlier.

We have inflectional morphemes of the form e.g. /Øʲ/ that show that we are dealing with the same element here. Add this morpheme to a CV root and you get a long vowel /kʲə+Øʲ/ = /kʲəØʲ/ [kʲiː]. Add it to a consonant and it assimilates it to its secondary articulation: /kan+Øʲ/ = /kanʲ/ [kænʲ].

Here, the historical development is clear and the spell-out is too, but there isn't an obvious segmental correlate to represent it. It's a consonant node with secondary articulation but no other feature content.
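
A sketch of the plain spell-out rules described above (length after a vowel when no vowel follows, hiatus between vowels, silence elsewhere). It deliberately ignores secondary articulation and the assimilation of a preceding consonant; the function name spell_out and the vowel set are assumptions for illustration only.

```python
VOWELS = set("aeiouə")  # assumed vowel set
NULL = "Ø"


def spell_out(segments):
    """Replace each Ø by length, hiatus, or nothing, depending on its neighbours."""
    surface = []
    for index, segment in enumerate(segments):
        if segment != NULL:
            surface.append(segment)
            continue
        after_vowel = index > 0 and segments[index - 1] in VOWELS
        before_vowel = index + 1 < len(segments) and segments[index + 1] in VOWELS
        if after_vowel and before_vowel:
            surface.append(".")   # hiatus between two vowels
        elif after_vowel:
            surface.append("ː")   # length on the preceding vowel
        # Otherwise (initially or after a consonant) Ø stays silent.
    return "".join(surface)


print(spell_out(["a", "Ø"]))       # aː
print(spell_out(["a", "Ø", "a"]))  # a.a
print(spell_out(list("skeØl")))    # skeːl
```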

LinguList commented 3 years ago

I think we should discuss including these non-marker symbols that are not sounds as some extra category, potentially with features, so one can disambiguate them and predict their behavior. So a nasal dummy merges with a vowel to become a nasal vowel. Another dummy yields a long vowel, etc. For testing, we could design a flexible class for these cases with functions, so one can analyze and manipulate them. What should we call this class?

I suppose: we only add this class to the pyclts code, and play with its definitions, but do not add anything to clts for now, only see how well this works (e.g., a + ~ = ã, etc.).
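
One possible shape for such an experimental class, as a sketch only: DummyMarker and merge_markers are hypothetical names, nothing here is existing pyclts API, and using Ø as the length dummy is just one of the options discussed in this thread.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DummyMarker:
    """A non-sound symbol together with the way it merges into a preceding segment."""
    symbol: str
    merge: Callable[[str], str]


# The nasal dummy adds a combining tilde; the length dummy adds a length mark.
NASAL = DummyMarker("~", lambda vowel: unicodedata.normalize("NFC", vowel + "\u0303"))
LENGTH = DummyMarker("Ø", lambda vowel: vowel + "ː")


def merge_markers(segments, markers=(NASAL, LENGTH)):
    """Fold every dummy marker into the segment that precedes it."""
    by_symbol = {marker.symbol: marker for marker in markers}
    merged = []
    for segment in segments:
        marker = by_symbol.get(segment)
        if marker is not None and merged:
            merged[-1] = marker.merge(merged[-1])
        else:
            merged.append(segment)
    return merged


print(merge_markers("a ~ t a".split()))  # ['ã', 't', 'a']
print(merge_markers("e Ø t a".split()))  # ['eː', 't', 'a']
```

Keeping the merge behaviour in data rather than in code paths makes it easy to add or swap dummies while experimenting. The sketch does not check that the preceding segment is a vowel.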

LinguList commented 3 years ago

Do you want to join my sound change camp, where I test new code to model sound change with tiers? It works in Python, but at some point I will add an online tool in JavaScript. These processes would also be good to test there.

cormacanderson commented 3 years ago

Sounds interesting, I'd be happy to. When is that on? Maybe let's email to arrange a meeting too, as we should also discuss the phoneme presentation in June.

cormacanderson commented 2 years ago

I have data using this symbol (in IE-CoR), which I am currently involved in preparing as a proper cldf dataset. I would like to add it as a consonant (as it is used in the examples I have come across), also as it can take further features. Ideally it would have as few intrinsic features as possible, but I realise that all other consonants have minimally phonation, place, and manner. That being the case, I would propose specifying it as a voiceless glottal approximant, i.e. the approximant counterpart of a glottal stop. How do you feel about this @LinguList ?

LinguList commented 2 years ago

If you work on orthography profiles for the IE data now, I suggest to proceed as follows:

When I find time, I'd then add an AbstractConsonant as a specific class to the sound classes in CLTS, as well as an AbstractVowel and an AbstractSound (but maybe we had better try to have only AbstractConsonant and AbstractVowel to get started).

It seems to me we should clearly treat these differently, and that we have to add new features for these. For our AbstractConsonant, we could do it like this

Sound  Phonation  Place              Manner              Type
C      voiceless  unspecified-place  unspecified-manner  consonant
G      voiced     unspecified-place  unspecified-manner  consonant

The advantage of this notation is that you can modify it. You could add any features to it, and the sound would get them. But in comparison with other features, "unspecified-manner" would be treated as IDENTICAL with any manner, and the same would hold for the "unspecified-place".
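
The intended comparison behaviour can be sketched as follows. The dictionaries and the function features_match are hypothetical illustrations of the rule that an "unspecified" value is treated as identical with any value; they are not the CLTS data model.

```python
def features_match(first, second):
    """Compare two feature dicts; values starting with "unspecified" match anything."""
    for name in first.keys() & second.keys():
        a, b = first[name], second[name]
        if a.startswith("unspecified") or b.startswith("unspecified"):
            continue
        if a != b:
            return False
    return True


C = {"phonation": "voiceless", "place": "unspecified-place",
     "manner": "unspecified-manner", "type": "consonant"}
p = {"phonation": "voiceless", "place": "bilabial",
     "manner": "stop", "type": "consonant"}
b = {"phonation": "voiced", "place": "bilabial",
     "manner": "stop", "type": "consonant"}

print(features_match(C, p))  # True
print(features_match(C, b))  # False
```

Here the voiceless abstract consonant matches p, because only phonation and type are actually compared, and it fails to match b, because the phonation differs.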

It would be ideal to have even "unspecified-voicing", but I am not sure how to distribute symbols for those.

LinguList commented 2 years ago

Ah, I see now:

Sound  Phonation          Place              Manner              Type
       voiceless          unspecified-place  unspecified-manner  consonant
       voiced             unspecified-place  unspecified-manner  consonant
C      unspecified-voice  unspecified-place  unspecified-manner  consonant

LinguList commented 2 years ago

Or:

Sound  Phonation          Place              Manner              Type
Ø̥̬      voiceless          unspecified-place  unspecified-manner  consonant
Ø̥      voiceless          unspecified-place  unspecified-manner  consonant
Ø      unspecified-voice  unspecified-place  unspecified-manner  consonant

And Ø̩ would be a vowel.

LinguList commented 2 years ago

I suppose I let you decide, @cormacanderson, but this proposal would definitely work: add new sounds with dummy markers, add new features that are "unspecified", and make the features in such a way that they can be specified.

One problem: if one wants to specify place and manner, one will have to use new symbols. So this can quickly get a bit complicated, but one could start with some symbols that are frequent.

And if this is the way to go, I'd use C for the abstract consonant, unspecified regarding voice.

cormacanderson commented 2 years ago

This last one (correcting the first line to Ø̬ and voiced) is very elegant indeed. I really like this solution. Let's try with the dummy for now and then consider specified place and manner later, when we see how it goes with this.

LinguList commented 2 years ago

Sure, voiceless in the first line should be voiced, I did not really read what I wrote ;)

Adding support for features can also be done by yourself now.

LinguList commented 2 years ago

We'd only need to update the feature set here in addition to adding a new sound to consonants.tsv.

cormacanderson commented 2 years ago

So I add an unspecified voice feature, an unspecified manner feature, and an unspecified place feature as consonant features in features.tsv, then the consonant.

LinguList commented 2 years ago

No, you do not need this, as the markers are already taken.

LinguList commented 2 years ago

You need to add them to the json file (features.json, not features.tsv).

LinguList commented 2 years ago

E.g., "unspecified-place" goes here:

https://github.com/cldf-clts/clts/blob/8adfe4c6f58552e055dee15e94142f86abcb1e6b/pkg/transcriptionsystems/features.json#L132-L145

cormacanderson commented 2 years ago

Ah sorry, we are cross-posting. Okay, great, I'll do it there in features.json and then put in a PR, then another one for consonants.tsv. Excellent.

LinguList commented 2 years ago

Yes!