Removing double vowels from vowels.tsv

cormacanderson commented 3 years ago

A problem is how to deal with combinations such as ii ee aa oo uu etc. In some descriptions, these are taken to represent long vowels, but in others, particularly in lexical lists rather than phoneme inventories, they can represent sequences of identical vowels, i.e. what we might more correctly write with a syllable break: i.i, e.e, a.a, o.o, u.u. I would like to remove all double vowels as aliases for long vowels in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/vowels.tsv. People can always set up ii > iː etc. in their orthography profiles if they so wish.
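To make the ii > iː suggestion concrete, here is a minimal sketch of how a profile-driven conversion could work. The `PROFILE` dictionary and the `segment` function are hypothetical illustrations, not the actual orthography profile tooling, and the greedy longest-match strategy is an assumption made for the example.

```python
# Hypothetical, minimal orthography profile: grapheme -> IPA value.
# A dataset that wants long vowels states so explicitly here.
PROFILE = {
    "ii": "iː",
    "aa": "aː",
    "i": "i",
    "a": "a",
    "k": "k",
    "t": "t",
}


def segment(word, profile):
    """Greedy longest-match segmentation of `word` using `profile`."""
    longest = max(len(grapheme) for grapheme in profile)
    out, pos = [], 0
    while pos < len(word):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in profile:
                out.append(profile[chunk])
                pos += size
                break
        else:
            # Unknown grapheme: mark it instead of guessing a value.
            out.append("<" + word[pos] + ">")
            pos += 1
    return out


print(segment("kiita", PROFILE))
```

A dataset that prefers the two-vowel reading would simply leave "ii" and "aa" out of its profile, so the same word comes out as k i i t a.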

From the perspective of historical analysis, the two-vowel analysis will often be preferable. After all, linguists will often choose this representation for a good reason.

I note that even when dealing with phoneme inventories the status quo throws up some quite weird-looking constellations at times. Look at stan1318, one of the best-described languages on earth, where it throws up oddities such as i̯ː and u̯ː. See also beng1280, where the analysis in PHOIBLE is perfectly consistent, but oo > oː means that there is only one long vowel in the language once it has been through BIPA. A corollary here would be to allow diphthongs of the type uu̯ in parallel to au̯ and iu̯ etc.

LinguList commented 3 years ago

Yes, but two vowel analysis would require them to do so in their orthography profile. This is where it has to happen. CLTS is just saying: if you place them already into one sound slot, please do not do so. I mean, otherwise, it would be a diphthong, like from_i to_i, which is also not good, right?

cormacanderson commented 3 years ago

I agree that the orthography profile is the best place for these things to be determined. A problem with having this as an alias is diagnosis: when I was checking through the list of characters in PHOIBLE, or this time around in IE-CoR, I didn't notice it: it's very difficult when you are eyeballing an orthography profile of over 1000 lines and it isn't flagged. It was only later, looking through the inventories in https://digling.org/phonobank/, that I noticed it (this shows, btw, how useful the browser is as a tool).

I think that it should be a general principle that when a character (or combination of characters) is typically used with more than one meaning, we should not normalise it or have it as an alias. We should force people to specify a value in the orthography profile. This will stop errors like the ones we find here from occurring. For example, ee can be either eː or e.e and we don't know which one it is, so we shouldn't automatically parse it as one rather than the other. We had this recently also with ł, which is similarly ambiguous, as people sometimes use it for ɫ (i.e. lˠ) and sometimes for ɬ.
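The diagnosis problem could be eased with a small check that flags such graphemes instead of relying on eyeballing a long profile. The following is only a sketch: the `AMBIGUOUS` set and the `flag_ambiguous` function are hypothetical, and the set contains just the cases mentioned in this thread.

```python
import unicodedata

# Graphemes named in this thread as having more than one common reading.
AMBIGUOUS = {"ii", "ee", "aa", "oo", "uu", "ł"}


def flag_ambiguous(graphemes, ambiguous=AMBIGUOUS):
    """Return (line_number, grapheme) pairs that need an explicit value."""
    hits = []
    for lineno, grapheme in enumerate(graphemes, start=1):
        # Normalise so that composed and decomposed input compare equal.
        if unicodedata.normalize("NFC", grapheme) in ambiguous:
            hits.append((lineno, grapheme))
    return hits


for lineno, grapheme in flag_ambiguous(["a", "ee", "t", "ł", "eː"]):
    print(f"line {lineno}: '{grapheme}' is ambiguous, please specify a value")
```

Run over the grapheme column of a profile, this would report the line numbers where a decision is still needed.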

In one of the examples I give above, the original was e.g. uu̯, which is indeed a kind of diphthong, although I agree it's weird because it's homorganic. I think I would be in favour of allowing things like this though: especially for the purposes of an aligned dataset, I think it makes a lot of sense. It's a bit weird, but in the Arabic case I gave in the example above, we can be pretty sure that this is what the original author intended.

It is certainly better than the u̯ː which we get and which I actually think should not be allowed at all. This is semantically pretty much identical to wː, which feels to me to be quite a long way from uu̯. Is there some way we can stop "long" and the "non-syllabic" diacritic occurring together?
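One way such a restriction could be expressed is as a check on the diacritic sequence itself. This is a sketch under the assumption that the non-syllabic mark is U+032F and the length mark is U+02D0; the function name `is_long_nonsyllabic` is hypothetical and not part of any existing package.

```python
import re
import unicodedata

# U+032F COMBINING INVERTED BREVE BELOW marks a non-syllabic vowel,
# U+02D0 MODIFIER LETTER TRIANGULAR COLON marks length.
# The pattern matches a non-syllabic mark followed by the length mark,
# optionally with further combining diacritics in between.
LONG_NONSYLLABIC = re.compile("\u032f[\u0300-\u036f]*\u02d0")


def is_long_nonsyllabic(sound):
    """True if the same segment is marked both non-syllabic and long."""
    decomposed = unicodedata.normalize("NFD", sound)
    return LONG_NONSYLLABIC.search(decomposed) is not None


for sound in ["u\u032f\u02d0", "u\u02d0", "au\u032f"]:
    print(sound, is_long_nonsyllabic(sound))
```

Because the pattern requires the length mark to follow the non-syllabic mark on the same base character, ordinary diphthongs such as au̯ and plain long vowels such as uː are left alone.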

LinguList commented 3 years ago

Again, this may be a case for post-processing: the rules of the system only go so far, and it is often more useful to allow things than to prohibit them, and then to put one more system on top to normalize further.

For the vowels: we can remove the aliases. But handling a diphthong from_i to_i is something that will need some checks in the code.

cormacanderson commented 3 years ago

I think this is a different case than the voiced aspirates, because there is a genuine ambiguity here that makes the semantics of VV unclear, rather than just a convention that uses a non-principled character (e.g. bʰ) to stand for something slightly different.

My preference would be to remove the aliases and I will go ahead and remove the double vowel aliases now from vowels.tsv.

As for allowing combinations of two identical vowels to occur as diphthongs, I think probably that this is the principled approach, particularly if one of them is clearly marked as non-syllabic, e.g. uu̯. I am very happy to help out with any checks needed in coding if you let me know how I can contribute there.

As a corollary to what I propose above, I will happily set up a condition that explicitly blocks a long non-syllabic vowel, e.g. u̯ː. I think w is more principled for something like this.

LinguList commented 3 years ago

I am fine with removing the aliases and treating these cases as complex sounds from_x to_y. At the moment, uu̯ is illegitimate, which annoys me as well. So this would also fix this case.

cldf-clts / clts

Removing double vowels from vowels.tsv #120