cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

Diphthong class for CLTS #2

Closed LinguList closed 7 years ago

LinguList commented 7 years ago

Diphthongs (and maybe even triphthongs) are missing so far. I am hesitating because we need to decide on basic features, and I don't know how to code them. There will be a clear "from" and "to" vowel, but this will blow up the feature space. One could still generate the base file, though, even as a general strategy, making the first and the second vowel each one of our required categories for diphthongs (just don't forget: CLTS is not about realistic sounds but about handling sounds in data, although I am strictly against breathy-voiced unvoiced plosives, etc., see #1).
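
A minimal sketch of this combination strategy, with invented glyphs and feature keys rather than the actual CLTS categories:

```python
from itertools import permutations

# Assumed toy vowel inventory: glyph -> feature bundle (invented keys).
VOWELS = {
    "a": {"height": "open", "centrality": "front", "roundedness": "unrounded"},
    "i": {"height": "close", "centrality": "front", "roundedness": "unrounded"},
    "u": {"height": "close", "centrality": "back", "roundedness": "rounded"},
}

def generate_diphthongs(vowels):
    """Combine every ordered pair of distinct vowels into a diphthong
    whose features carry 'from_' and 'to_' prefixes."""
    table = {}
    for (g1, f1), (g2, f2) in permutations(vowels.items(), 2):
        features = {"from_" + k: v for k, v in f1.items()}
        features.update({"to_" + k: v for k, v in f2.items()})
        table[g1 + g2] = features
    return table

for glyph, features in sorted(generate_diphthongs(VOWELS).items()):
    print(glyph, features)
```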

tresoldi commented 7 years ago

This was my biggest obstacle, because it ends up involving other contours such as affricates and tones. My solution (I was going to discuss it next week, but we can start now) is to define two different levels of abstract representation: the base one is the 'sound', which only allows for "single" airstreams (which of course is not a very good way to describe it, but it should be clear), and the second is the 'segment' or 'phone'.

A 'segment' is made of one or more (usually two) sounds, thus allowing diphthongs, affricates and so on. Among the advantages, it allows us to distinguish between affricates and sequences of stop+fricative (a distinction that is phonetic in most languages -- "catch it" vs. "cat shit" -- and phonemic in a few, such as Polish), to treat diphthongs as single units, and to treat tones as "natural" segments: a tone is simply a segment made of vowel sounds with different pitch levels, in a way very similar to diphthongs. This also solves the problems with features for complex segments that you point out, such as Phoible with its horrible "+|-" or "0|+", because features would refer to sounds and not segments -- and when simulating sound changes, one can easily specify the single sound of a segment that bounds the rule, rather than the full segment (I once did some experiments extending Phoible with a reconstructed PIE; the laryngeals were completely out of place).
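
A minimal sketch of this two-level idea, with invented class and feature names (not an actual implementation):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Sound:
    """Base level: a single articulation with a plain feature bundle."""
    glyph: str
    features: Tuple[str, ...] = ()

@dataclass(frozen=True)
class Segment:
    """Second level: one or more Sounds treated as a single unit."""
    sounds: Tuple[Sound, ...]

    @property
    def glyph(self) -> str:
        return "".join(s.glyph for s in self.sounds)

# An affricate is a Segment of stop + fricative sounds ...
ts = Segment((Sound("t", ("voiceless", "alveolar", "stop")),
              Sound("s", ("voiceless", "alveolar", "fricative"))))
# ... and a contour tone is a Segment of vowel sounds at different
# pitches, structurally parallel to a diphthong.
rising_a = Segment((Sound("a", ("low-pitch",)), Sound("a", ("high-pitch",))))
print(ts.glyph, rising_a.glyph)  # ts aa
```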

I think it is a very good solution from a theoretical standpoint, but it has many limits for practical usage, and coding it has not been easy. It requires a very well-crafted set of functions to translate from this representation to IPA glyphs and descriptors; I have been struggling with this for months, but I am quite confident it can be done.

LinguList commented 7 years ago

Sounds really cool. The only caveat I'd have is purely practical: most of our data is underspecified, and we only know the orthographies, which is what actually lets us test it. People are impatient to have alignments etc. on data in whatever ugly alphabets they come in, and waiting for full analyses is not feasible, as we often lack the experts. So we now follow the code in segments to convert from an idiosyncratic to a more IPA-like representation of sounds. This works rather well so far, but I suspect the level of detail you would need would only work if we really know the languages well, such as, e.g., in the data we try to assemble for some Sino-Tibetan languages, or the Chinese dialect data I have assembled so far.
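
For reference, the segments workflow looks roughly like this; the mini-profile below is invented for illustration, and the exact API may differ between versions:

```python
from segments import Profile, Tokenizer

# Hypothetical mini-profile: source graphemes and an IPA-like mapping.
profile = Profile(
    {"Grapheme": "sch", "IPA": "ʃ"},
    {"Grapheme": "ng", "IPA": "ŋ"},
    {"Grapheme": "a", "IPA": "a"},
)
tokenize = Tokenizer(profile=profile)

print(tokenize("schang", column="IPA"))  # expected: 'ʃ a ŋ'
```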

Practical limitations, as I say. But for the rough purposes we need the rough "CLTS", I'd say, and I also think that we need the deeper ideas such as the ones you are working on. Now here's the question to you: do you think it will be possible to set up a communication channel between your system and CLTS? Some way to convert, say, from CLTS to your system; converting back may not even be required, as we can store the relation, etc. CLTS would then handle major normalization and testing, and once data qualifies as good enough, one may think of converting from there to your representation (?)...
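
One way to picture the stored relation: a simple one-way lookup table (identifiers on both sides are hypothetical), so that no back-conversion routine is ever needed:

```python
# Hypothetical identifiers on both sides; the stored pairs themselves
# are the record of the relation between the two systems.
CLTS_TO_OTHER = {
    "voiced alveolar stop consonant": "snd/d-0001",
    "rising tone": "seg/tone-rise",
}

def convert(clts_name):
    """One-way conversion from a CLTS name; returns None if unmapped."""
    return CLTS_TO_OTHER.get(clts_name)

print(convert("rising tone"))  # seg/tone-rise
```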

tresoldi commented 7 years ago

I was thinking about adapting it to CLTS, in fact. On the one hand we'd have this problem of granularity and lack of general expertise, but on the other I have more pressing limitations that are both practical and theoretical (for example, I am still not completely satisfied with my treatment of coarticulation in clicks and in some bizarre segments reported in languages such as Pirahã). As such, incorporating my model into CLTS would help stress-test both the model itself and the data from Phoible/Fonetikode/etc., as it should be trivial (when compared, say, to IPA descriptors) to generate "all" possible sounds.
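
Why generation is trivial under a feature model: enumerating sounds is just a cartesian product over feature values. A toy illustration with an invented feature subset:

```python
from itertools import product

# Invented toy feature subset; the real model would have many more values.
PHONATION = ("voiced", "voiceless")
PLACE = ("bilabial", "alveolar", "velar")
MANNER = ("stop", "fricative", "nasal")

all_sounds = [" ".join(bundle) for bundle in product(PHONATION, PLACE, MANNER)]
print(len(all_sounds))  # 2 * 3 * 3 = 18 generated descriptors
```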

My system is nowhere near mature enough to be the underlying base for anything, but some of the practicalities should be solved if it matures as I intend. One of the aspects I am always considering is the application in automatic transcription and correction systems, be they simple finite-state transducers or more complex sequence-to-sequence translators. But I am venturing too far into the future.

Now, the organization is quite different, so I will probably need to "append" it to a different system (likely BIPA) instead of preparing it as an independent system, at least for the time being. When I have a minimal example ready I will make it public, so we can decide if and how to start incorporating it.

LinguList commented 7 years ago

Sounds reasonable. I think in the meantime we can profit from using similar test sets, which also show the degree of variation across datasets. As we have collected quite a lot of them, most of which are already published, they may be an excellent starting point for working along the different directions, and would also make potential communication between the systems easier.

LinguList commented 7 years ago

Working on this now, and I have first examples. The good news is: by splitting features into "from" and "to" to indicate the first and second vowel, we can handle quite a lot in a simple way. Disadvantage: additional markup (nasalization characters, etc.) will be difficult to re-create and normalize automatically, which is why we need to hard-code diphthongs, I suppose. My first PR (hopefully in a few days) will simply list many possibilities by combining existing vowels with each other, thus allowing us to handle those sufficiently. If hard-coding of feature bundles (using the "names"; since they are sets, we don't need to worry about changes in order) is the longer-term goal anyway (each sound gets one ID, but if there are 50 new sounds, we can use clts to create new ones), hard-coding these bundles is not as problematic.
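
A sketch of the order-independent lookup of hard-coded bundles; the glyphs, feature names, and IDs below are invented placeholders:

```python
# Invented glyphs, feature names and IDs; frozensets make the lookup
# insensitive to the order in which feature names are written.
DIPHTHONGS = {
    frozenset({"from_open", "from_front", "to_close", "to_back", "to_rounded"}):
        ("au", "dip-0001"),
    frozenset({"from_open", "from_front", "to_close", "to_front"}):
        ("ai", "dip-0002"),
}

def lookup(name):
    """Resolve a space-separated feature name, in any order, to its
    hard-coded glyph and ID."""
    return DIPHTHONGS.get(frozenset(name.split()))

print(lookup("to_back to_rounded from_front from_open to_close"))  # ('au', 'dip-0001')
```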