cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

Setting up baselines for testing: Phoible, PBase and Fonetikode #1

Closed LinguList closed 6 years ago

LinguList commented 7 years ago

The test in cookbook/phoible.py is a good proof of concept, showing that many things work well. By generating sounds, some 300+ Phoible symbols can be nicely and uniquely defined. The drawback is the cases which do not work; here, I think a subselection will need to be made, given that Phoible has some features which may be problematic:

  1. order seems to matter, so they distinguish kʰʲ and kʲʰ, which does not really make sense to me (especially given the distribution). CLTS should normalize this and convert both to a single variant:
    >>> from pyclts.clts import CLTS
    >>> clts = CLTS()
    >>> snd1, snd2 = clts.get_sound('kʰʲ'), clts.get_sound('kʲʰ')
    >>> print(snd1, snd2)
    kʰʲ kʰʲ
    >>> print(snd1.source, snd2.source)
    kʲʰ kʰʲ
  2. They also show impossible sounds, that is, sounds where one wonders whether they REALLY exist as such, e.g., a "breathy-voiced voiceless palatal plosive consonant" as opposed to the "breathy-voiced voiced palatal plosive consonant" ɟ̤.

I think we can theoretically produce some impossible sounds, like, say, aspirated voiced stuff, but it is essential that we keep some "features", e.g., breathy-voice or creaky-voice, as alternatives to the phonation category.

So in general, we should try to get a good baseline of characters from Fonetikode, PBase, and Phoible to illustrate how they are displayed in CLTS, and ideally, we could already link to them.
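
As a minimal sketch of such a baseline check (the file name phoible_symbols.txt is hypothetical, and treating failures of get_sound as exceptions is an assumption about the API):

    # Sketch: map a list of Phoible symbols through CLTS and report
    # which ones resolve; file name and error handling are assumptions.
    from pyclts.clts import CLTS

    clts = CLTS()
    resolved, unresolved = {}, []
    with open('phoible_symbols.txt', encoding='utf-8') as handle:
        for line in handle:
            symbol = line.strip()
            if not symbol:
                continue
            try:
                # str(sound) is the normalized BIPA representation
                resolved[symbol] = str(clts.get_sound(symbol))
            except Exception:
                unresolved.append(symbol)

    print('resolved: {0}, unresolved: {1}'.format(len(resolved), len(unresolved)))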

tresoldi commented 7 years ago
  1. I am aware of this kind of distinction in Phoible; it is a minor headache for me too. Concerning your example, the two do reflect the same sound, but the differentiation makes sense from a phonological point of view (Phoible's sources are mostly from and for fieldwork, as you know): kʰʲ would be a base aspirated phone(me) palatalized by some process (which, btw, would probably go towards lenition), and kʲʰ a base palatalized phone(me) aspirated by some other process. It is even possible to postulate an inventory in which both occur and are phonemically distinct (such an inventory would be very unstable, but who knows what a sadistic conlanger is capable of).

While such detail might be relevant for historical linguistics, I am not really sure how to solve it, and given that these are probably the results of processes and not part of the languages' inventories, I would suggest using only one variant; otherwise we will quickly descend into a generative spiral of deciding what is the base form, what is the surface form, and so on.

  2. This case too looks like the result of a phonological process described in an articulatory way. In the case of phonation, my idea is to use real (i.e., non-boolean) values to indicate the level of phonation and to "translate" them into voiceless/breathy/.../creaky according to a) the type system and b) the language (so that what is breathy in one system might be voiceless in another; see the sketch below).
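
A toy illustration of that idea (the thresholds and per-system tables below are invented for the example):

    # Toy sketch of scalar phonation values; thresholds and system
    # names are invented for illustration only.
    PHONATION_LABELS = {
        # system name -> ordered (upper_bound, label) pairs on a 0..1 scale
        'system_a': [(0.2, 'voiceless'), (0.5, 'breathy'), (0.8, 'voiced'), (1.0, 'creaky')],
        'system_b': [(0.35, 'voiceless'), (0.8, 'voiced'), (1.0, 'creaky')],
    }

    def phonation_label(value, system):
        """Translate a scalar phonation value into a discrete label."""
        for upper, label in PHONATION_LABELS[system]:
            if value <= upper:
                return label
        raise ValueError('phonation value out of range: {0}'.format(value))

    # the same scalar comes out as different categories in different systems:
    print(phonation_label(0.3, 'system_a'))  # breathy
    print(phonation_label(0.3, 'system_b'))  # voiceless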

I know it is a very debatable suggestion, and I would never push for it now.

In any case, a system able to incorporate (most of) Fonetikode, PBase, and Phoible in a test would be a good baseline and an impressive result.

LinguList commented 7 years ago

I guess I see why some linguists might come up with these things, but I have problems imagining that they would be distinctive (similar to tone systems, where people write something like ⁵⁴⁴, although it is never really audible, since most tones are realized in quick speech as high or low). Somehow, and this is a problem for CLTS, we will need to fix a certain level of granularity, as I want to avoid collecting as many differences as Phoible does, and would rather make languages more easily comparable at this level. And who knows in those cases: maybe people just started annotating their language one way because they did not know which order was best, and then kept going with it, right? So we always have human error in transcription vs. fine-grained differences vs. inter-speaker variation, etc. I guess the decision to fix a degree of coarseness will in any case raise objections, but if we don't fix it, we lose comparability. Maybe, by making this clear from the beginning, we can prevent people from getting too angry at us.

tresoldi commented 7 years ago

It is interesting to see how we start to face the same problems while coming from different directions and with different goals.

My idea is to allow for as much granularity as possible in terms of description of articulation and airflow, even considering "possible but not attested" sounds. It is of course far beyond my intention to deal with areas such as ventriloquism or speech disorders (which are surprisingly well covered by my sound/segment distinction), yet at the same time considering them in an abstract unified model has been very helpful in highlighting its shortcomings -- as long as we remember that it is an abstraction for modeling purposes, and not some deep theoretical model of how phonology "really" works.

For the time being, my solution has essentially been to specify defaults selected according to majorities, for which typologists would probably find me a bit too (western-)eurocentric. In any case, I agree that it should be clear from the beginning, and that we should put a lot of emphasis on tests and tutorials -- in fact, this might be a case where defining the tests first and then working to meet them is advisable.
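
As a sketch of what "tests first" could look like here (the import path follows the snippet above; the expectations are just the normalizations discussed in this thread):

    # Sketch of a test-first baseline, e.g. for pytest; expectations
    # are taken from the kʰʲ/kʲʰ example discussed above.
    from pyclts.clts import CLTS

    def test_diacritic_order_is_normalized():
        clts = CLTS()
        snd1, snd2 = clts.get_sound('kʰʲ'), clts.get_sound('kʲʰ')
        # both orders map to one and the same normalized sound ...
        assert str(snd1) == str(snd2)
        # ... while .source keeps track of the original inputs
        assert {snd1.source, snd2.source} == {'kʰʲ', 'kʲʰ'}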

tresoldi commented 7 years ago

One note I was thinking about: despite its name, the IPA is generally used as a phonemic alphabet, where the closest match to a given sound segment tends to be employed in contrast with other segments rather than to actually indicate how the sound is or should be pronounced. For example, while /t/ and /d/ indicate alveolar sounds, language inventories use them as they are when representing dental or even palato-alveolar sounds, both cases where diacritics would be needed. Such practice makes sense for its simplicity and usually makes no difference (in the case above, languages rarely have a phonemic difference between dentals and alveolars, and in those cases the diacritics can be used; the same applies, for example, in the description of speech disorders).

When performing cross-linguistic studies, however, it is not so simple to solve such problems, particularly in the case of vowels: while a glyph like /t/ represents segments that are more or less alike across languages (and, in particular, which tend to be represented as such by all linguists), the segmentation of the vowel space is far more unstable. Not only do the areas represented by the same glyph show great variation even in related languages, but it has already been demonstrated that the vowel areas attributed to the same glyph in different languages sometimes do not overlap at all, such as with /a/.

The schwa, the "default" vowel, is at the same time a solution to and an example of this problem. While it is described in the IPA as a mid central unrounded non-tense vowel, it is actually used to represent generic "middle" vowels, with varying degrees of openness, roundedness, etc., even within the same speaker.

This is a problem I have been facing for some time, only partially solved by features (as in Phoible, where a null value, distinct from positive or negative, seems to be specified in such cases, overlapping with its usage as "not applicable"). We should consider the possibility of extending BIPA to express "classes" (for example, "open vowel"), defaulting to some middle or acceptable value if needed.
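
A minimal sketch of what such class expressions could look like (the table of defaults is invented for the example):

    # Hypothetical sketch of class expressions on top of BIPA: a class
    # name resolves to an invented default glyph; concrete symbols pass through.
    CLASS_DEFAULTS = {
        'open vowel': 'a',
        'mid vowel': 'ə',   # the "default" vowel discussed above
        'close vowel': 'i',
    }

    def resolve(symbol_or_class):
        """Return a concrete glyph, falling back to a class default."""
        return CLASS_DEFAULTS.get(symbol_or_class, symbol_or_class)

    print(resolve('open vowel'))  # a
    print(resolve('ɛ'))           # passed through unchanged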

LinguList commented 7 years ago

The good thing in historical linguistics is that we can often circumvent this by applying a "symbolic" perspective: what counts is the correspondence or the contrast, not the value (features, etc.) attached to the sound. For alignments and for automatic cognate search, this can thus be ignored for the most part.

The problem re-occurs, however, when trying to build up information on likely sound-correspondence or likely sound-change patterns. Here, one could break things down to ASJP (there is this paper by Brown et al. 2013), sound classes, or something else, all not very informative in the end (although I like the paper in general).

There will have to be some breaking down, but so far, I hope we can solve this in BIPA by employing a fall-back procedure which can reduce complexity to varying degrees. Think of two or more additional sound-class systems on top of the ones we already have, but with 100, 200, or even 500 different sound symbols (still segmental, as CLTS will be segmental at its core).

What is important here is that the handling of fall-backs is implemented properly, as indicated for BIPA vs. ASJP in #4.
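
As a sketch of such a fall-back chain (the mapping tables are invented; only the idea of successively coarser, ASJP-like systems comes from the discussion):

    # Invented sketch of a fall-back chain from fine-grained symbols
    # down to ever coarser class systems; the tables are illustrative only.
    FINE = {'kʰʲ': 'kh', 'kʲ': 'kh', 'g': 'g'}   # e.g. a few hundred classes
    COARSE = {'kh': 'K', 'g': 'K'}               # e.g. ASJP-like classes

    def fall_back(symbol, *systems):
        """Map a symbol through successively coarser systems, keeping
        the last value that still has an entry."""
        value = symbol
        for system in systems:
            if value not in system:
                break
            value = system[value]
        return value

    print(fall_back('kʰʲ', FINE))          # kh (fine-grained class)
    print(fall_back('kʰʲ', FINE, COARSE))  # K  (maximum coarseness)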

LinguList commented 6 years ago

Closing this, as it is getting too long to read; will carry on in #19