cldf-clts / clts

Cross-Linguistic Transcription Systems
https://clts.clld.org

Updates #32

Closed tresoldi closed 3 years ago

tresoldi commented 3 years ago

This PR updates BIPA with some of the most conservative changes proposed at https://github.com/cldf-clts/clts/issues/27

Most other changes are either debatable or would be better added only to specific transcription data, namely Phoible.

tresoldi commented 3 years ago

Pinging @cormacanderson

cormacanderson commented 3 years ago

To my phonologist ears, a "cluster" necessarily implies more than one unit segment. In practice, not just in PHOIBLE but also across a whole load of source material, what is meant by /kp/, /gb/ etc. is a coarticulated stop, not a "cluster" in the sense in which phonologists normally use the term. The same holds true for /ph/, /kw/, /ʔm/ and a whole host of similar segments with secondary articulation. This has a practical motivation and is probably an inevitable consequence of building a system of phonological transcription on the Latin alphabet.

I see a limited danger here for some segments where there is ambiguity, e.g. is /ʔj/ really /ʔʲ/ or /ʲʔ/? However, if we want to enhance comparability across datasets that are not of the best quality, such as some of the input into PHOIBLE and much of what is in Lexibank, I would consider the odd mistaken parse to be worth it.

If we are to be guided by practical usage, then I would parse all of these as unit segments, as that is what will give us maximal comparability and make the tool most useful. You are right, Mattis, that this does not conform to how people should use the IPA, but if we intend to make much use of material that is not just from the best sources (e.g. JIPA, good quality grammars), then this will really help a lot.

LinguList commented 3 years ago

@cormacanderson, as we have the class cluster specifically for doubly-articulated consonants, we could discuss changing its NAME to match the semantics, or we could leave it, as it is technically not important.

If we now treat doubly-articulated consonants as a case of a normal consonant to which we give some fake features, then we need to go through all 500+ clusters which ARE already accepted by CLTS and add them to our bipa list.

But the better way is to check the restrictions in the code and to make sure that we accept clusters such as nm and ŋm as well. When doing so, we do not need to add these to the normal bipa. This makes most sense to me. Calling these clusters also makes sense to me, since a cluster does not require an order (!). This is easily forgotten. And the fact that people always write kp and almost never pk indicates an inherent order ANYWAY. If not, we'd have to accept that ŋm is the same as mŋ.

So my suggestion is: do NOT add clusters here, but handle what is allowed as a cluster by changing the code. I can do that in a separate PR.
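
To illustrate the idea, here is a minimal sketch of what such a restriction could look like. It is not the pyclts implementation: `MANNER`, `ALLOWED_PAIRS`, `EXTENDED_PAIRS` and `is_allowed_cluster` are hypothetical names, the manner table is a toy subset, and the assumption that only stop + stop is accepted at the start is made up for the example. The point is only that accepting ŋm would mean adding one manner pair, not new rows in consonants.tsv.

```python
# Hypothetical sketch, not the pyclts implementation.
# Toy manner-of-articulation table for a handful of symbols.
MANNER = {
    "k": "stop", "p": "stop", "g": "stop", "b": "stop",
    "n": "nasal", "m": "nasal", "ŋ": "nasal",
}

# Assumed starting point for the example: only stop + stop is a cluster.
ALLOWED_PAIRS = {("stop", "stop")}


def is_allowed_cluster(first, second, allowed=ALLOWED_PAIRS):
    """Return True if both symbols are known and their manner pair is allowed."""
    if first not in MANNER or second not in MANNER:
        return False
    return (MANNER[first], MANNER[second]) in allowed


# Extending the allow-list is the whole change needed for nasal + nasal.
EXTENDED_PAIRS = ALLOWED_PAIRS | {("nasal", "nasal")}

print(is_allowed_cluster("k", "p"))                  # True
print(is_allowed_cluster("ŋ", "m"))                  # False
print(is_allowed_cluster("ŋ", "m", EXTENDED_PAIRS))  # True
```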

LinguList commented 3 years ago

Yes, this is fine, but as I said: adding it to the consonants.tsv is the wrong place, as this requires changing the code.

cormacanderson commented 3 years ago

I have to go out now, but I see now what you are saying. I am more concerned with practical implementation than how it is coded, but would you be happy then to extend it to /ŋmgb/? This seems a lot to deal with in the code and means breaking the rule of two that we had preventing also triphthongs. As I say, gotta run, but we can also just schedule a call of stakeholders and discuss all of this more calmly.

tresoldi commented 3 years ago

I was not aware that clusters were supposed to be doubly-articulated consonants. I might have been misled by the fact that they are treated in parallel to diphthongs, which also implied (to me) that the order is significant.

I will wait for @cormacanderson's comments, but I suppose we can indeed change the pyclts code to allow nasal + nasal. The name "cluster" still sounds inappropriate or at least not in line with phonological literature, and the "from X to Y" name is surely misleading, but this is a different problem and not related to the paper on the inventories. The PR should not depend on that.

LinguList commented 3 years ago

I do not want to have ŋmgb, as I do not believe that this exists as a real sound. It is a mere notation device here, with people drawing new segmentations, as I could also do for German, claiming that spr is one sound. The modification to allow for ŋm is trivially done, I can do it later, but no more. If we do not manage to get all of the complexity of phoible, so be it. We do not do this to get 100% for phoible, but to provide something that's useful.
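
A rough sketch of the two-component limit discussed here, under stated assumptions: `KNOWN_SEGMENTS`, `MAX_PARTS`, `split_segments` and `is_complex_segment` are hypothetical names, and each segment is taken to be a single character, which real transcriptions with diacritics would not satisfy. It only shows why ŋm would pass such a check while ŋmgb would not.

```python
# Hypothetical sketch: the names and the segment list are illustrative only.
KNOWN_SEGMENTS = {"ŋ", "m", "n", "g", "b", "k", "p"}
MAX_PARTS = 2  # the "rule of two": a complex segment has exactly two parts


def split_segments(text):
    """Split text into single-character segments; None if any is unknown."""
    parts = list(text)
    if any(part not in KNOWN_SEGMENTS for part in parts):
        return None
    return parts


def is_complex_segment(text):
    """Accept only strings that split into exactly MAX_PARTS known segments."""
    parts = split_segments(text)
    return parts is not None and len(parts) == MAX_PARTS


print(is_complex_segment("ŋm"))    # True
print(is_complex_segment("ŋmgb"))  # False
```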

LinguList commented 3 years ago

@tresoldi, let us please stop with the discussions about whether this is the right name. I think, if you check the clusters that are linked in clts, you will see that they obviously contain double articulations; otherwise, it doesn't make sense. I'll handle the code change, but later, as it seems that you have not checked the code in detail yet (judging from the cluster discussion), so please let us limit the PR here to the one or two things that were not controversial.

tresoldi commented 3 years ago

Following BIPA notation, ŋmgb could be ⁿgb but I don't know how we would parse it with the current code. Maybe ⁿgⁿb, as the order is supposed to be meaningless.

tresoldi commented 3 years ago

@LinguList I have checked the code. This is why I mentioned that diphthongs and clusters are treated at an equivalent level, both in models.py and during parsing. Even more, the parsing code allows different manners of articulation in the same cluster, unlike the IPA definition, and treats each sound in a cluster in a different way. For example, it allows stop+fricative but not fricative+stop (here: https://github.com/cldf-clts/pyclts/blob/8a1b823ea5fce6082ae61d675ced747f1168640b/src/pyclts/transcriptionsystem.py#L209 ).
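
The asymmetry can be sketched as follows. This is an illustration only, not the code at the linked line: `ORDERED_PAIRS`, `UNORDERED_PAIRS`, `accepts_ordered` and `accepts_unordered` are hypothetical names. Comparing manners as an ordered tuple accepts stop+fricative and rejects fricative+stop, whereas comparing them as a set would treat both orders alike.

```python
# Hypothetical sketch; names and the pair list are illustrative only.
ORDERED_PAIRS = {("stop", "fricative")}


def accepts_ordered(manner_a, manner_b):
    """Order-sensitive check: (a, b) must appear exactly as listed."""
    return (manner_a, manner_b) in ORDERED_PAIRS


# The same pairs with order discarded.
UNORDERED_PAIRS = {frozenset(pair) for pair in ORDERED_PAIRS}


def accepts_unordered(manner_a, manner_b):
    """Order-insensitive check: the two manners are compared as a set."""
    return frozenset((manner_a, manner_b)) in UNORDERED_PAIRS


print(accepts_ordered("stop", "fricative"))    # True
print(accepts_ordered("fricative", "stop"))    # False
print(accepts_unordered("fricative", "stop"))  # True
```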

You are right that the vast majority of the linked clusters only make sense as doubly-articulated sounds, but not all of them do. In Phoible, for example, we have things like "bz" and "ɟʑ" as clusters, along with entries like "mʱbʱ". I now realize some of these might have been parsing problems, but I was studying them in detail before opening this PR.

But it is good that this is now decided: we keep them as clusters. I can take care of adding nasal+nasal to pyclts if you want; otherwise I will wait.