cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

Final fixes #123

Closed tresoldi closed 5 years ago

tresoldi commented 5 years ago

This PR relates to most of the stuff discussed in https://github.com/cldf/clts/issues/121

In detail:

Things I didn't implement from the issue:

Much of the issue needs to be discussed, perhaps point by point; among those points:

As I said, most if not all of the other issues are related to individual transcription systems/data; I've added to CLTS/BIPA those that I found important and necessary, but the remaining ones should probably be kept in their specific contexts.

codecov-io commented 5 years ago

Codecov Report

Merging #123 into master will not change coverage. The diff coverage is 100%.


@@           Coverage Diff           @@
##           master     #123   +/-   ##
=======================================
  Coverage   99.56%   99.56%           
=======================================
  Files           8        8           
  Lines         696      696           
=======================================
  Hits          693      693           
  Misses          3        3
Impacted Files          Coverage Δ
src/pyclts/models.py    100% <100%> (ø) :arrow_up:


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update acb4323...68f0b4a.

tresoldi commented 5 years ago

I've run the local commands for release and the checks, but please note that most of the steps for releasing (such as bumping version number, preparing for PyPI etc.) are not in place yet.

I can take care of that once the changes are approved.

LinguList commented 5 years ago

> While, as @cormacanderson, I am very in favor of adding qp and db digraphs for labio-dental affricates, there is no rush in doing so and it is probably a good idea to only include them in a second release (Cormac is also waiting for an answer from Anne-Maria). They are not formally IPA, but my opinion is that they would fit very well in BIPA considering that [i] there is no independent glyph for those sounds, [ii] the graphical solution is very good, and [iii] the symbols have been in use for quite some time.

I am against that, because of the parsing procedure: if they qualify as a cluster, we can't handle them; we can only handle them if they count as a single consonant.

I'm happy with this PR, but also note that there is no way to alias diphthongs: a diphthong is again a derived sound, so there is no way to define it. Here, you need to go to the data in sources/ and override manually. For the future, this is the preferred way, also for @cormacanderson to propose corrections that go beyond problems of our BIPA parser.

tresoldi commented 5 years ago

Sorry, my comment was not clear: the ȹ and ȸ digraphs are used for labiodental plosives, which would make it easier to annotate labiodental affricates with stuff like ȹs.

As for the diphthongs, that is one more thing to discuss.

cormacanderson commented 5 years ago

Thanks very much for this @tresoldi. I have a few comments.

1) I don't see the logic of roundedness vs rounding. As far as I can see this is simple duplication. The relevant diacritics are defined twice for vowels in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv, i.e. line 83 and line 86, and line 84 and line 85. We have looked at these and they aren't different Unicode code points, just simple duplicates. There is no reason for this; it's just a simple oversight, unless I am misunderstanding something. I suggest we merge them into a single feature "roundedness", with two possible values, "more-rounded" and "less-rounded".

2) I'm inclined to consider the use of ɗ for dental rather than alveolar as a problem of the transcription dataset, and to think that BIPA should reflect the correct usage.

3) We should add http://graphemica.com/%C9%9D U+025D, i.e. ɝ

4) For the next release, I will do another pass through, but would note here already that the treatment of diphthongs, including ej, eʲ, etc., and of triphthongs should be prioritised (possibly also consonants such as tswʰ?). There's also more that might need to be done with clicks.
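A check for duplicated diacritics like the one described in point 1 could be sketched roughly as follows. This is only an illustration: the column names (GRAPHEME, TYPE, FEATURE, VALUE), the helper name `find_duplicate_graphemes`, and the sample rows are assumptions and do not reproduce the actual contents of diacritics.tsv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt: column names and rows are illustrative only and are
# not copied from the real diacritics.tsv.
SAMPLE_TSV = (
    "GRAPHEME\tTYPE\tFEATURE\tVALUE\n"
    "\u0339\tvowel\trounding\tmore-rounded\n"
    "\u031c\tvowel\trounding\tless-rounded\n"
    "\u031c\tvowel\troundedness\tless-rounded\n"
    "\u0339\tvowel\troundedness\tmore-rounded\n"
    "\u0303\tvowel\tnasalization\tnasalized\n"
)


def find_duplicate_graphemes(tsv_text):
    """Map (grapheme, type) to its rows when it is defined more than once."""
    seen = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        key = (row["GRAPHEME"], row["TYPE"])
        seen[key].append((line_no, row["FEATURE"], row["VALUE"]))
    return {key: rows for key, rows in seen.items() if len(rows) > 1}


if __name__ == "__main__":
    for (grapheme, kind), rows in find_duplicate_graphemes(SAMPLE_TSV).items():
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in grapheme)
        print(kind, code_points, rows)
```

Grouping by code point rather than by feature name is what exposes the duplication: two rows with the same grapheme and type but different feature names show up under one key.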

LinguList commented 5 years ago

  1. I agree, but be careful: rounding is a main vowel feature (or is it roundedness?), so if we drop one, it should be the one that is NOT a main feature of vowels (check vowels.tsv to confirm).
  2. No opinion here, as long as the BIPA form is less complicated.
  3. If so, we need to add the composed version (E + retroflexion). Or is there a difference from the thingy with schwa?
  4. Triphthongs won't be accepted, as they don't make sense and can be decomposed into segments most of the time. If you want to add them, we need to add a new class of sounds (probably less problematic, but not that trivial). Diphthongs like ej etc. should be handled in the transcription data by providing valid counterparts in BIPA (ei, or e + i with the little thing under it indicating a glide).

tresoldi commented 5 years ago

  1. I agree that rounding would be better treated as a continuum, and in any case we should not have, as we do now, stuff like "more-rounded unrounded blablabla vowel". These are found especially in the Eurasian transcription data, and arise because the more-rounded value is a different feature which is appended to the base vowel. The problem here is that people use this "more-rounded" diacritic with unrounded vowels, when they actually just mean "rounded". A solution could not be ready quickly enough to fix this by yesterday's noon, as it would break some datasets...

  2. I would in theory, but the official IPA chart has it in the middle of the coronals, just under the click which, as far as I know, everyone considers alveolar and not dental. If we want the BIPA to be a superset of IPA, we are kinda stretching here...

  3. It is already there:

In [3]: bipa['rhotacized unrounded open-mid central vowel'].s
Out[3]: 'ɜ˞'

As usual, this is a Unicode problem of pre-composed vs. composed characters. It should be part of the normalization, but we'd better wait for the next version. (This was my fault: I looked at the list of problems and parsed the grapheme in my mind, and it didn't occur to me that it could be a pre-composed one...)

  4. I would be very, very timidly in favor of adding triphthongs as a class in order to get people to adopt CLTS, even if I agree with Mattis that it does not make much sense from our point of view. The lingpy pipeline, for example, would surely need to split them up, just like complex consonantal clusters. I didn't want to touch this as it needs more discussion; @cormacanderson, could you provide some examples where in your opinion it makes much more sense to treat a subsequence this way?

LinguList commented 5 years ago

> As usual, this is a Unicode problem of pre-composed vs. composed characters. It should be part of the normalization, but we'd better wait for the next version. (This was my fault: I looked at the list of problems and parsed the grapheme in my mind, and it didn't occur to me that it could be a pre-composed one...)

Yes, this is a case for normalization.
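A minimal sketch of such a normalization step, assuming a hand-maintained mapping table: the names `PRECOMPOSED` and `normalize_grapheme` are hypothetical and not part of pyclts, and the second mapping entry is only an analogous illustration. A custom table is needed because U+025D has no canonical decomposition in Unicode, so NFD alone leaves it unchanged.

```python
import unicodedata

# Assumed mapping from pre-composed IPA letters to base letter + rhotic hook.
# Only the first entry is discussed in this thread; the second is an
# analogous illustration.
PRECOMPOSED = {
    "\u025d": "\u025c\u02de",  # ɝ -> ɜ + ˞
    "\u025a": "\u0259\u02de",  # ɚ -> ə + ˞
}


def normalize_grapheme(grapheme):
    """Apply NFD first, then the custom table for letters NFD cannot split."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    return "".join(PRECOMPOSED.get(ch, ch) for ch in decomposed)


if __name__ == "__main__":
    # NFD by itself does not decompose U+025D.
    print(unicodedata.normalize("NFD", "\u025d") == "\u025d")
    # The custom table yields the two-code-point sequence.
    print([f"U+{ord(ch):04X}" for ch in normalize_grapheme("\u025d")])
```

Running NFD before the table lookup means ordinary pre-composed letters (for example ã) are still split into base plus combining mark by the standard algorithm, and the table only has to cover the letters Unicode treats as atomic.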

In general, we need to always be aware of where to fix problems. We have the following:

  1. in the original transcription data (the preferred way, as most errors are there now; see the folder sources, where you can easily fix them by putting a correct BIPA value in the left-most cell)
  2. in the normalization, or by manipulating bipa's consonants, vowels, diacritics, etc.
  3. in the deep code of clts

Keeping in mind where a problem needs to be fixed will help in the future, and we'll have to adjust our labels accordingly. I won't lift a finger to solve problems like "ej" in the code, nor to handle triphthongs, but if any of you wants to adjust the code accordingly here, feel free to do so. I think, however, that it's more important to make an explicit list of accepted consonant combinations for clusters (listing all nasal+stop, stop+nasal, etc., whatever you want), as those are currently produced in an erratic fashion.
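An explicit list of accepted cluster patterns could be sketched along these lines. Everything here is hypothetical: `SOUND_CLASS`, `ACCEPTED_CLUSTERS`, and `is_accepted_cluster` are illustrative names, and a real version would presumably derive the classes from BIPA features instead of a hard-coded table.

```python
# Hypothetical sound classes for a handful of segments; illustrative only.
SOUND_CLASS = {
    "m": "nasal", "n": "nasal", "ŋ": "nasal",
    "p": "stop", "b": "stop", "t": "stop",
    "d": "stop", "k": "stop", "g": "stop",
    "s": "sibilant", "z": "sibilant",
    "l": "liquid", "r": "liquid",
}

# The explicit whitelist: only these class sequences count as clusters.
ACCEPTED_CLUSTERS = {
    ("nasal", "stop"),
    ("stop", "nasal"),
}


def is_accepted_cluster(segments):
    """Return True only if the class sequence of the segments is whitelisted."""
    try:
        classes = tuple(SOUND_CLASS[segment] for segment in segments)
    except KeyError:
        # Unknown segments are rejected instead of guessed at.
        return False
    return classes in ACCEPTED_CLUSTERS


if __name__ == "__main__":
    for candidate in (["m", "b"], ["t", "n"], ["s", "t"]):
        print("".join(candidate), is_accepted_cluster(candidate))
```

Because the check is a lookup in a closed set, adding or removing a cluster type is a one-line change, and anything not listed is rejected by default.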

LinguList commented 5 years ago

IMPORTANT: less rounded and more rounded as "roundedness" should be given the preference, "rounding" is a main feature of a sound, and per definitionem, they can't be modified via diacritic, unless you add FULL CHARACTERS WITH DIACRITICS in vowels.tsv! This is not up for discussion, there is a workaround, and since the character is duplicated, roundedness should be deleted. These lines are ignored in the code by now anyway.

tresoldi commented 5 years ago

> Keeping in mind where a problem needs to be fixed will help in the future, and we'll have to adjust our labels accordingly. I won't lift a finger to solve problems like "ej" in the code, nor to handle triphthongs, but if any of you wants to adjust the code accordingly here, feel free to do so. I think, however, that it's more important to make an explicit list of accepted consonant combinations for clusters (listing all nasal+stop, stop+nasal, etc., whatever you want), as those are currently produced in an erratic fashion.

I fully agree here. Triphthongs should probably only have two patterns: with a trailing schwa or with a central vocoid between approximants. For consonant clusters, I would only really accept sibilants+plosives+liquids, but I trust Cormac might convince me here. :wink:

No matter what, the priority should however be adding more normalizations and other transcription systems. Now that I know the code in more detail it should not take me too long to do a PR with my unified feature system, which would be my priority in terms of CLTS innovations.