tresoldi commented 5 years ago

This PR relates to most of the stuff discussed in https://github.com/cldf/clts/issues/121

In detail:

I have added the missing linguolabial as a value for the consonant feature place, as well as the most important sounds to the catalog. There are some issues for discussion here: per IPA, all linguolabial consonants need a diacritic (i.e., there is no linguolabial consonant with its own, diacritic-less representation), which in turn makes things a bit complex when setting an alias. As such, nothing like U032B (combining arches) was implemented: the only diacritic for linguolabial place of articulation is the standard IPA U033C (the seagull).
I have merged the "centralization", "retraction", and "advancement" features into a single feature "relative_articulation" (possibly not the best name), as we were allowing for things like "advanced retracted centralized open front vowel". This is now fixed.
The above feature of "relative_articulation" can now be applied to consonants (so we support things like t̟ and ŋ˗).
Consonants can now have "mid-long" as "duration", so we allow for things like mˑ.
Laminal fricatives are there (such as s̻ and z̻).
Grapheme ł (Polish letter) is now normalized to ɬ (voiceless alveolar lateral fricative).
I've added ᶑ (voiced retroflex implosive), b̪ (voiced labio-dental stop), and p̪ (voiceless labio-dental stop) to BIPA.
I've renamed feature value "labialized-velar" to "labio-velar" and "labialized-palatal" to "labio-palatal" in CLTS; all the transcription data and systems were updated (using sed from the command line; I've checked many times and don't think there are any false positives or negatives in the replacements).
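
For reference, the rename in the last item can be sketched in Python rather than sed. This is only an illustration, not the command that was actually run: the `RENAMES` table mirrors the two renames listed above, and `rename_feature_values` is a hypothetical helper. The lookarounds are there so that a value is not rewritten when it is merely part of a longer hyphenated token.

```python
import re

# The two renames described above (old value -> new value).
RENAMES = {
    "labialized-velar": "labio-velar",
    "labialized-palatal": "labio-palatal",
}


def rename_feature_values(text):
    """Replace old feature values with the new ones (hypothetical helper)."""
    for old, new in RENAMES.items():
        # Do not match when the value is glued to a word character or hyphen,
        # which would indicate a longer, different token.
        pattern = r"(?<![\w-])" + re.escape(old) + r"(?![\w-])"
        text = re.sub(pattern, new, text)
    return text


print(rename_feature_values("voiced labialized-velar approximant consonant"))
```

Running the same function over the files a second time should change nothing, which is a cheap way to double-check for leftovers.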

Things I didn't implement from the issue:

The dot for syllabicity would lead to confusion and it has not been standard IPA for quite some time; its implementation, if really necessary, should be part of specific TranscriptionData and TranscriptionSystems, not CLTS/BIPA.
I didn't add ◌͇ (U0347) as a diacritic for alveolar, as it is not IPA (once more, it should be included in ad-hoc transcription data, not in BIPA), and diacritics for place of articulation are better part of the catalog of sounds than that of diacritics (as place of articulation is one of the essential features).
I didn't add uvularization of vowels (cases such as ʌʶ); while I believe there is room for them, I agree with @cormacanderson that it is questionable as a general feature, and we should discuss this in more detail; one potential source can be found here, but many more references are presented by the most basic Google search (mostly when describing dialects, and I couldn't find a clear-cut case where it is phonemic).
I didn't change ɗ from dental to alveolar (even though alveolar as a default place of articulation, requiring a diacritic for the dental, makes sense to me); we need to discuss this in more detail, especially considering that the transcription data we are linking to seem to use it as dental.
I didn't add roundness to approximants, as this would involve adding the feature to all consonants; while I don't oppose this from an articulatory point of view, it should be further discussed (if we go for the feature only for approximants, things are more complicated, as we'd need to either set approximants as a different sound type from consonants or to change the code in order to implement the limitation).
I made no changes to the position of the voiceless diacritic (above or below the glyph), as it is currently not possible to have a one-fits-all solution; any apparently simple change can result in unintended consequences (the easiest solution is probably to just default to one position and manually list the alternatives in the sound catalog, but once more this is something to be discussed and agreed upon).
While, like @cormacanderson, I am very much in favor of adding qp and db digraphs for labio-dental affricates, there is no rush in doing so and it is probably a good idea to only include them in a second release (Cormac is also waiting for an answer from Anne-Maria). They are not formally IPA, but my opinion is that they would fit very well in BIPA considering that [i] there is no independent glyph for those sounds, [ii] the graphical solution is very good, and [iii] the symbols have been in use for quite some time.
I didn't add ı (dotless i U0131) as an alias of ɯ U026F: this is really a matter of Turkish orthography and not phonological transcription, and if necessary should be part of a Turkish orthographic profile.
I didn't touch triphthongs and all other complex clusters, as CLTS only supports two-sound clusters by design. I can understand objections to that, but this should really be first discussed with @lingulist.
Some redundant/tautological information (such as syllabic vowels and nasalized nasal stops) is part of the design of CLTS; this can be changed by checking for redundant features, but it is not something that can be implemented with five minutes of coding. In any case, @lingulist should be part of this discussion.
I didn't add ḱ and ɡ́ as aliases of palatal stops, as this should be part of transcription data / orthographic profiles dealing with PIE, not BIPA.
I tend to agree with @cormacanderson that rounding and roundedness should not be different features, which would mean adding a continuum of unrounded, less-rounded, rounded, and more-rounded values. However, I didn't change that as it would break some datasets such as Eurasian (which might be problematic anyway, with its "more-rounded unrounded" vowels which, from a quick inspection, likely come from problems in parsing the diacritic for "more-rounded" as one for "rounded", see the case of Bulgarian on their website) and it is something that should be investigated (for example, should we just take this as aliases for protrusion and compression, or endo- and exolabial?).
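
To make the point about redundant features (syllabic vowels, nasalized nasal stops) more concrete, a check could look roughly like the sketch below. It rests on assumptions: sounds are represented as plain dictionaries, the feature names are illustrative, and the `REDUNDANT` rules and the `redundant_features` helper are hypothetical and not part of pyclts.

```python
# Each rule: (conditions on the sound, feature, value that would be redundant).
# Feature names and values are illustrative assumptions.
REDUNDANT = [
    ({"type": "vowel"}, "syllabicity", "syllabic"),
    ({"type": "consonant", "manner": "nasal"}, "nasalization", "nasalized"),
]


def redundant_features(sound):
    """Return the (feature, value) pairs of `sound` that are tautological."""
    found = []
    for conditions, feature, value in REDUNDANT:
        matches = all(sound.get(key) == val for key, val in conditions.items())
        if matches and sound.get(feature) == value:
            found.append((feature, value))
    return found


print(redundant_features({"type": "vowel", "syllabicity": "syllabic"}))
```

Whether such sounds should be rejected, silently simplified, or just flagged is a design decision this sketch leaves open.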

Many things from the issue need to be discussed, perhaps individually; among those:

Tone contours starting with zero (such as ⁰²)
Palatalized vowels as aliases for diphthongs (such as ɛʲ, and also long such as oːʷ)
Aspirated vowels (such as ɔʰ)

As I said, most if not all of the other issues are related to individual transcription systems/data; I've added to CLTS/BIPA those that I found important and necessary, but the remaining ones should probably be kept in their specific contexts.

codecov-io commented 5 years ago

Codecov Report

Merging #123 into master will not change coverage. The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #123   +/-   ##
=======================================
  Coverage   99.56%   99.56%           
=======================================
  Files           8        8           
  Lines         696      696           
=======================================
  Hits          693      693           
  Misses          3        3

Impacted Files	Coverage Δ
src/pyclts/models.py	`100% <100%> (ø)`	:arrow_up:

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update acb4323...68f0b4a.

tresoldi commented 5 years ago

I've run the local commands for release and the checks, but please note that most of the steps for releasing (such as bumping version number, preparing for PyPI etc.) are not in place yet.

I can take care of that once the changes are approved.

LinguList commented 5 years ago

While, like @cormacanderson, I am very much in favor of adding qp and db digraphs for labio-dental affricates, there is no rush in doing so and it is probably a good idea to only include them in a second release (Cormac is also waiting for an answer from Anne-Maria). They are not formally IPA, but my opinion is that they would fit very well in BIPA considering that [i] there is no independent glyph for those sounds, [ii] the graphical solution is very good, and [iii] the symbols have been in use for quite some time.

I am against that, for reasons of the parsing procedure: we can handle them only if they count as a consonant, not if they qualify as a cluster.

I'm happy with this PR, but also note that there is no way to have diphthongs aliased, as a diphthong is again a derived sound, so there is no way to define it. Here, you need to go to the data in sources/ and manually override; for the future, this is the preferred way, also for @cormacanderson to propose corrections that go beyond problems of our bipa-parser.

tresoldi commented 5 years ago

Sorry, my comment was not clear: the ȹ and ȸ digraphs are used for labiodental plosives, which would make it easier to annotate labiodental affricates with stuff like ȹs.

As for the diphthongs, one more thing to discuss.

cormacanderson commented 5 years ago

Thanks very much for this @tresoldi. I have a few comments.
1) I don't see the logic of roundedness vs rounding. As far as I can see this is simple duplication. The relevant diacritics are defined twice here https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv for vowels, i.e. line 83 and line 86; line 84 and 85. We have looked at these and they aren't different Unicode code points, just simple duplicates. There is no reason for this, but it's just a simple oversight, unless I am misunderstanding something. I suggest we merge into a single feature "roundedness", with two possible values, "more-rounded" and "less-rounded".
2) I'm inclined to consider the use of ɗ for dental rather than alveolar as the problem of the transcription dataset, and that BIPA should reflect the correct usage.
3) We should add http://graphemica.com/%C9%9D U+025D, i.e. ɝ
4) For the next release, I will do another pass through, but would note here already that the treatment of diphthongs, including ej, eʲ, etc. and triphthongs should be prioritised (possibly also consonants such as tswʰ?). There's also more that might need to be done with clicks.

LinguList commented 5 years ago

1) I agree; be careful though, as rounding is a main vowel feature (or is it roundedness?), so if we drop one, it should be the one that is NOT a main feature of vowels (check vowels.tsv to confirm).
2) No opinion here, as long as the bipa form is less complicated.
3) If so, we need to add the composed version (E + retroflexation). Or is there a difference to the thingy with schwa?
4) Triphthongs won't be accepted, as they don't make sense and can be decomposed into segments most of the time; if you want to add them, we need to add a new class of sounds (probably less problematic, but not that trivial). Diphthongs like ej etc. should be handled in the transcription data by providing valid counterparts in bipa (ei or e + i with the little thing under it indicating glide).

tresoldi commented 5 years ago

1) I agree that roundness would better be a continuum, and in any case we should not have, as we do now, stuff like "more-rounded unrounded blablabla vowel". These are found especially in the Eurasian transcription data, and are due to the more-rounded value being a different feature which is appended to the base vowel. The problem here is that people use this "more-rounded" diacritic with unrounded vowels, when they actually just mean "rounded". The solution would not be quick enough to fix this by yesterday's noon, as it would break some datasets...
2) I would in theory, but the official IPA chart has it in the middle of the coronals, just under the click which, as far as I know, everyone considers alveolar and not dental. If we want the BIPA to be a superset of IPA, we are kinda stretching here...
3) It is already there:

In [3]: bipa['rhotacized unrounded open-mid central vowel'].s                                                   
Out[3]: 'ɜ˞'

As usual this is a Unicode problem of pre-composed vs. composed. It should be part of the normalization, but we'd better wait for the next version (this was my fault, I looked at the list of problems and parsed the grapheme in my mind, it didn't occur to me that it could be a pre-composed one...)

I would be very, very timidly in favor of adding triphthongs as a class in order to have people adopting CLTS, even if I agree with Mattis that it does not make much sense from our point of view. The lingpy pipeline, for example, would surely need to split them up, just like complex consonantal clusters. I didn't want to touch this as it needs more discussion; @cormacanderson, could you provide some examples where in your opinion it makes much more sense to treat a subsequence this way?

LinguList commented 5 years ago

As usual this is a Unicode problem of pre-composed vs. composed. It should be part of the normalization, but we'd better wait for the next version (this was my fault, I looked at the list of problems and parsed the grapheme in my mind, it didn't occur to me that it could be a pre-composed one...)

Yes, this is an example for normalization.
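
As an illustration of the pre-composed vs. composed issue, here is a minimal sketch of a table-driven normalization step. The `NORMALIZE` table and the `normalize_grapheme` helper are hypothetical and not the actual CLTS normalization data; the two entries are taken from this thread (ɝ, and ł normalized to ɬ). As far as I can tell, the standard Unicode normalization forms leave ɝ (U+025D) untouched, which is why an explicit mapping is used on top of NFD.

```python
import unicodedata

# Hypothetical normalization table; the real CLTS data may differ.
NORMALIZE = {
    "\u025d": "\u025c\u02de",  # ɝ -> ɜ followed by the rhotic hook ˞
    "\u0142": "\u026c",        # ł -> ɬ
}


def normalize_grapheme(grapheme):
    """Decompose with NFD, then apply the explicit character mapping."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    return "".join(NORMALIZE.get(char, char) for char in decomposed)


for char in normalize_grapheme("\u025d"):
    # Print the code points so the result is readable regardless of font.
    print("U+%04X" % ord(char), unicodedata.name(char))
```

The normalized string is the one that can then be looked up in the sound catalog.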

In general, we need to always be aware of where to fix problems. We have the following:

in the original transcriptiondata (preferred way, as most errors now are there, see folder sources where you can easily fix by putting a correct bipa-value in the left-most cell)
in the normalization, or by manipulating bipa's consonants, vowels, diacritics, etc.
in the deep code of clts

Keeping in mind where a problem needs to be fixed will help in the future; we'll have to adjust our labels accordingly. I won't move a finger for solving problems like "ej" from the code, nor for handling triphthongs, but if any of you wants to adjust the code accordingly here, feel free to do so. I think, however, it's more important to make an explicit list of accepted consonant combinations for clusters (listing all nasal+stop, stop+nasal, etc. whatever you want), as those are currently produced in an erratic fashion.
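
An explicit list of accepted cluster types, as suggested above, might be sketched like this. The manner pairs in `ACCEPTED_CLUSTERS` are placeholders for illustration only (which combinations to accept is exactly what still needs to be decided), and `is_accepted_cluster` is a hypothetical helper, not pyclts code.

```python
# Placeholder whitelist of (first manner, second manner) pairs.
ACCEPTED_CLUSTERS = {
    ("nasal", "stop"),
    ("stop", "nasal"),
}


def is_accepted_cluster(first_manner, second_manner):
    """Return True only for combinations that are explicitly listed."""
    return (first_manner, second_manner) in ACCEPTED_CLUSTERS


print(is_accepted_cluster("nasal", "stop"))
print(is_accepted_cluster("stop", "fricative"))
```

With a whitelist, anything not listed is rejected, so new cluster types have to be added deliberately instead of being generated as a side effect of parsing.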

LinguList commented 5 years ago

IMPORTANT: less rounded and more rounded as "roundedness" should be given the preference, "rounding" is a main feature of a sound, and, per definitionem, main features can't be modified via diacritic, unless you add FULL CHARACTERS WITH DIACRITICS in vowels.tsv! This is a no-discussion, there is a workaround, and since the character is duplicated, roundedness should be deleted. These lines are anyway ignored in the code by now.
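
The constraint described here, that a main feature of a sound cannot be overridden by a diacritic, can be illustrated with a small sketch. It assumes sounds are plain dictionaries; `apply_diacritic` is a hypothetical name, and the real parser in pyclts may work differently. Which of the two feature names is actually the main one should be checked against vowels.tsv; the example simply follows the wording of the comment above.

```python
def apply_diacritic(sound, main_features, feature, value):
    """Return a copy of `sound` with a diacritic feature added.

    Raises ValueError if the diacritic targets one of the main features,
    since those are fixed by the base character.
    """
    if feature in main_features:
        raise ValueError("cannot modify main feature %r via diacritic" % feature)
    modified = dict(sound)
    modified[feature] = value
    return modified


# Assumption for illustration: "rounding" is the main feature.
base = {"rounding": "unrounded"}
print(apply_diacritic(base, {"rounding"}, "duration", "long"))
```

Under this rule a description such as "more-rounded unrounded vowel" could only arise if the diacritic value lives in a separate, non-main feature, which is the duplication discussed in this thread.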

tresoldi commented 5 years ago

Keeping in mind where a problem needs to be fixed will help in the future; we'll have to adjust our labels accordingly. I won't move a finger for solving problems like "ej" from the code, nor for handling triphthongs, but if any of you wants to adjust the code accordingly here, feel free to do so. I think, however, it's more important to make an explicit list of accepted consonant combinations for clusters (listing all nasal+stop, stop+nasal, etc. whatever you want), as those are currently produced in an erratic fashion.

I fully agree here. Triphthongs should probably only have two patterns: with a trailing schwa or with a central vocoid between approximants. For consonant clusters, I would only really accept sibilants+plosives+liquids, but I trust Cormac might convince me here. :wink:

No matter what, the priority should however be adding more normalizations and other transcription systems. Now that I know the code in more detail it should not take me too long to do a PR with my unified feature system, which would be my priority in terms of CLTS innovations.

cldf-clts / clts-legacy

Final fixes #123
