Closed tresoldi closed 5 years ago
Merging #123 into master will not change coverage. The diff coverage is 100%.
```diff
@@           Coverage Diff           @@
##           master     #123   +/-   ##
=======================================
  Coverage   99.56%   99.56%
=======================================
  Files           8        8
  Lines         696      696
=======================================
  Hits          693      693
  Misses          3        3
```
Impacted Files | Coverage Δ | |
---|---|---|
src/pyclts/models.py | 100% <100%> (ø) | :arrow_up: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update acb4323...68f0b4a.
I've run the local commands for release and the checks, but please note that most of the steps for releasing (such as bumping version number, preparing for PyPI etc.) are not in place yet.
I can take care of that once the changes are approved.
While, like @cormacanderson, I am very much in favor of adding qp and db digraphs for labio-dental affricates, there is no rush in doing so and it is probably a good idea to only include them in a second release (Cormac is also waiting for an answer from Anne-Maria). They are not formally IPA, but my opinion is that they would fit very well in BIPA considering that [i] there is no independent glyph for those sounds, [ii] the graphical solution is very good, and [iii] the symbols have been in use for quite some time.
I am against that, for reasons of the parsing procedure: if they qualify as a cluster, we can't handle them; we can only handle them if they are a single consonant.
I'm happy with this PR, but also note that there is no way to have diphthongs aliased, as a diphthong is again a derived sound, so there is no way to define it. Here, you need to go to the data in `sources/` and manually override. For the future, this is the preferred way, also for @cormacanderson to propose corrections that go beyond problems of our bipa-parser.
Sorry, my comment was not clear: the ȹ and ȸ digraphs are used for labiodental plosives, which would make it easier to annotate labiodental affricates with stuff like ȹs.
As for the diphthongs, that is one more thing to discuss.
Thanks very much for this @tresoldi. I have a few comments.
1) I don't see the logic of roundedness vs rounding. As far as I can see this is simple duplication. The relevant diacritics are defined twice for vowels here https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv, i.e. line 83 and line 86; line 84 and 85. We have looked at these and they aren't different Unicode code points, just simple duplicates. There is no reason for this; it's just a simple oversight, unless I am misunderstanding something. I suggest we merge them into a single feature "roundedness", with two possible values, "more-rounded" and "less-rounded".
2) I'm inclined to consider the use of ɗ for dental, not alveolar, as the problem of the transcription dataset, and that BIPA should reflect the correct usage.
3) We should add http://graphemica.com/%C9%9D U+025D, i.e. ɝ
4) For the next release, I will do another pass through, but would note here already that the treatment of diphthongs, including ej, eʲ, etc. and triphthongs should be prioritised (possibly also consonants such as tswʰ?). There's also more that might need to be done with clicks.
I agree that roundness would better be a continuum, and in any case we should not have, as we do now, things like "more-rounded unrounded blablabla vowel". These are found especially in the Eurasian transcription data, and arise because the more-rounded value is a different feature that is appended to the base vowel. The problem here is that people use this "more-rounded" diacritic with unrounded vowels when they actually just mean "rounded". This could not have been fixed quickly enough for yesterday's noon, as it would break some datasets...
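The clash described here can be made concrete with a small sketch. It assumes a hypothetical single roundedness scale with the four values mentioned in this thread; `ROUNDEDNESS`, `appended_name`, and `apply_diacritic` are illustrative names and not part of pyclts.

```python
# Hypothetical sketch, not pyclts code: roundedness as one ordered scale
# instead of a second feature appended to the base vowel.
ROUNDEDNESS = ["unrounded", "less-rounded", "rounded", "more-rounded"]


def appended_name(base_value, diacritic_value):
    """Mimic the current behaviour: the diacritic value is simply prepended
    to the base vowel's value, so contradictory names can arise."""
    return diacritic_value + " " + base_value


def apply_diacritic(base_value, diacritic_value):
    """With a single feature, the diacritic value replaces the base value,
    so a name like "more-rounded unrounded" cannot be built."""
    for value in (base_value, diacritic_value):
        if value not in ROUNDEDNESS:
            raise ValueError("unknown roundedness value: " + value)
    return diacritic_value


print(appended_name("unrounded", "more-rounded"))    # more-rounded unrounded
print(apply_diacritic("unrounded", "more-rounded"))  # more-rounded
```

Whether the diacritic should replace the base value or shift it one step along the scale is exactly the open question here; the sketch only shows why two separate features produce contradictory names.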
I would in theory, but the official IPA chart has it in the middle of the coronals, just under the click which, as far as I know, everyone considers alveolar and not dental. If we want the BIPA to be a superset of IPA, we are kinda stretching here...
It is already there:
```python
In [3]: bipa['rhotacized unrounded open-mid central vowel'].s
Out[3]: 'ɜ˞'
```
As usual, this is a Unicode problem of precomposed vs. decomposed forms. It should be part of the normalization, but we'd better wait for the next version. (This was my fault: I looked at the list of problems and parsed the grapheme in my mind, and it didn't occur to me that it could be a precomposed one...)
Yes, this is an example for normalization.
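A minimal sketch of why this case needs an explicit normalization entry rather than standard Unicode normalization alone: U+025D (ɝ) has no canonical decomposition, so NFD leaves it unchanged. The mapping table and the function name `normalize_grapheme` are assumptions for illustration, as is the guess that the BIPA grapheme is the sequence U+025C U+02DE; the actual CLTS normalization data may be organised differently.

```python
import unicodedata

# Assumption: a hand-maintained replacement table (hypothetical, not the
# actual CLTS normalization file).
EXTRA_NORMALIZATION = {
    "\u025d": "\u025c\u02de",  # ɝ -> ɜ followed by ˞ (modifier letter rhotic hook)
}


def normalize_grapheme(grapheme):
    """Apply NFD first, then the explicit character replacements."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    return "".join(EXTRA_NORMALIZATION.get(ch, ch) for ch in decomposed)


# NFD does not split U+025D, because Unicode defines no decomposition for it.
print(unicodedata.normalize("NFD", "\u025d") == "\u025d")  # True
# Only the explicit mapping yields the two-character sequence.
print(normalize_grapheme("\u025d") == "\u025c\u02de")      # True
```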
In general, we need to always be aware of where to fix problems. We have the following:
- `sources` (where you can easily fix a problem by putting a correct BIPA value in the left-most cell)
IMPORTANT: less rounded and more rounded as "roundedness" should be given the preference; "rounding" is a main feature of a sound and, per definitionem, can't be modified via a diacritic, unless you add FULL CHARACTERS WITH DIACRITICS in vowels.tsv! This is not up for discussion: there is a workaround, and since the character is duplicated, roundedness should be deleted. These lines are ignored in the code by now anyway.
Keeping in mind where a problem needs to be fixed will help in the future, and we'll have to adjust our labels accordingly. I won't lift a finger to solve problems like "ej" in the code, nor to handle triphthongs, but if any of you wants to adjust the code accordingly here, feel free to do so. I think, however, that it's more important to make an explicit list of accepted consonant combinations for clusters (listing all nasal+stop, stop+nasal, etc., whatever you want), as those are currently produced in an erratic fashion.
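As a rough illustration of what such an explicit list could look like, here is a sketch that accepts a two-consonant cluster only if its pattern of sound classes is whitelisted. The class labels, the tiny inventory, and the names `SOUND_CLASS`, `ACCEPTED_CLUSTERS`, and `is_accepted_cluster` are all hypothetical and do not reflect how pyclts currently builds clusters.

```python
# Hypothetical sketch: an explicit whitelist of consonant-cluster patterns.
SOUND_CLASS = {
    "m": "nasal", "n": "nasal", "ŋ": "nasal",
    "p": "stop", "b": "stop", "t": "stop",
    "d": "stop", "k": "stop", "g": "stop",
}

# Only patterns listed here are accepted; everything else is rejected.
ACCEPTED_CLUSTERS = {
    ("nasal", "stop"),
    ("stop", "nasal"),
}


def is_accepted_cluster(first, second):
    """Return True only if the pair of sound classes is explicitly listed."""
    try:
        pattern = (SOUND_CLASS[first], SOUND_CLASS[second])
    except KeyError:
        return False  # unknown sounds are never accepted implicitly
    return pattern in ACCEPTED_CLUSTERS


print(is_accepted_cluster("m", "b"))  # True
print(is_accepted_cluster("p", "t"))  # False
```

Rejecting anything not listed makes the behaviour predictable, at the cost of having to maintain the list by hand.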
I fully agree here. Triphthongs should probably only have two patterns: with a trailing schwa or with a central vocoid between approximants. For consonant clusters, I would only really accept sibilants+plosives+liquids, but I trust Cormac might convince me here. :wink:
No matter what, the priority should however be adding more normalizations and other transcription systems. Now that I know the code in more detail it should not take me too long to do a PR with my unified feature system, which would be my priority in terms of CLTS innovations.
This PR relates to most of the stuff discussed in https://github.com/cldf/clts/issues/121
In detail:
- Added `linguolabial` as a value for the feature `place` of consonants, as well as the most important sounds to the catalog. There are some issues for discussion here: per IPA, all linguolabial consonants need a diacritic (i.e., there is no linguolabial consonant with its own, diacritic-less representation), which in turn makes things a bit complex when setting an alias. As such, nothing like U+032B (combining arches) was implemented: the only diacritic for linguolabial place of articulation is the standard IPA U+033C (the seagull).
- […] (`t̟` and `ŋ˗`).
- […] `mˑ`.
- […] (`s̻` and `z̻`).
- `ł` (Polish letter) is now normalized to `ɬ` (voiceless alveolar lateral fricative).
- Added `ᶑ` (voiced retroflex implosive), `b̪` (voiced labio-dental stop), and `p̪` (voiceless labio-dental stop) to BIPA.
- Renamed `"labialized-velar"` to `"labio-velar"` and `"labialized-palatal"` to `"labio-palatal"` in CLTS; all the transcription data and systems were updated (using `sed` from the command line; I've checked many times and don't think there are any false positives or negatives in the replacements).

Things I didn't implement from the issue:
- […] (`ʌʶ`); while I believe there is room for them, I agree with @cormacanderson that this is questionable as a general feature, and we should discuss it in more detail; one potential source can be found here, but many more references turn up in the most basic Google search (mostly in descriptions of dialects, and I couldn't find a clear-cut case where it is phonemic).
- Changing `ɗ` from dental to alveolar (even though alveolar as a default place of articulation, requiring a diacritic for the dental, makes sense to me); we need to discuss this in more detail, especially considering that the transcription data we are linking to seem to use it as dental.
- Adding `ı` (dotless i, U+0131) as an alias of `ɯ` (U+026F): this is really a matter of Turkish orthography and not phonological transcription, and if necessary should be part of a Turkish orthographic profile.
- `rounding` and `roundedness` should not be different features, which would mean adding a continuum of `unrounded`, `less-rounded`, `rounded`, and `more-rounded` values. However, I didn't change that, as it would break some datasets such as Eurasian (which might be problematic anyway, with its "more-rounded unrounded" vowels which, from a quick inspection, likely come from parsing the diacritic for "more-rounded" as one for "rounded"; see the case of Bulgarian on their website), and it is something that should be investigated (for example, should we just take these as aliases for protrusion and compression, or endo- and exolabial?).

Many things from the issue need to be discussed, perhaps individually; among those:
- […] (`⁰²`)
- […] `ɛʲ`, and also long ones such as `oːʷ`)
- […] (`ɔʰ`)

As I said, most if not all of the other issues are related to individual transcription systems/data; I've added to CLTS/BIPA those that I found important and necessary, but the remaining ones should probably be kept in their specific contexts.