Preliminary version for X-SAMPA

tresoldi commented 6 years ago

As discussed at https://github.com/cldf/clts/issues/84

This is a preliminary version: a number of diacritics are missing, and the escaping of backslashes (which are unfortunately very common in X-SAMPA) still needs checking. Note that Wells (1995) does not seem to be fully compatible with what is reported in the Wikipedia article, which is essentially the de facto standard now. In any case, most of the examples on Wikipedia pass.

tresoldi commented 6 years ago

In [1]: from pyclts import TranscriptionSystem

In [2]: xsampa = TranscriptionSystem('xsampa')

In [3]: xsampa['u\\']
Out[3]: <pyclts.models.Vowel: rounded close central vowel>

LinguList commented 6 years ago

I think we can merge for the moment. In fact, I have been working a lot with a much more extended version of SAMPA, which handles tones and has some other useful diacritics. I feel inclined to add those at some point, but for the time being it is just incredibly useful to have X-SAMPA, also for demonstration purposes. I think, though I'll need to test this further, that by using the r'raw-string' feature in Python we should be able to use backslashes without problems (?).
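One caveat on the raw-string idea, sketched in plain Python (no pyclts calls are made): a raw string literal cannot end in a single backslash, so symbols such as u\ would still need the escaped form. The helper name `parses` is hypothetical and only serves the illustration.

```python
import ast

# Raw strings keep a backslash that is followed by another character.
assert r'r\`' == 'r\\`'  # three characters: r, backslash, backtick

# A symbol that ends in a backslash still needs the escaped spelling.
close_central_rounded = 'u\\'  # two characters: u and backslash
assert len(close_central_rounded) == 2


def parses(literal_source):
    """Return True if the given source text is a valid Python literal."""
    try:
        ast.literal_eval(literal_source)
        return True
    except SyntaxError:
        return False


# The source text r'u\' is rejected: the backslash keeps the quote
# from closing the raw string literal.
raw_trailing_ok = parses("r'u\\'")
# The escaped source text 'u\\' is accepted.
escaped_ok = parses("'u\\\\'")
```

So raw strings would help for backslashes inside a symbol, but not for the many X-SAMPA symbols that end in one.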

LinguList commented 6 years ago

Here's our problem (I think I knew it before, but I only realized it now): the whole rationale behind diacritics and non-diacritics no longer works. As a result, the sampa system cannot parse a sound like xsampa['t_h'], since it is not defined. This renders the aspiration diacritic useless. What needs to be done is to spell out all of those sounds explicitly, or alternatively to add a tweak to the code for transcription systems, but I'd be inclined not to do that.

Furthermore, I detected some problems when testing: normal sounds are not yet in the sampa-consonants (I just added a few), and more importantly: sibilancy is not marked, which means that sampa.translate('Z', bipa) gives an unknown translation, as Z is defined as a post-alveolar fricative, while it needs to be a post-alveolar sibilant fricative.
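The sibilancy problem can be illustrated with a toy model of feature-based translation. This is only a sketch under assumptions: the dictionaries, the feature names, and the `translate` function below are hypothetical and are not the pyclts data structures or API.

```python
# Hypothetical feature bundles; NOT the actual pyclts representation.
BIPA_BY_FEATURES = {
    frozenset({"voiced", "post-alveolar", "sibilant", "fricative", "consonant"}): "ʒ",
}
XSAMPA_FEATURES = {
    # Defined without "sibilant", as described in the comment above.
    "Z": frozenset({"voiced", "post-alveolar", "fricative", "consonant"}),
}


def translate(symbol, source=XSAMPA_FEATURES, target=BIPA_BY_FEATURES):
    """Look up a symbol's features in the target system (None if unknown)."""
    return target.get(source[symbol])


# No target sound has exactly these features, so the lookup fails.
before = translate("Z")

# Adding the missing feature makes the two definitions line up.
fixed = dict(XSAMPA_FEATURES, Z=XSAMPA_FEATURES["Z"] | {"sibilant"})
after = translate("Z", source=fixed)
```

In this toy model, a single missing feature is enough to make an otherwise well-defined sound untranslatable.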

I think, for the future (not for now; we just leave sampa as it is, but leave the issue open), we should go the crude way: spell out all of the most frequent sounds (2000 or so), automatically convert them to SAMPA using a conversion table, like the one I posted last time or the one in lingpy, and manually correct missing entries and instances that are wrong. SAMPA will have no diacritics, unless they don't contain a base symbol (like the _h, which confuses our regexes), but it will be quite powerful.
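A minimal sketch of that "crude way", assuming a character-by-character conversion table. The table contents and the function name `spell_out` are hypothetical; a real table (for example the one in lingpy) would be far larger and is not reproduced here.

```python
# Hypothetical IPA-to-X-SAMPA table with a handful of entries.
IPA_TO_XSAMPA = {
    "t": "t",
    "ʰ": "_h",
    "ʒ": "Z",
    "ʉ": "u\\",
}


def spell_out(ipa_sounds, table=IPA_TO_XSAMPA):
    """Convert each IPA sound to X-SAMPA; collect the ones needing manual work."""
    converted, missing = {}, []
    for sound in ipa_sounds:
        try:
            xsampa = "".join(table[char] for char in sound)
        except KeyError:
            # At least one character has no table entry: fix by hand later.
            missing.append(sound)
        else:
            converted[xsampa] = sound
    return converted, missing


converted, missing = spell_out(["t", "tʰ", "ʒ", "ʉ", "ɬ"])
```

Here `converted` holds fully spelled-out entries such as `t_h`, so no diacritic parsing is needed at lookup time, and `missing` is the list to correct manually.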

xrotwang commented 6 years ago

I second not adding any (more) tweaks to the TranscriptionSystem code. It is already complex, and I think readability and transparency are of the highest importance for this kind of code.

LinguList commented 6 years ago

Yes, I agree, and as I just mentioned in #84, I think we should make sampa either a specific function that converts from sampa to unicode, or use lingpy for the time being.
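Such a standalone function could look roughly like the sketch below, which uses greedy longest-match lookup against a spelled-out table. The table entries and the name `xsampa_to_ipa` are assumptions for illustration only; this is neither the lingpy implementation nor part of pyclts.

```python
# Hypothetical spelled-out X-SAMPA-to-IPA table with a few entries.
XSAMPA_TO_IPA = {
    "t": "t",
    "t_h": "tʰ",
    "Z": "ʒ",
    "S": "ʃ",
    "u\\": "ʉ",
    "a": "a",
}


def xsampa_to_ipa(text, table=XSAMPA_TO_IPA):
    """Convert an X-SAMPA string to IPA, always trying the longest entry first."""
    longest = max(len(key) for key in table)
    out, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No entry of any length matched at this position.
            raise ValueError(f"no X-SAMPA entry matches at position {i}: {text[i:]!r}")
    return "".join(out)


result = xsampa_to_ipa("t_hu\\Za")
```

Because the longest entry is tried first, `t_h` is matched as one sound rather than as `t` followed by an unparseable `_h`.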

tresoldi commented 6 years ago

I second that, too. When I noticed the issues with the backslashes, I started to study the TranscriptionSystem code, and it is indeed complex enough.

Listing the more frequent sounds to solve the X-SAMPA issues will have the good side effect of producing a long list of all expected sounds, which would be useful both for testing and for future research (the closest thing we have at the moment in this sense is Phoible).

cldf-clts / clts-legacy

Preliminary version for X-SAMPA #98