cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

current feature system #66

Closed LinguList closed 6 years ago

LinguList commented 6 years ago
sound_class feature value diacritic
consonant articulation strong ◌͈
consonant aspiration aspirated ◌ʰ
consonant aspiration aspirated.sibilancy:sibilant
consonant breathiness breathy ◌ʱ
consonant creakiness creaky ◌̰
consonant duration long ◌ː
consonant ejection ejective ◌’
consonant glottalization glottalized ◌ˀ
consonant labialization labialized ◌ʷ
consonant laminality apical ◌̺
consonant laminality laminal ◌̻
consonant laterality lateral
consonant manner affricate
consonant manner approximant
consonant manner click
consonant manner fricative
consonant manner implosive
consonant manner nasal
consonant manner stop
consonant manner tap
consonant manner trill
consonant nasalization nasalized ◌̃
consonant palatalization labio-palatalized ◌ᶣ
consonant palatalization palatalized ◌ʲ
consonant pharyngealization pharyngealized ◌ˤ
consonant phonation voiced
consonant phonation voiceless
consonant place alveolar
consonant place alveolo-palatal
consonant place bilabial
consonant place dental
consonant place epiglottal
consonant place glottal
consonant place labial
consonant place labialized-palatal
consonant place labialized-velar
consonant place labio-dental
consonant place palatal
consonant place palatal-velar
consonant place pharyngeal
consonant place post-alveolar
consonant place retroflex
consonant place uvular
consonant place velar
consonant preceding postoralized
consonant preceding pre-aspirated ʰ◌
consonant preceding pre-glottalized ˀ◌
consonant preceding pre-labialized ʷ◌
consonant preceding pre-nasalized ⁿ◌
consonant preceding pre-palatalized ʲ◌
consonant release unreleased ◌̚
consonant release with-lateral-release ◌ˡ
consonant release with-mid-central-vowel-release ◌ᵊ
consonant release with-nasal-release ◌ⁿ
consonant sibilancy sibilant
consonant stress primary-stress ˈ◌
consonant stress secondary-stress ˌ◌
consonant syllabicity syllabic ◌̩
consonant velarization velarized ◌ˠ
consonant voicing devoiced
consonant voicing revoiced ◌̬
vowel advancement advanced ◌̟
vowel articulation strong ◌͈
vowel breathiness breathy ◌̤
vowel centrality back
vowel centrality central
vowel centrality centralized ◌̈
vowel centrality front
vowel centrality mid-centralized ◌̽
vowel centrality near-back
vowel centrality near-front
vowel creakiness creaky ◌̰
vowel duration long ◌ː
vowel duration mid-long ◌ˑ
vowel duration ultra-long
vowel duration ultra-short ◌̆
vowel frication with-frication
vowel glottalization glottalized ◌ˀ
vowel height close
vowel height close-mid
vowel height mid
vowel height near-close
vowel height near-open
vowel height nearly-open
vowel height open
vowel height open-mid
vowel nasalization nasalized ◌̃
vowel pharyngealization pharyngealized ◌ˤ
vowel raising lowered ◌̞
vowel raising raised ◌̝
vowel retraction retracted ◌̠
vowel rhotacization rhotacized ◌˞
vowel roundedness rounded ◌̹
vowel roundedness unrounded ◌̜
vowel rounding less-rounded ◌̜
vowel rounding more-rounded ◌̹
vowel stress primary-stress ˈ◌
vowel stress secondary-stress ˌ◌
vowel syllabicity non-syllabic ◌̯
vowel tone with_downstep ◌↓
vowel tone with_extra-high_tone ◌̋
vowel tone with_extra_low_tone ◌̏
vowel tone with_falling_tone ◌̂
vowel tone with_global_fall ◌↘
vowel tone with_global_rise ◌↗
vowel tone with_high_tone ◌́
vowel tone with_low_tone ◌̀
vowel tone with_mid_tone ◌̄
vowel tone with_rising_tone ◌̌
vowel tone with_upstep ◌↑
vowel tongue_root advanced-tongue-root ◌̘
vowel tongue_root retracted-tongue-root ◌̙
vowel velarization velarized ◌ˠ
vowel voicing devoiced ◌̥
LinguList commented 6 years ago

Here's the same thing in spreadsheet:

features.tsv.txt
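
As a quick illustration of how such a four-column TSV can be consumed, here is a minimal sketch that groups the values by (sound_class, feature). The sample string and the grouping function are illustrative, not part of pyclts:

```python
# Sketch: load a feature table in the four-column TSV layout shown above
# and group values by (sound_class, feature). SAMPLE and group_features
# are hypothetical helpers, not pyclts API.
import csv
from collections import defaultdict
from io import StringIO

SAMPLE = """sound_class\tfeature\tvalue\tdiacritic
consonant\tmanner\tstop\t
consonant\tmanner\tnasal\t
vowel\theight\topen\t
"""

def group_features(fileobj):
    reader = csv.DictReader(fileobj, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[(row["sound_class"], row["feature"])].append(row["value"])
    return dict(grouped)

grouped = group_features(StringIO(SAMPLE))
print(grouped[("consonant", "manner")])  # ['stop', 'nasal']
```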

LinguList commented 6 years ago

I think this gives us a good basis for a fruitful discussion about what should be changed, and I am also considering using it to automatically check whether our data is correct.

LinguList commented 6 years ago

Here's the code to extract the features:

In [39]: from pyclts import *

In [40]: bipa = TranscriptionSystem()

In [41]: table = []

In [42]: for k, v in bipa._features['consonant'].items():
    ...:     row = ['consonant', bipa._feature_values[k], k, v]
    ...:     table += [row]

In [43]: for k, v in bipa._features['vowel'].items():
    ...:     row = ['vowel', bipa._feature_values[k], k, v]
    ...:     table += [row]

In [44]: for s in bipa._sounds:
    ...:     sound = bipa[s]
    ...:     if sound.type in ('tone', 'marker'):
    ...:         continue
    ...:     for f in sound._features():
    ...:         # sound class, feature name, feature value, diacritic (if any)
    ...:         table += [[sound.type, bipa._feature_values[f],
    ...:                    getattr(sound, bipa._feature_values[f]),
    ...:                    bipa._features[sound.type].get(f, '')]]

In [45]: table = sorted(set([tuple(x) for x in table if not None in x]))

In [46]: table = [['sound_class', 'feature', 'value', 'diacritic']] + table

In [47]: with open('features.tsv', 'w') as f:
    ...:     for line in table:
    ...:         f.write('\t'.join(line)+'\n')

LinguList commented 6 years ago

Sorry, the file is not good; use this Excel file instead if you want to have a closer look at the features:

features.xlsx

tresoldi commented 6 years ago

It seems OK. I might have worked some things out differently, but that is clearly a matter of preference (in fact, your system looks more neutral than what comes to my mind).

One thing I'm not sure I follow is the treatment of non-pulmonic consonants. Nasal clicks, for example, would be defined as "nasalized clicks"? If so, this seems inconsistent with the pulmonic consonants, where "stop" and "nasal" are different manners.

LinguList commented 6 years ago

Good point. I naively assumed that I could label them as nasalized, but looking back at this chart, which @afehn recommended:

nakagawa-2013-khoisan-phonotactics.pdf

(by Nakagawa 2013), I see that this is a nasal cluster. This is easy to handle: we can just discard the extra symbols and allow clusters consisting of "nasal + click" (that's the beauty of the generative system). I'll open an issue.
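
The cluster idea can be sketched with a toy segmenter: given a small inventory of known base sounds, split a grapheme cluster like "nasal + click" by greedy longest match. The inventory and the function are hypothetical, not the actual CLTS implementation:

```python
# Sketch: treat a nasal click as a cluster "nasal + click" rather than a
# nasalized click. The inventory and the greedy longest-match segmentation
# are illustrative assumptions, not pyclts internals.
INVENTORY = {
    "ŋ": "voiced velar nasal consonant",
    "ǃ": "voiceless alveolar click consonant",
    "ʘ": "voiceless bilabial click consonant",
}

def segment(cluster):
    """Greedily split a grapheme cluster into known base sounds."""
    sounds, i = [], 0
    while i < len(cluster):
        # try the longest remaining match first
        for j in range(len(cluster), i, -1):
            if cluster[i:j] in INVENTORY:
                sounds.append(INVENTORY[cluster[i:j]])
                i = j
                break
        else:
            raise ValueError(f"unknown grapheme at {cluster[i:]!r}")
    return sounds

print(segment("ŋǃ"))
# ['voiced velar nasal consonant', 'voiceless alveolar click consonant']
```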

tresoldi commented 6 years ago

Two more observations:

LinguList commented 6 years ago

This is easy to answer: because we want to model the grapheme system of IPA and be able to parse the sounds.

Since linguists use their notational freedom rather liberally and produce a lot of inconsistent data, this at least allows us to describe what they annotate.

Consider this:

In [1]: from pyclts import *

In [2]: bipa = TranscriptionSystem()

In [5]: bipa['breathy voiceless bilabial stop consonant'].s
Out[5]: 'pʱ'

In fact, there are cases in Hmong-Mien languages where grammars insist that there is an unvoiced sound with a breathy release. If we insisted that phonation has the values "voiced", "unvoiced", "breathy-voiced", and "creaky-voiced", we would not be able to capture these differences, and we would have to spell out, for each base sound plus creaky/breathy combination, that this sound exists and is legitimate (since the algorithm cannot overwrite features, which is a design principle).
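
The compositional point can be sketched in a few lines: if breathiness is an independent dimension rather than a phonation value, "breathy" combines freely with "voiceless" and no combination needs to be pre-listed. The dataclass and the name order below are illustrative assumptions, not the pyclts internals:

```python
# Sketch: independent feature dimensions compose into a sound name without
# enumerating every combination. NAME_ORDER and Consonant are hypothetical.
from dataclasses import dataclass
from typing import Optional

# order in which feature values appear in a sound's name
NAME_ORDER = ("breathiness", "phonation", "place", "manner")

@dataclass
class Consonant:
    phonation: str
    place: str
    manner: str
    breathiness: Optional[str] = None

    @property
    def name(self):
        parts = [getattr(self, f) for f in NAME_ORDER]
        return " ".join(p for p in parts if p) + " consonant"

sound = Consonant("voiceless", "bilabial", "stop", breathiness="breathy")
print(sound.name)  # breathy voiceless bilabial stop consonant
```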

Furthermore, consider the name:

In [7]: bipa['breathy voiced bilabial stop consonant'].name
Out[7]: 'breathy voiced bilabial stop consonant'

This comes close to the traditional notion of "breathy-voiced bilabial stop", without the dash. We can easily find all cases of breathiness, etc., by set comparison. In fact, in order to break those things down to ASJP sound classes, where breathiness is switched off, we can parse so-far-unknown sounds by their base features and reduce them to the correct symbol:

In [8]: asjp = TranscriptionData('asjp')

In [9]: asjp['breathy voiceless bilabial stop consonant']
Out[9]: 'p'
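
The reduction step behind this can be sketched as a lookup on base features only, with secondary features such as breathiness ignored. The mapping table and helper below are a toy stand-in, not the real ASJP transcription data:

```python
# Sketch: reduce an unknown sound to an ASJP symbol by its base features
# (phonation, place, manner), discarding secondary features. ASJP and
# to_asjp are hypothetical illustrations.
ASJP = {
    ("voiceless", "bilabial", "stop"): "p",
    ("voiced", "bilabial", "stop"): "b",
}

def to_asjp(features):
    base = (features["phonation"], features["place"], features["manner"])
    return ASJP.get(base)

breathy_p = {"phonation": "voiceless", "place": "bilabial",
             "manner": "stop", "breathiness": "breathy"}
print(to_asjp(breathy_p))  # p
```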

So recall that this is a practical approach, which does not really care how economical, pleasant to the eye, or reasonable a feature system is. Instead, it tries to infer, from the graphemes fed to the algorithm, a rendering of what people write. This is the first step towards comparability. If we wanted to teach people how to do phonetic transcription, or to impose a feature system that we think is better than all the rest, we would use something else; but "bipa" starts from the symbols and tries to translate them literally into the features invoked by the system. All additional labor can be done later.

"bipa" corrects errors via normalization, by resolving lookalikes and through the alias system (breathiness, for example, can also be expressed by the two-dots-below diacritic, so we choose one version here to normalise), but it does not care whether the sounds people propose are possible or meaningful. Based on my experience with language data, I consider this the only way to proceed: the first step is rendering things comparable, and it is a considerable step. We first need to be able to handle the data; once we are at that stage, we can think about putting it to some use.
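
Lookalike normalization of this kind can be sketched as a simple character mapping from visually similar but distinct Unicode code points onto the canonical IPA characters. The table below is illustrative; the real CLTS alias data is much larger:

```python
# Sketch: normalize lookalike characters onto canonical IPA code points.
# LOOKALIKES is a tiny illustrative table, not the actual CLTS data.
LOOKALIKES = {
    "g": "ɡ",   # U+0067 LATIN SMALL LETTER G  -> U+0261 LATIN SMALL LETTER SCRIPT G
    ":": "ː",   # U+003A COLON                 -> U+02D0 MODIFIER LETTER TRIANGULAR COLON
    "!": "ǃ",   # U+0021 EXCLAMATION MARK      -> U+01C3 LATIN LETTER RETROFLEX CLICK
}

def normalize(segment):
    return "".join(LOOKALIKES.get(ch, ch) for ch in segment)

print(normalize("ga:"))  # ɡaː
```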

tresoldi commented 6 years ago

Thank you for the explanation, it makes total sense now! I'd even suggest you incorporate parts of it in a general "What is the idea behind CLTS" documentation.

The funny thing is that my own system is now clearer to me: as a "laboratory experiment", it is much more "essentialist" in its approach to phonology (if you really must use common labels, it is closer to acoustics), and maybe it could be used in tandem in certain experiments (if it proves usable, that is).

LinguList commented 6 years ago

Thanks for understanding. For our discussion here, I think it is important to keep this in mind: before we can make our own feature systems, or encode data in systems based on other people's feature systems, we must be able to handle as much diverse data as possible. We can't do this by superimposing our favorite feature system without looking into the compositionality of graphemes, as that would mean coding a huge number of sounds manually, adding features, even before we start to check whether they are reflected in our data. CLTS-BIPA, on the other hand, is able to give names to sounds and to correct obvious Unicode errors through normalization, and despite its currently rather small base inventory of pre-defined sounds and features, it can already generate a huge number of sounds.

So everybody is invited to contribute their feature system, like Ladefoged's or the system by @tresoldi, by providing a list, as indicated, with many different concrete sound segments and their feature specifications. We can then see how to automatically link it via CLTS, and if this is successful, we can add it as transcription data to the database.

LinguList commented 6 years ago

To allow for easy and continual re-generation of the feature system, I'll add a script features.py to the cookbook collection, and the feature system will be presented as JSON in transcriptionssystems/.
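
As a sketch of what such a regeneration script might emit, the flat table above could be nested as {sound_class: {feature: {value: diacritic}}}. The rows, structure, and file layout below are assumptions, not the final format:

```python
# Sketch: nest flat (sound_class, feature, value, diacritic) rows as JSON.
# ROWS and the output structure are illustrative assumptions.
import json
from collections import defaultdict

ROWS = [
    ("consonant", "manner", "stop", ""),
    ("consonant", "aspiration", "aspirated", "◌ʰ"),
    ("vowel", "height", "open", ""),
]

def to_json(rows):
    nested = defaultdict(lambda: defaultdict(dict))
    for sound_class, feature, value, diacritic in rows:
        nested[sound_class][feature][value] = diacritic
    return json.dumps(nested, ensure_ascii=False, indent=2)

doc = json.loads(to_json(ROWS))
print(doc["consonant"]["aspiration"])  # {'aspirated': '◌ʰ'}
```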