cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus
https://doreco.info/

add script for sound classes #16

Closed: FredericBlum closed this issue 1 year ago

FredericBlum commented 1 year ago

@xrotwang I wrote a very basic and ugly script that tries to extract the features from the orthography profile. There are two problems:

a) Many of the sounds do not have their features specified in the BIPA name. How should I access CLTS instead to retrieve the full information? I've struggled to find my way through the pyclts functions.

b) The code takes up a lot of space. Are there any easy ways to improve this?

LinguList commented 1 year ago

The normal way to access CLTS is:

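from pyclts import CLTS

clts = CLTS()  # or CLTS(path2clts-directory)
bipa = clts.bipa  # read bipa from the instance, not from the CLTS class
sound = bipa[yoursound]  # look up a sound by its grapheme
sound.featureset  # the features of that sound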

You can test on the cmd prompt.

LinguList commented 1 year ago

Or what do you specifically need to find here?

xrotwang commented 1 year ago

@Tarotis Looking at your code, it looks like you do infer all relevant information from the BIPA names - so what else would you need? The BIPA names will be available in the SQLite data, see

sqlite> select cldf_cltsReference from parametertable limit 5;
unrounded_open_front_vowel
unrounded_close_front_vowel
voiced_alveolar_nasal_consonant
rounded_close_back_vowel
voiced_bilabial_nasal_consonant

So I can reformulate your code as an SQL "function".
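As a rough illustration of what such an SQL "function" could look like, here is a minimal sketch using Python's built-in sqlite3 module. Only the table and column names (parametertable, cldf_cltsReference) and the sample values come from the query output above; the in-memory database and the helper name has_feature are hypothetical stand-ins, not the function that ended up in the repository.

```python
import sqlite3

# Hypothetical stand-in for the dataset's SQLite file. Only the table name,
# the column name and the sample values are taken from the thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parametertable (cldf_cltsReference TEXT)")
conn.executemany(
    "INSERT INTO parametertable VALUES (?)",
    [
        ("unrounded_open_front_vowel",),
        ("unrounded_close_front_vowel",),
        ("voiced_alveolar_nasal_consonant",),
        ("rounded_close_back_vowel",),
        ("voiced_bilabial_nasal_consonant",),
        (None,),  # a sound without a CLTS name
    ],
)


def has_feature(name, feature):
    """Return 1 if `feature` is one of the underscore-separated parts of a BIPA name."""
    if name is None:
        return 0
    return int(feature in name.split("_"))


# Register the Python helper so it can be called from SQL (name is hypothetical).
conn.create_function("has_feature", 2, has_feature)

nasals = [
    row[0]
    for row in conn.execute(
        "SELECT cldf_cltsReference FROM parametertable "
        "WHERE has_feature(cldf_cltsReference, 'nasal') "
        "ORDER BY cldf_cltsReference"
    )
]
print(nasals)
```

Matching on whole underscore-separated parts rather than on substrings avoids accidental hits, for example a query for "nasal" matching a name that contains "nasalized".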

xrotwang commented 1 year ago

@Tarotis The features that may appear in BIPA names are specified (in the Value column) here: https://github.com/cldf-clts/clts/blob/master/data/features.tsv

"geminate" and "sonorant" don't show up there, but I'd guess this could be inferred from other features? @LinguList ?

FredericBlum commented 1 year ago

The problem is that some sounds do not seem to have a name, here from the updated print statement:

(doreco) blum@lingn45 etc % python sound_classes.py
Sound: ['Grapheme', 'IPA', 'Frequency']         Class: None
Sound: ['ts`_h', 'tsʰ˞', '775']         Class: None
Sound: ['|\\|\\~', 'ǁ̃ ', '295']         Class: None
Sound: ['|\\~', 'ǀ̃ ', '286']    Class: None
Sound: ['|\\_v', '', '169']     Class: None
Sound: ['r\\=`', 'ɹ˞̩', '157']   Class: None
Sound: ['kx_>', '', '147']      Class: None
Sound: ['|\\~_h', 'ǀʰ̃ ', '131']         Class: None
Sound: ['ts`', 'ts˞', '131']    Class: None
Sound: ['|\\|\\_v', '', '126']  Class: None
Sound: ['o~_?\\', 'oˤ̃ ', '84']  Class: None
Sound: ['!\\~', 'ǃ̃ ', '82']     Class: None
Sound: ['o}:', '', '77']        Class: None
Sound: ['a~_?\\', 'aˤ̃ ', '71']  Class: None
Sound: ['!\\qX_>', '', '61']    Class: None
Sound: ['|\\qX_>', '', '59']    Class: None
Sound: ['!\\~_h', 'ǃʰ̃ ', '54']  Class: None
Sound: ['m_p', '', '43']        Class: None
Sound: ['!\\_v', '', '32']      Class: None
Sound: ['!\\q_h', '', '31']     Class: None
Sound: ['|\\|\\qX_>', '', '30']         Class: None
Sound: ['t:S_w', '', '30']      Class: None
Sound: ['=\\_v', '', '26']      Class: None
Sound: ['d_j_<', '', '25']      Class: None
Sound: ['ei:', '', '19']        Class: None
Sound: ['=\\q_h', '', '13']     Class: None
Sound: ['|\\q_h', '', '11']     Class: None
Sound: ['t_d_w', 'tʷ̪', '10']    Class: None
Sound: ['=\\~_h', 'ǂʰ̃ ', '10']  Class: None
Sound: ['cx', 'cx', '8']        Class: None
Sound: ['=\\~', 'ǂ̃ ', '8']      Class: None
Sound: ['n_jn', '', '6']        Class: None
Sound: ['Ai:', '', '6']         Class: None
Sound: ['|\\|\\q_h', '', '5']   Class: None
Sound: ['t:`', 't˞ː', '5']      Class: None
Sound: ['b_j_<', '', '4']       Class: None
Sound: ['@:e', '', '4']         Class: None
Sound: ['=\\qX_>', '', '3']     Class: None
Sound: ['d_w_<', '', '2']       Class: None
Sound: ['ks', 'ks', '1']        Class: None
Sound: ['<on>a>', '', '1']      Class: None

How do we proceed with these?

xrotwang commented 1 year ago

Almost all the sort-of frequent ones are clicks, so I'd lean towards kicking them out - just like vowels?

FredericBlum commented 1 year ago

Yes, we'll only work on pulmonic consonants - in my study, at least.
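Restricting the data to pulmonic consonants can be sketched as a filter on the BIPA names. This is only a sketch under assumptions: the labels in NON_PULMONIC are my guess at how clicks, implosives and ejectives are marked in the names and should be checked against features.tsv, the click name in the sample list is a hypothetical example, and is_pulmonic_consonant is an illustrative helper, not part of the dataset.

```python
# Labels treated as non-pulmonic here. This set is an assumption and should
# be checked against the Value column of CLTS's features.tsv.
NON_PULMONIC = {"click", "implosive", "ejective"}


def is_pulmonic_consonant(clts_name):
    """Keep names that end in 'consonant' and carry no non-pulmonic label."""
    if not clts_name:  # sounds without a CLTS name are dropped
        return False
    parts = clts_name.split("_")
    return parts[-1] == "consonant" and not NON_PULMONIC.intersection(parts)


names = [
    "unrounded_open_front_vowel",
    "voiced_alveolar_nasal_consonant",
    "voiceless_alveolar_click_consonant",  # hypothetical example name
    None,
]
kept = [name for name in names if is_pulmonic_consonant(name)]
print(kept)
```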

LinguList commented 1 year ago

If they are not accepted by BIPA, they are not presented well. How did they pass your orthoprofile checks then? They look all very strange and faulty to me. Sure you pass the right sounds here?

LinguList commented 1 year ago

Sorry, @xrotwang, yes. Geminates are long consonants. Sonorants are a bunch of different consonants that @Tarotis can specify by inferring their class and overlap, using some regex. But you'd have to check the sounds on clts.clld.org, please, and give us an example.

Note that CLTS is but ONE feature system and does not provide ALL possible classifications. But it is pretty standard.

If you provide for each of your searches two to three examples, I may even be able to help directly, but please check also their common names and check if you cannot just infer "sonorant" from the info our standardized names give you.
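Inferring "geminate" and "sonorant" from the standardized names could look roughly like the sketch below. It assumes underscore-separated names as in the SQLite data, treats "long" consonants as geminates as described above, and uses a set of sonorant manners (nasal, approximant, trill, tap) that is my own assumption rather than a CLTS definition. The names starting with "long_" are assumed examples, and infer_classes is a hypothetical helper.

```python
# Manners counted as sonorant in this sketch. This is an assumption about
# the intended classification, not something CLTS defines.
SONORANT_MANNERS = {"nasal", "approximant", "trill", "tap"}


def infer_classes(clts_name):
    """Derive extra classes for a consonant from the parts of its BIPA name."""
    parts = set(clts_name.split("_"))
    classes = set()
    if "consonant" not in parts:
        return classes  # only consonants are classified here
    if parts & SONORANT_MANNERS:
        classes.add("sonorant")
    if "long" in parts:  # geminates are long consonants
        classes.add("geminate")
    return classes


examples = [
    "voiced_alveolar_nasal_consonant",
    "long_voiceless_alveolar_stop_consonant",  # assumed example name
    "long_voiced_bilabial_nasal_consonant",  # assumed example name
    "unrounded_open_front_vowel",
]
for name in examples:
    print(name, sorted(infer_classes(name)))
```

Working on whole name parts instead of a regex over the full string keeps "long" from matching other duration labels that merely contain it.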

xrotwang commented 1 year ago

> If they are not accepted by BIPA, they are not presented well. How did they pass your orthoprofile checks then? They look all very strange and faulty to me. Sure you pass the right sounds here?

The orthography profile has XSAMPA graphemes for all segments that appear in the data. And yes, they look strange and are potentially faulty - but also mostly very rare :)
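To see which profile entries need a manual check, one could list the graphemes that have no IPA value, most frequent first. The sketch below assumes a tab-separated profile with the columns Grapheme, IPA and Frequency (the column names appear in the printed output above; the tab-separated layout is an assumption), and it uses a few rows from that output as inline sample data instead of reading the real file.

```python
import csv
import io

# A few rows copied from the output above, joined into a tab-separated text.
# The real profile would be read from a file instead.
rows = [
    ("Grapheme", "IPA", "Frequency"),
    ("kx_>", "", "147"),
    ("m_p", "", "43"),
    ("ks", "ks", "1"),
    ("|\\_v", "", "169"),
]
profile_text = "\n".join("\t".join(row) for row in rows)

# DictReader consumes the header line, so it is not mistaken for a sound.
# QUOTE_NONE keeps quote characters inside graphemes untouched.
reader = csv.DictReader(
    io.StringIO(profile_text), delimiter="\t", quoting=csv.QUOTE_NONE
)
missing = sorted(
    (row for row in reader if not row["IPA"].strip()),
    key=lambda row: int(row["Frequency"]),
    reverse=True,
)
missing_graphemes = [row["Grapheme"] for row in missing]
print(missing_graphemes)
```

Reading the file with a header-aware reader also avoids the first line of the output above, where the header row itself was reported as a sound without a class.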

LinguList commented 1 year ago

Ah, I see. Should I give them a proper check?

LinguList commented 1 year ago

E.g. Sound: ['ks', 'ks', '1'] Class: None is k^s, k + superscript s, which we have.

xrotwang commented 1 year ago

> Ah, I see. Should I give them a proper check?

Yes, it's not that many.

xrotwang commented 1 year ago

Can be done via SQL, see https://github.com/cldf-datasets/doreco/blob/main/USAGE.md#filtering-phones-based-on-features