cldf-clts / clts

Cross-Linguistic Transcription Systems

https://clts.clld.org

V13a #39

Closed LinguList closed 3 years ago

LinguList commented 3 years ago

DATA              STATS    PERC
----------------  -------  --------------------
Unique graphemes  12384
different sounds  8754
singletons        8802
multiples         3582
consonants        5512     0.6296550148503541
vowels            1844     0.21064656157185288
diphthongs        707      0.0807630797349783
clusters          559      0.06385652273246516
tones             132      0.015078821110349555
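
The PERC column appears to be each sound class divided by the number of different sounds (8754). The five class counts add up to that number exactly, and singletons plus multiples add up to the unique graphemes. A minimal sketch that recomputes the shares from the figures above follows; the names `counts` and `shares` are illustrative and not taken from the CLTS code base.

```python
# Figures copied from the statistics table above.
unique_graphemes = 12384
different_sounds = 8754
singletons = 8802
multiples = 3582

counts = {
    "consonants": 5512,
    "vowels": 1844,
    "diphthongs": 707,
    "clusters": 559,
    "tones": 132,
}


def shares(class_counts, total):
    """Return each class count as a fraction of the given total."""
    return {name: n / total for name, n in class_counts.items()}


perc = shares(counts, different_sounds)

for name, value in perc.items():
    print(f"{name}\t{counts[name]}\t{value}")
```
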

LinguList commented 3 years ago

id            valid    total      percent
------------  -------  -------  ---------
apics         177      177           1.00
bdpa          1328     1466          0.91
bdproto       734      794           0.92
beijingdaxue  124      124           1.00
chomsky       45       45            1.00
diachronica   552      652           0.85
eurasian      1347     1562          0.86
jipa          894      967           0.92
lapsyd        696      795           0.88
multimedia    132      138           0.96
nidaba        1864     1936          0.96
panphon       6219     6334          0.98
pbase         810      1068          0.76
phoible       2574     3183          0.81
powoco        369      378           0.98
ruhlen        434      701           0.62
saphon        343      357           0.96
segbo         215      219           0.98
wiki          166      184           0.90
18                                   0.90
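
The percent column is consistent with `valid / total` rounded to two decimals for each dataset. The sketch below re-derives it from the rows above as a sanity check; it is not the script that produced the table, and the names `ROWS` and `coverage` are hypothetical.

```python
# (id, valid, total, reported percent) copied from the table above.
ROWS = [
    ("apics", 177, 177, "1.00"),
    ("bdpa", 1328, 1466, "0.91"),
    ("bdproto", 734, 794, "0.92"),
    ("beijingdaxue", 124, 124, "1.00"),
    ("chomsky", 45, 45, "1.00"),
    ("diachronica", 552, 652, "0.85"),
    ("eurasian", 1347, 1562, "0.86"),
    ("jipa", 894, 967, "0.92"),
    ("lapsyd", 696, 795, "0.88"),
    ("multimedia", 132, 138, "0.96"),
    ("nidaba", 1864, 1936, "0.96"),
    ("panphon", 6219, 6334, "0.98"),
    ("pbase", 810, 1068, "0.76"),
    ("phoible", 2574, 3183, "0.81"),
    ("powoco", 369, 378, "0.98"),
    ("ruhlen", 434, 701, "0.62"),
    ("saphon", 343, 357, "0.96"),
    ("segbo", 215, 219, "0.98"),
    ("wiki", 166, 184, "0.90"),
]


def coverage(valid, total):
    """Share of a dataset's graphemes that count as valid, as a 2-decimal string."""
    return f"{valid / total:.2f}"


# Collect every row whose recomputed value differs from the reported one.
mismatches = [
    (name, reported, coverage(valid, total))
    for name, valid, total, reported in ROWS
    if coverage(valid, total) != reported
]
print(mismatches)
```

Sorting the rows by this ratio is a quick way to see which mappings need the most refinement; here ruhlen and pbase have the lowest coverage.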

LinguList commented 3 years ago

@tresoldi, @cormacanderson, I added bdproto, segbo, and saphon, which gives us 18 datasets now, and I used our workflow to correct these marginal datasets. Please have a look once you find time, as this shows how the workflow works. The important files are the ones called "graphemes.tsv" in the folder sources.
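
For anyone reviewing, a minimal sketch of how to locate these files and count their rows. It assumes (this is not confirmed in the thread) that each dataset has its own subfolder under `sources`, and that every `graphemes.tsv` is tab-separated with one header row. The function name `count_grapheme_rows` is hypothetical.

```python
import csv
from pathlib import Path


def count_grapheme_rows(sources_dir):
    """Map each dataset folder name to the number of data rows in its graphemes.tsv.

    Assumes a layout like sources/<dataset>/graphemes.tsv with a header row.
    """
    counts = {}
    for path in sorted(Path(sources_dir).rglob("graphemes.tsv")):
        with path.open(encoding="utf-8", newline="") as handle:
            # QUOTE_NONE keeps quote-like characters in transcriptions literal.
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            next(reader, None)  # skip the assumed header row
            counts[path.parent.name] = sum(1 for row in reader if row)
    return counts
```

Calling `count_grapheme_rows("sources")` from the repository root would then give a per-dataset row count to compare against the totals in the table.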

tresoldi commented 3 years ago

I think we can merge and keep refining the mapping later, correct?

LinguList commented 3 years ago

Yes, just merge. You can also merge the Python code. And you could have a look at adding more datasets (there are still issues, which you filed, and I prepared even more datasets this morning), and at refining existing ones, e.g., jipa, which I just made as well.