LinguList closed 3 years ago
DATA | STATS | PERC |
---|---|---|
Unique graphemes | 12384 | |
different sounds | 8754 | |
singletons | 8802 | |
multiples | 3582 | |
consonants | 5512 | 0.6296550148503541 |
vowels | 1844 | 0.21064656157185288 |
diphthongs | 707 | 0.0807630797349783 |
clusters | 559 | 0.06385652273246516 |
tones | 132 | 0.015078821110349555 |
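
The table does not say what the PERC column is relative to, but the values match each sound-class count divided by the "different sounds" total (8754). A minimal sketch of that arithmetic, with the counts copied from the table and the variable names chosen here for illustration only:

```python
# Counts copied from the table above; the names are illustrative only.
unique_graphemes = 12384
singletons = 8802
multiples = 3582
total_sounds = 8754  # the "different sounds" row

sound_classes = {
    "consonants": 5512,
    "vowels": 1844,
    "diphthongs": 707,
    "clusters": 559,
    "tones": 132,
}

# Assumption: PERC = class count / number of different sounds.
shares = {name: count / total_sounds for name, count in sound_classes.items()}

for name, share in shares.items():
    print(f"{name:<12}{share:.4f}")
```

The same numbers also show that singletons plus multiples add up to the unique graphemes, and that the five sound classes add up to the different sounds.
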
id | valid | total | percent |
---|---|---|---|
apics | 177 | 177 | 1.00 |
bdpa | 1328 | 1466 | 0.91 |
bdproto | 734 | 794 | 0.92 |
beijingdaxue | 124 | 124 | 1.00 |
chomsky | 45 | 45 | 1.00 |
diachronica | 552 | 652 | 0.85 |
eurasian | 1347 | 1562 | 0.86 |
jipa | 894 | 967 | 0.92 |
lapsyd | 696 | 795 | 0.88 |
multimedia | 132 | 138 | 0.96 |
nidaba | 1864 | 1936 | 0.96 |
panphon | 6219 | 6334 | 0.98 |
pbase | 810 | 1068 | 0.76 |
phoible | 2574 | 3183 | 0.81 |
powoco | 369 | 378 | 0.98 |
ruhlen | 434 | 701 | 0.62 |
saphon | 343 | 357 | 0.96 |
segbo | 215 | 219 | 0.98 |
wiki | 166 | 184 | 0.90 |
18 | | | 0.90 |
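
The percent column appears to be valid divided by total for each dataset. A small sketch that recomputes it from the figures above; the thread does not say how the final 0.90 was derived, so the pooled ratio below is only one reading that happens to round to the same value:

```python
# (id, valid, total) copied from the table above.
rows = [
    ("apics", 177, 177), ("bdpa", 1328, 1466), ("bdproto", 734, 794),
    ("beijingdaxue", 124, 124), ("chomsky", 45, 45),
    ("diachronica", 552, 652), ("eurasian", 1347, 1562),
    ("jipa", 894, 967), ("lapsyd", 696, 795), ("multimedia", 132, 138),
    ("nidaba", 1864, 1936), ("panphon", 6219, 6334), ("pbase", 810, 1068),
    ("phoible", 2574, 3183), ("powoco", 369, 378), ("ruhlen", 434, 701),
    ("saphon", 343, 357), ("segbo", 215, 219), ("wiki", 166, 184),
]

# Per-dataset coverage: share of graphemes that could be mapped.
coverage = {name: valid / total for name, valid, total in rows}

# Assumption: the summary value is all valid graphemes over all graphemes.
pooled = sum(valid for _, valid, _ in rows) / sum(total for _, _, total in rows)

# List the datasets from lowest to highest coverage.
for name, share in sorted(coverage.items(), key=lambda item: item[1]):
    print(f"{name:<14}{share:.2f}")
print(f"{'pooled':<14}{pooled:.2f}")
```

Sorting by coverage puts ruhlen, pbase, and phoible first, which are the datasets with the most unmapped graphemes in this table.
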
@tresoldi, @cormacanderson, I added bdproto, segbo, and saphon, which gives us 18 datasets now, and I used our workflow to correct these marginal datasets. Please have a look when you find time, as this shows how the workflow works. The important files are the ones called `graphemes.tsv` in the `sources` folder.
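
For reviewing, it may help to list those files in one go. A minimal sketch, assuming the files sit somewhere below a `sources` directory in the repository root; the exact layout is not spelled out here, and `find_grapheme_files` is a hypothetical helper, not part of the existing Python code:

```python
from pathlib import Path


def find_grapheme_files(root="sources"):
    """Return every file named graphemes.tsv below root, sorted by path.

    Hypothetical helper: the directory layout is an assumption, so the
    search is recursive rather than tied to a fixed depth.
    """
    return sorted(Path(root).rglob("graphemes.tsv"))
```

Calling `find_grapheme_files()` from the repository root would then return one path per dataset that has a mapping file.
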
I think we can merge and keep refining the mapping later, correct?
Yes, just merge. You can also merge the Python code. You could then look at adding more datasets (there are still issues; I prepared even more datasets in the morning, which you filed) and at refining existing ones, e.g., jipa, which I just made as well.