IMPORTANT: many missing sounds in lapsyd and eurasian and wrong representation in phoible

LinguList commented 3 years ago

There are inconsistencies in our transcriptiondata, resulting from:

U+02BC (modifier letter apostrophe) vs. U+2019 (right single quotation mark): both are normalized to the former, but our phoible data uses U+2019, while the phoible CLDF dataset occasionally has U+02BC, so we need to redo some of the mapping systematically
NFD vs. NFC
missing sounds in lapsyd and eurasian

These should be solved before releasing.
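
To make the first two points concrete, here is a minimal sketch of how the two apostrophe code points and the two normalization forms produce strings that look identical but compare as different. The helper names `canonical` and `describe` are hypothetical, and the choice of NFD as the target form is an assumption, not something stated in this thread.

```python
import unicodedata

MODIFIER_APOSTROPHE = "\u02bc"  # MODIFIER LETTER APOSTROPHE
RIGHT_QUOTE = "\u2019"          # RIGHT SINGLE QUOTATION MARK


def canonical(grapheme, form="NFD"):
    """Bring a grapheme to one normalization form and one apostrophe.

    Unicode normalization alone does not unify U+2019 and U+02BC,
    so the replacement has to be done explicitly.
    """
    normalized = unicodedata.normalize(form, grapheme)
    return normalized.replace(RIGHT_QUOTE, MODIFIER_APOSTROPHE)


def describe(grapheme):
    """List the code points of a grapheme, for inspecting mismatches."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in grapheme]


ejective_a = "k\u2019"   # ejective written with the right quotation mark
ejective_b = "k\u02bc"   # ejective written with the modifier apostrophe
nasal_nfc = "\u00f5"     # precomposed o with tilde (NFC)
nasal_nfd = "o\u0303"    # o followed by a combining tilde (NFD)

print(ejective_a == ejective_b, canonical(ejective_a) == canonical(ejective_b))
print(nasal_nfc == nasal_nfd, canonical(nasal_nfc) == canonical(nasal_nfd))
print(describe(ejective_a))
```

Applying the same function to both the dataset graphemes and the mapping keys before comparing them would rule out these two sources of mismatch, leaving only the sounds that are really missing.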

LinguList commented 3 years ago

Examples for lapsyd:

ɖɽ          | 1
ʈɽ          | 1
ɳɖɽ         | 1
'ɤ̃'        | 1
'ɵ'         | 1
'ə̃'        | 1
'ə˞'        | 1
'rrʲ'       | 1
'əː'        | 1
d̪n̪        | 1
ɟɲ          | 1
dl          | 1
ɟʎ          | 1
ɖɳ          | 1
ɖɭ          | 1
dn          | 1
'ẽː'       | 1
'ɤ'         | 2
'õː'       | 2
pʲʰ         | 3
t̪ʲʰ        | 3
'ẽ'        | 6
'õ'        | 7
'oː'        | 13
'eː'        | 13
'rr'        | 14
'ə'         | 18
'e'         | 43
'o'         | 45

So when exporting the data for lapsyd, these were clearly missed!

LinguList commented 3 years ago

And in JIPA, we have clear cases that point to errors in the data:

ai au       | 1
w̆ r        | 1
n̪d̪ʲ ntʲ   | 1
aʔ iʔ       | 1
            | 1
ǁ’          | 1
eː          | 1
ɛːɒ̯ˤ; ɪːɒ̯ˤ    | 1
əʁ̞         | 1
l           | 1
ld          | 1
l ʎ         | 1
ɔoũ        | 1
tʃʰ dʒ      | 1
r           | 1
v s         | 1
ɔ̤          | 1
əɪa         | 1
əʊɪ         | 1
eəɪ         | 1
iəɪ         | 1
aʊɪ         | 1
oəɪ         | 1
(ɯ)         | 1
(y)         | 1
u ɚ         | 1
uai         | 1
iou         | 1
uei         | 1
ɛ ɛː        | 1
ɒ ɒː        | 1
øː ɑː       | 1
jai̯        | 1
jau̯        | 1
jeu̯        | 1
wei̯        | 1
wai̯        | 1
z̻          | 1
iau         | 2

E.g., all the entries that contain spaces.
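
Entries like these could be flagged automatically before they reach the mapping. The following is only a sketch: the function name `suspicious` and the set of characters treated as suspect are assumptions chosen to match the examples listed above.

```python
def suspicious(grapheme):
    """Return the reasons a grapheme looks like a data error (empty list if none)."""
    reasons = []
    if not grapheme.strip():
        # Nothing but whitespace, or nothing at all.
        reasons.append("empty")
    elif any(ch.isspace() for ch in grapheme):
        # Several sounds in one cell, or a stray leading/trailing space.
        reasons.append("whitespace")
    if any(ch in "();," for ch in grapheme):
        # Brackets and separators are not part of a single sound.
        reasons.append("punctuation")
    return reasons


for entry in ["ai au", "", "(y)", "l", "l "]:
    print(repr(entry), suspicious(entry))
```

A plain `l` passes this check, while `l` followed by a space does not, which is one way to test whether the seemingly missing simple sounds only differ by whitespace.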

cormacanderson commented 3 years ago

I thought I flagged some of the JIPA ones. A few of them I also wrote to @SimonGreenhill about. Others I may have missed.

As for the LAPSyD ones, it's a bit of a mystery to me what the problem is with some of them, e.g. t̪ʲʰ and pʲʰ, which should be fine. Checking https://github.com/cldf-clts/clts/tree/master/sources/lapsyd/graphemes I see that these are labelled there in the BIPA column, which points to something wrong.

In all, the fact that these things are not sorted would lead me to think that there might be a few things left to do with CLTS and that we should also check some of the other datasets. I don't have time to look at this today and probably not tomorrow either, but should have time later this week.

@LinguList if you tell me what I can do here, I'll do it on Monday. Is this a case of remapping these in https://github.com/cldf-clts/clts/tree/master/sources?

LinguList commented 3 years ago

@cormacanderson, what I think happened is that the graphemes.tsv lists we compiled for phoible, lapsyd, eurasian, and jipa do not actually show all the symbols that we find in the original datasets (!). So what this means is that the lists of graphemes should be recompiled from the datasets (by this, I mean https://github.com/cldf-datasets/jipa and the like).

The fact that even the l is missing in JIPA is a bit alarming. But what I also need to check is whether there's a space or something attached to it. So the procedure would be:

check the lapsyd, phoible (phoible is okay mostly), jipa, eurasian from their CLDF-datasets
compile a list of graphemes.tsv for CLTS for each dataset
compare the list with existing graphemes.tsv and add missing symbols

All in all, this can be done automatically by myself up to the point where it comes to checking the last elements.
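
The three steps above can be sketched roughly as follows. This is not the script that was actually run: the column names `Value` and `GRAPHEME`, the function names, and the toy data are assumptions for illustration, and a real run would read the dataset's CLDF value table and the existing graphemes.tsv from disk instead of in-memory strings.

```python
import csv
import io
import unicodedata
from collections import Counter


def count_graphemes(rows, column="Value"):
    """Count NFD-normalized graphemes in an iterable of dict rows."""
    return Counter(unicodedata.normalize("NFD", row[column]) for row in rows)


def known_graphemes(rows, column="GRAPHEME"):
    """Collect the NFD-normalized graphemes that already have a mapping."""
    return {unicodedata.normalize("NFD", row[column]) for row in rows}


def missing_graphemes(dataset_rows, mapping_rows):
    """Return (grapheme, frequency) pairs absent from the mapping, rarest first."""
    known = known_graphemes(mapping_rows)
    counts = count_graphemes(dataset_rows)
    missing = [(g, n) for g, n in counts.items() if g not in known]
    return sorted(missing, key=lambda item: (item[1], item[0]))


# Toy stand-ins for a CLDF value table and an existing graphemes.tsv.
dataset_csv = "ID,Value\n1,e\n2,e\n3,o\n4,p\n"
mapping_tsv = "GRAPHEME\tBIPA\np\tp\n"

dataset_rows = list(csv.DictReader(io.StringIO(dataset_csv)))
mapping_rows = list(csv.DictReader(io.StringIO(mapping_tsv), delimiter="\t"))

result = missing_graphemes(dataset_rows, mapping_rows)
for grapheme, frequency in result:
    print(f"{grapheme:12}| {frequency}")
```

The graphemes are deliberately not stripped of whitespace, so that entries with stray spaces still show up as missing and can be inspected by hand in the last step.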

In the meantime, @cormacanderson, if you have time, it would be nice if you could already look at the results that I computed, as I'd like to know whether I should compute more or whether this is okay. A list of individual differences can and will also be output for you to inspect.

LinguList commented 3 years ago

I have identified all sounds by tweaking cldf-datasets/lapsyd/ and the data is now completely covered.

LinguList commented 3 years ago

There is one sound , as you can see when checking pkg/transcriptiondata/lapsyd, @cormacanderson, but it is not clear what it means anyway. Otherwise, we are good with lapsyd; I'll check eurasian later.

cormacanderson commented 3 years ago

We can resolve this sound in the same way as I have dealt with other unspecified coronals in LAPSyD, i.e. by using the symbol without diacritic. However, that means adding a sound to consonants.csv. I've put in a PR for this.

cormacanderson commented 3 years ago

Nice work on resolving this @LinguList

cldf-clts / clts

IMPORTANT: many missing sounds in lapsyd and eurasian and wrong representation in phoible #91