cldf-clts / clts

Cross-Linguistic Transcription Systems
https://clts.clld.org

Checking Transcription data with new script #84

Closed LinguList closed 3 years ago

LinguList commented 3 years ago

I added a new script that tests whether the data is valid. Here's what it yields for phoible:

$ clts test_dataset phoible
INFO    cldf-clts/clts at /home/mattis/data/datasets/cldf/clts
ERROR   unknown sound encountered for BIPA «ts͇» (Line 1333)
ERROR   unknown sound encountered for BIPA «ts͇ʰ» (Line 1334)
ERROR   unknown sound encountered for BIPA «ts͇» (Line 1539)
ERROR   unknown sound encountered for BIPA «ts͇ʰ» (Line 1540)
INFO    duplicate grapheme in the data: ˦
ERROR   unknown sound encountered for BIPA «ⁿdz͇» (Line 3168)
ERROR   unknown sound encountered for BIPA «ⁿts͇ʰ» (Line 3172)
INFO    Found 6 errors in the data.

@cormacanderson, I think the test is correctly pointing to a problem here, so would you mind checking this as well once you go through the other datasets? I already ran the test on jipa and lapsyd, and it did not yield any errors.
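For readers who want to see what such a check involves, here is a minimal sketch of the two things the log above reports: target sounds that cannot be resolved, and source graphemes that occur more than once. It is not the actual `clts test_dataset` implementation. The column names `GRAPHEME` and `BIPA`, the helper `check_rows`, the `KNOWN_SOUNDS` set, and the sample data are all assumptions made for illustration; a real check would resolve sounds against the BIPA transcription system rather than a hard-coded set.

```python
import csv
import io
from collections import Counter


def check_rows(rows, known_sounds):
    """Collect unknown BIPA values and duplicated source graphemes.

    `rows` is an iterable of dicts with the (assumed) keys GRAPHEME and BIPA.
    Line numbers start at 2 because line 1 of the file is the header.
    """
    errors = []
    counts = Counter()
    for line_no, row in enumerate(rows, start=2):
        counts[row["GRAPHEME"]] += 1
        if row["BIPA"] not in known_sounds:
            errors.append((line_no, row["BIPA"]))
    duplicates = sorted(g for g, n in counts.items() if n > 1)
    return errors, duplicates


# Hypothetical stand-in for the set of sounds the target system recognises.
KNOWN_SOUNDS = {"ts", "tsʰ", "˦"}

# Hypothetical sample in the shape of a graphemes.tsv file.
SAMPLE_TSV = (
    "GRAPHEME\tBIPA\n"
    "ts\tts\n"
    "ts͇\tts͇\n"
    "˦\t˦\n"
    "˦\t˦\n"
)

rows = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
errors, duplicates = check_rows(rows, KNOWN_SOUNDS)

# Report in a format loosely modelled on the messages shown above.
for line_no, sound in errors:
    print(f"ERROR   unknown sound encountered for BIPA «{sound}» (Line {line_no})")
for grapheme in duplicates:
    print(f"INFO    duplicate grapheme in the data: {grapheme}")
print(f"INFO    Found {len(errors)} errors in the data.")
```

In this sketch a duplicate grapheme is reported but not counted as an error, which matches the log above, where the duplicate is an INFO line and only the six ERROR lines are counted.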

cormacanderson commented 3 years ago

I will remap these to ts̠ etc., which is our default representation here. This was not consistent in consonants.tsv until the last commit.

cormacanderson commented 3 years ago

Sorry, that should read tθ̠ above. I have submitted the changes to the PHOIBLE graphemes.tsv.

LinguList commented 3 years ago

Are these already done? Or are they in a PR? I just pulled the data, and did not see any change.

LinguList commented 3 years ago

sorry:

$ clts test_dataset phoible
INFO    cldf-clts/clts at /home/mattis/data/datasets/cldf/clts
INFO    duplicate grapheme in the data: ˦
INFO    No errors found in the data

So this is fine!