cldf-clts / clts

Cross-Linguistic Transcription Systems
https://clts.clld.org

Incorporate previous mappings by @cormacanderson whenever possible #73

Closed tresoldi closed 3 years ago

tresoldi commented 3 years ago

This PR incorporates, whenever possible, the previous mapping corrections by @cormacanderson to the phoible, jipa, and lapsyd sources. This is related to his comments here, following the list I had previously prepared here.

Note that, in order to make the diff shorter and review easier, I am only updating the related graphemes.tsv files. After the changes are accepted I will regenerate the corresponding packages. Also note that, because of this, I am making a PR into the sources branch, and not into master.

Not all mappings provided in those files could be used; most are related to things we are not supporting, such as triphthongs, but a handful are graphemes that are not parsed by pyclts as it is. The numbers of invalid graphemes are 129 for phoible, 5 for jipa, and 19 for lapsyd. I am attaching here a detailed list of all such graphemes (phoible.not_bipa.txt, jipa.not_bipa.txt, and lapsyd.not_bipa.txt).

In total, this PR corrects/changes/fixes 38 mappings for phoible, 13 for jipa, and 11 for lapsyd. I am attaching lists of replacements as well: phoible.replaced.txt, jipa.replaced.txt, and lapsyd.replaced.txt.

cormacanderson commented 3 years ago

First the PHOIBLE replacements.

OK:
Raw grapheme ð̺̞ mapping was replaced from ɹ̪̺ to ð̞̺.
Raw grapheme d̙ˤ mapping was replaced from to d̙ˤ.
Raw grapheme xʀ̥ mapping was replaced from <NA> to xʀ̥.
Raw grapheme ᴅ̪̰ mapping was replaced from <NA> to ɾ̪̰.
Raw grapheme mapping was replaced from ɾ to ɾ̪.

OK frictionalised diacritic:
Raw grapheme r̪͓|r͓ mapping was replaced from <NA> to .
Raw grapheme mapping was replaced from ʮ to .
Raw grapheme ʟ͓̥ mapping was replaced from ʟ̥͓ to ʟ͓̥.
Raw grapheme ɯ͓ mapping was replaced from ɿ to ɯ͓.
Raw grapheme kǁ͓ʰ mapping was replaced from ǁʰ͓ to ǁ͓ʰ.

OK with acceptable loss of information:
Raw grapheme ɹ̪̩ˠ mapping was replaced from ɹ̪̩ˠ to ɹ̪ˠ.
Raw grapheme ɾ̪̊ mapping was replaced from ɾ̪̥ to ɾ̪̊.
Raw grapheme ɾ̪̊ʰ mapping was replaced from ɾ̪̥ʰ to ɾ̪̊ʰ.

OK but possibly diagnostic of problems (where do ⁿdɹ̠˔ʷ and ɹ̪ come from in the first place?):
Raw grapheme n̠d̠ʒʷ mapping was replaced from ⁿdɹ̠˔ʷ to ⁿdʒʷ.
Raw grapheme ð̞ mapping was replaced from ɹ̪ to ð̞.

NOT OK and probably diagnostic of problems:
Raw grapheme ⁿt̪s̪ʰ mapping was replaced from ⁿts̪ʰ to ⁿtθʰ.
Raw grapheme n̪t̪s̪ʰ mapping was replaced from ⁿts̪ʰ to ⁿtθʰ.

OK diacritic order (according to https://github.com/cldf-clts/clts/issues/54):
Raw grapheme pʲʰ mapping was replaced from pʰʲ to pʲʰ.
Raw grapheme t̪ʲʰ mapping was replaced from t̪ʰʲ to t̪ʲʰ.

NOT OK diacritic order (according to https://github.com/cldf-clts/clts/issues/54):
Raw grapheme pʷˠʰ mapping was replaced from pʷˠʰ to pʷʰˠ.
Raw grapheme pˠʰ mapping was replaced from pˠʰ to pʰˠ.
Raw grapheme pˤʰ mapping was replaced from pˤʰ to pʰˤ.
Raw grapheme qʷˤʰ mapping was replaced from qʷˤʰ to qʷʰˤ.
Raw grapheme qˤʰ mapping was replaced from qˤʰ to qʰˤ.
Raw grapheme tsˤʰ mapping was replaced from tsˤʰ to tsʰˤ.
Raw grapheme t̠ʃˤʰ mapping was replaced from tʃˤʰ to tʃʰˤ.
Raw grapheme t̪ˠʰ mapping was replaced from t̪ˠʰ to t̪ʰˠ.
Raw grapheme t̪ˠʰ|tˠʰ mapping was replaced from <NA> to tʰˠ.
Raw grapheme t̪ˤʰ mapping was replaced from t̪ˤʰ to t̪ʰˤ.
Raw grapheme ŋ̥ǃˠˀ mapping was replaced from ŋǃˠˀ to ŋǃˀˠ.

ISSUE? Poor readability of diacritics (non-release diacritics on length):
Raw grapheme p͉ʲ mapping was replaced from p̚ʲ to pʲ̚.
Raw grapheme t͉ʲ mapping was replaced from t̚ʲ to tʲ̚.
Raw grapheme k͉ʷ mapping was replaced from k̚ʷ to kʷ̚.
Raw grapheme ä̠ː mapping was replaced from <NA> to aː̈.
Raw grapheme äː mapping was replaced from äː to aː̈.
Raw grapheme a̟ː mapping was replaced from a̟ː to aː̟.

ISSUE? Poor readability of diacritics (and maybe here the stop should have the devoicing diacritic?):
Raw grapheme d̥ʒ̥ mapping was replaced from dʒ̊ to dʒ̥.
Raw grapheme d̥ʒ̊ mapping was replaced from dʒ̊ to dʒ̥.

cormacanderson commented 3 years ago

Now the PHOIBLE not BIPA, first the correctly PHOIBLE not BIPA. TRIPHTHONGS For grapheme aʊɪ, BIPA aʊɪ is not supported. For grapheme eəɪ, BIPA eəɪ is not supported. For grapheme iau, BIPA iau is not supported. For grapheme iäu̽, BIPA iäu̽ is not supported. For grapheme iou, BIPA iou is not supported. For grapheme iɑu, BIPA iɑu is not supported. For grapheme iəɪ, BIPA iəɪ is not supported. For grapheme iɛi̯, BIPA iɛi̯ is not supported. For grapheme i̯ai, BIPA i̯ai is not supported. For grapheme i̯ai̯, BIPA i̯ai̯ is not supported. For grapheme i̯au, BIPA i̯au is not supported. For grapheme i̯au̯, BIPA i̯au̯ is not supported. For grapheme i̯ei̯, BIPA i̯ei̯ is not supported. For grapheme i̯eu̯, BIPA i̯eu̯ is not supported. For grapheme i̯oi, BIPA i̯oi is not supported. For grapheme i̯uaː, BIPA i̯uaː is not supported. For grapheme i̯ui, BIPA i̯ui is not supported. For grapheme i̯uo, BIPA i̯uo is not supported. For grapheme i̯ũo, BIPA i̯ũo is not supported. For grapheme i̯æi, BIPA i̯æi is not supported. For grapheme i̯ɛi, BIPA i̯ɛi is not supported. For grapheme i̯ɛi̯, BIPA i̯ɛi̯ is not supported. For grapheme oəɪ, BIPA oəɪ is not supported. For grapheme uai, BIPA uai is not supported. For grapheme uei, BIPA uei is not supported. For grapheme uei̯, BIPA uei̯ is not supported. For grapheme uie, BIPA uie is not supported. For grapheme uiɛ, BIPA uiɛ is not supported. For grapheme uɑi, BIPA uɑi is not supported. For grapheme uəi, BIPA uəi is not supported. For grapheme u̯ai, BIPA u̯ai is not supported. For grapheme u̯ai̯, BIPA u̯ai̯ is not supported. For grapheme u̯ei̯, BIPA u̯ei̯ is not supported. For grapheme u̯eu̯, BIPA u̯eu̯ is not supported. For grapheme u̯æi, BIPA u̯æi is not supported. For grapheme u̯ɛi, BIPA u̯ɛi is not supported. For grapheme u̯ɛi̯, BIPA u̯ɛi̯ is not supported. For grapheme yia, BIPA yia is not supported. For grapheme əɪa, BIPA əɪa is not supported. For grapheme əʊɪ, BIPA əʊɪ is not supported. For grapheme ʊ̯aɪ̯, BIPA ʊ̯aɪ̯ is not supported.

ILLICIT "CLUSTERS" For grapheme d̪l̪, BIPA d̪l̪ is not supported. For grapheme st, BIPA st is not supported. For grapheme s̙ˤ, BIPA s̙ˤ is not supported. For grapheme s̻θ, BIPA s͇θ is not supported. For grapheme t̪ʙ, BIPA t̪ʙ is not supported. For grapheme r̠̙, BIPA r̠̙ is not supported. For grapheme , BIPA is not supported. For grapheme kf, BIPA kf is not supported. For grapheme kl, BIPA kl is not supported. For grapheme ld, BIPA ld is not supported. For grapheme l̠˞, BIPA l̠˞ is not supported. For grapheme m̥m, BIPA m̥m is not supported. For grapheme n̥n, BIPA n̥n is not supported. For grapheme ŋ̊ŋ, BIPA ŋ̊ŋ is not supported. For grapheme xh, BIPA xh is not supported. For grapheme xk, BIPA xk is not supported. For grapheme ɬl, BIPA ɬl is not supported. For grapheme ɬʟ͓̥, BIPA ʟ̝̊ɬ is not supported. For grapheme ŋ̥ǂxˀ, BIPA ŋ̊ǂxˀ is not supported. For grapheme ɡv, BIPA gv is not supported. For grapheme ɡ̰ǂx, BIPA ŋǂx is not supported. For grapheme ɡ̰ǃx, BIPA ŋǃx is not supported. For grapheme ɣv, BIPA ɣv is not supported. For grapheme ʃt, BIPA ʃt is not supported. For grapheme ʍw, BIPA ʍw is not supported. For grapheme ɗʒ, BIPA ɗʒ is not supported. For grapheme ʀʁ, BIPA ʀʁ is not supported.

cormacanderson commented 3 years ago

With my understanding of the changes we have made, the following should be parsed: SHOULD BE PARSED For grapheme , BIPA is not supported.

SHOULD BE PARSED AS ALIAS ACCORDING TO https://github.com/cldf-clts/clts/issues/61 For grapheme ð͇ˠ, BIPA ð͇ˠ is not supported.

SHOULD BE PARSED ACCORDING TO https://github.com/cldf-clts/clts/issues/44 For grapheme ŋ̥m̥, BIPA ŋ̥m̥ is not supported.

SHOULD BE PARSED ACCORDING TO https://github.com/cldf-clts/clts/issues/62 For grapheme bz, BIPA bz is not supported. For grapheme bzʷ, BIPA bzʷ is not supported. For grapheme b̤z̤, BIPA bzʱ is not supported. For grapheme mbz, BIPA ⁿbz is not supported. For grapheme ps, BIPA ps is not supported. For grapheme psʰ, BIPA psʰ is not supported. For grapheme psʷ, BIPA psʷ is not supported. For grapheme psʷʰ, BIPA psʷʰ is not supported. For grapheme pʃʰ, BIPA pʃʰ is not supported. For grapheme pʃʼ, BIPA pʃʼ is not supported.

SHOULD BE PARSED ACCORDING TO https://github.com/cldf-clts/clts/issues/45 For grapheme dr, BIPA dr is not supported. For grapheme , BIPA is not supported. For grapheme ndr, BIPA ⁿdr is not supported. For grapheme ndɾ, BIPA ⁿdɾ is not supported. For grapheme n̠t̠ʃɾ, BIPA ⁿtʃɾ̥ is not supported. For grapheme tr, BIPA tr is not supported. For grapheme , BIPA is not supported. For grapheme ɖr, BIPA ɖɽ is not supported. For grapheme ɖr̠, BIPA ɖr̠ is not supported. For grapheme ɖr̠͓, BIPA ɖr̠͓ is not supported. For grapheme ɖɽ, BIPA ɖɽ is not supported. For grapheme ɳɖr, BIPA ⁿɖɽ is not supported. For grapheme ɳɖr̠, BIPA ⁿɖɽ is not supported. For grapheme ɳɖɽ, BIPA ⁿɖɽ is not supported. For grapheme ɳʈr̠̥, BIPA ⁿʈr̠̥ is not supported. For grapheme ⁿɖɽ, BIPA ⁿɖɽ is not supported. For grapheme ⁿʈɽʰ, BIPA ⁿʈɽʰ is not supported. For grapheme ʈr, BIPA ʈɽ is not supported. For grapheme ʈr̠̥, BIPA ʈr̠̥ is not supported. For grapheme ʈɹ̠̥, BIPA ʈɹ̠̥ is not supported. For grapheme ʈɽ, BIPA ʈɽ is not supported. For grapheme ʈɽʰ, BIPA ʈɽʰ is not supported.

cormacanderson commented 3 years ago

Specific mappings: UNRECOGNISED AFFRICATES (SHOULD BE PARSED?) For grapheme d̠ʒː, BIPA dːʒ is not supported. For grapheme d̠ːʒ, BIPA dːʒ is not supported. For grapheme d̪ʒ, BIPA d̪ʒ is not supported.

BIPA    GRAPHEME
d̠ʒː    d̠ʒː
d̠ʒː    d̠ːʒ
dʒ  d̪ʒ

A SPECIFIC MAPPING For grapheme ˀy, BIPA ˀy is not supported.

BIPA    GRAPHEME
ˀj  ˀy

cormacanderson commented 3 years ago

There is a failure to recognise the laminal diacritic. Should I put up an issue?

For grapheme d̪ð̪, BIPA dð̪ is not supported.
For grapheme t̪θ̪, BIPA tθ̪ is not supported.
For grapheme ð̪̺, BIPA ð̪̺ is not supported.
For grapheme ð̪̙ˤ, BIPA ð̙ˤ is not supported. (For this one, the RTR diacritic should be supported anyway.)
For grapheme ɫ̪, BIPA ɫ̪ is not supported.
For grapheme ɹ̪̹̩, BIPA ɹ̪̹̩ is not supported.
For grapheme ˀt̪ɬ, BIPA ˀt̪ɬ is not supported.

cormacanderson commented 3 years ago

The following are dealt with according to https://github.com/cldf-clts/clts/issues/51 and https://github.com/cldf-clts/clts/issues/61. I have provided specific mappings below.

For grapheme , BIPA is not supported. For grapheme , BIPA is not supported. For grapheme z̪͇|z͇, BIPA is not supported. For grapheme ts͇, BIPA ts͇ is not supported. For grapheme t͇s͇, BIPA t͇s͇ is not supported. For grapheme ts͇ʰ, BIPA ts͇ʰ is not supported. For grapheme t͇s͇ʰ, BIPA t͇s͇ʰ is not supported. For grapheme ⁿt͇s͇ʰ, BIPA ⁿt͇s͇ʰ is not supported. For grapheme d͇z͇, BIPA dz͇ is not supported. For grapheme ⁿd͇z͇, BIPA ⁿd͇z͇ is not supported. For grapheme ʃ͇, BIPA ʃ͇ is not supported. For grapheme ʒ͇, BIPA ʒ͇ is not supported. For grapheme ʂ͇, BIPA ʂ͇ is not supported. For grapheme ʐ͇, BIPA ʐ͇ is not supported. For grapheme ʈʂ͇, BIPA ʈʂ͇ is not supported.

BIPA    GRAPHEME
θ̠  s͇
ð̠  z͇
ð̠  z̪͇|z͇
tθ̠ ts͇
tθ̠ t͇s͇
tθ̠ʰ    ts͇ʰ
tθ̠ʰ    t͇s͇ʰ
ⁿtθ̠ʰ   ⁿt͇s͇ʰ
dð̠ d͇z͇
ⁿdð̠    ⁿd͇z͇
ɹ̠̊˔    ʃ͇
ɹ̠˔ ʒ͇
ɻ̝̊ ʂ͇
ɻ̝  ʐ͇
ʈɻ̝̊    ʈʂ͇

tresoldi commented 3 years ago

I think at least some of those might be cases of valid sounds (i.e., accepted by the model) which are not listed among vowels/consonants. I will check each one of them.

LinguList commented 3 years ago

@cormacanderson, I by no means understand what you mentioned here. All of these cases are now mapped; I mapped them manually, e.g. dr, which is represented as d+superscript r, and bz, which is b+superscript s.

LinguList commented 3 years ago

E.g., for your bz cases, you find them all here.

BIPA GRAPHEME
bˢ bz
bˢʷ bzʷ
ⁿbˢ mbz

LinguList commented 3 years ago

So this is all fine for me and can be merged now.

LinguList commented 3 years ago

Or am I missing something here? I thought we had all these covered now by explicit mappings.

cormacanderson commented 3 years ago

Yes, but these are all in the phoible.not_bipa.tsv above, where it says " For grapheme bz, BIPA bz is not supported.". Also they are not in the replacements file. So I flag them here.

tresoldi commented 3 years ago

The lists mean that the graphemes, as given in the previously suggested lists, are not parsed (i.e. clts.bipa[grapheme] fails). Either we add them as aliases, or we replace the mapping. I will go through them.

On 21 Nov 2020 at 16:00, Cormac Anderson wrote: Yes, but these are all in the phoible.not_bipa.tsv above, where it says "For grapheme bz, BIPA bz is not supported." so I flag them here.

cormacanderson commented 3 years ago

Ah, now I understand. We replace the mapping then. More efficient here, it seems to me, would be for me to go through the phoible.tsv file I provided earlier and make the mappings again. Many of them I made before we had decided on certain issues, such as this one.

Note that not all of the issues flagged above are problems of this nature. Still problems, for example, are the write order inconsistencies, or the following: OK but possibly diagnostic of problems (where do ⁿdɹ̠˔ʷ and ɹ̪ come from in the first place?): Raw grapheme n̠d̠ʒʷ mapping was replaced from ⁿdɹ̠˔ʷ to ⁿdʒʷ. Raw grapheme ð̞ mapping was replaced from ɹ̪ to ð̞.

NOT OK and probably diagnostic of problems: Raw grapheme ⁿt̪s̪ʰ mapping was replaced from ⁿts̪ʰ to ⁿtθʰ. Raw grapheme n̪t̪s̪ʰ mapping was replaced from ⁿts̪ʰ to ⁿtθʰ.

LinguList commented 3 years ago

Guys, you are giving me a hard time here. I repeat once more the mapping procedure:

  1. try to map automatically
  2. use the algorithm for mapping (not for BIPA) to GUESS extended cases, e.g., mt -> superscript-n+t, and flag them in the automatically mapped file
  3. correct these mappings by hand
  4. submit the corrected file

We will never accept nt in CLTS as a valid sound in bipa, but our extended algorithm for automated mapping of transcription DATA can guess it, and we can make an explicit mapping.

So please, please, please, let us really pay attention to this procedure, and not dream of making the BIPA acceptance rate higher for things that would blow up the system. BIPA should stay as is for the MOST part; only individual mappings are allowed to be modified at this stage!
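The four steps above can be sketched roughly as follows. This is only a toy illustration and not the pyclts mapper: `KNOWN`, `NASALS`, `guess`, and `map_row` are hypothetical names, and the single guessing rule (nasal + stop becomes a prenasalized stop) is an assumption made for the example.

```python
# Toy illustration of the mapping procedure; NOT the actual pyclts mapper.

# Hypothetical stand-in for the set of graphemes that BIPA accepts.
KNOWN = {"t", "d", "ⁿt", "ⁿd"}

# Hypothetical trigger symbols for the "extended" guessing step.
NASALS = ("n", "m", "ŋ")


def guess(raw):
    """Guess a BIPA grapheme for a raw cluster such as 'nt' (step 2)."""
    if len(raw) > 1 and raw[0] in NASALS:
        candidate = "ⁿ" + raw[1:]
        if candidate in KNOWN:
            return candidate
    return None


def map_row(raw, bipa=""):
    """Return (bipa, guessed_flag) for one row of a graphemes table."""
    if bipa:                      # an existing mapping is left untouched
        return bipa, False
    if raw in KNOWN:              # step 1: direct automatic mapping
        return raw, False
    guessed = guess(raw)          # step 2: guess and flag for manual review
    if guessed is not None:
        return guessed, True
    return "<NA>", False          # nothing found; left for steps 3 and 4


rows = [("t", ""), ("nt", ""), ("st", ""), ("mt", "ⁿt")]
mapped = [(raw,) + map_row(raw, bipa) for raw, bipa in rows]
```

In this sketch "nt" never becomes a valid BIPA sound; it is only mapped to an accepted grapheme and flagged so that a human can confirm or correct it.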

LinguList commented 3 years ago

And by individual mappings I mean this: you go to the file source/phoible/graphemes.tsv and correct a supposedly erroneous mapping there, or replace a <NA> by something else, but if you do so, it then has to be checked a second time to see if this new suggestion can be correctly parsed by our BIPA algorithm. If that IS the case, we accept it, and we run clts make_dataset phoible to copy the file from sources/phoible/graphemes.tsv into pkg/transcriptiondata/phoible.tsv. And that is then the authoritative mapping for a given version.
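A minimal sketch of that second check, under stated assumptions: the column names `GRAPHEME` and `BIPA`, the sample rows, and the `is_parsed` callback are all hypothetical. In practice the callback would presumably wrap the `clts.bipa[grapheme]` lookup mentioned earlier in the thread; a plain set stands in for it here so the example runs on its own.

```python
import csv
import io

# Hypothetical excerpt of a graphemes.tsv file (column names are assumed).
SAMPLE = "GRAPHEME\tBIPA\nbz\tbˢ\nbzʷ\tbz\naʊɪ\t<NA>\n"


def unparsed_mappings(tsv_text, is_parsed):
    """Return (raw, bipa) pairs whose BIPA value is rejected by is_parsed."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    problems = []
    for row in reader:
        bipa = row["BIPA"]
        if bipa == "<NA>":        # explicitly unmapped; nothing to check
            continue
        if not is_parsed(bipa):
            problems.append((row["GRAPHEME"], bipa))
    return problems


# Stand-in for the real parser check.
accepted = {"bˢ", "bˢʷ"}
problems = unparsed_mappings(SAMPLE, lambda g: g in accepted)
for raw, bipa in problems:
    print(f"For grapheme {raw}, BIPA {bipa} is not supported.")
```

Only rows that pass this check would be accepted before regenerating the package file.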

LinguList commented 3 years ago

@cormacanderson, you NEED to pay more attention to the core of what BIPA does, and stop seeing it as some magic thing. So if you ask yourself about

OK but possibly diagnostic of problems (where do ⁿdɹ̠˔ʷ and ɹ̪ come from in the first place?):

Raw grapheme ð̞ mapping was replaced from ɹ̪ to ð̞.

Please search for the elements in our file consonants.tsv.

In this file, you FIND that you gave me exactly the same definition for both sounds, namely:

GRAPHEME | PHONATION | PLACE | MANNER | ALIAS | EXTRA | NOTE
ɹ̪ | voiced | dental | approximant |   |   |
ð̞ | voiced | dental | approximant | + |   |

If you tell me what the difference between the two is, we can delete the Alias for one, but after you insisted that I include the first one, I had to correct this, since the code tells me there are two sounds with different spellings but the same features (which is why we have the code checking).
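The kind of consistency check described here can be illustrated with a small sketch. It is not the check implemented in pyclts: the row layout and the function `feature_clashes` are hypothetical, and only three features are compared for brevity.

```python
from collections import defaultdict

# Hypothetical rows in the spirit of consonants.tsv:
# (grapheme, phonation, place, manner, is_alias)
ROWS = [
    ("ɹ̪", "voiced", "dental", "approximant", False),
    ("ð̞", "voiced", "dental", "approximant", True),
    ("t", "voiceless", "alveolar", "stop", False),
]


def feature_clashes(rows):
    """Find feature bundles claimed by more than one non-alias grapheme."""
    by_features = defaultdict(list)
    for grapheme, phonation, place, manner, is_alias in rows:
        if not is_alias:
            by_features[(phonation, place, manner)].append(grapheme)
    return {f: g for f, g in by_features.items() if len(g) > 1}


# With the alias marker in place there is no clash.
clashes_with_alias = feature_clashes(ROWS)

# Dropping the alias marker reproduces the clash described above.
no_alias = [row[:4] + (False,) for row in ROWS]
clashes_without_alias = feature_clashes(no_alias)
```

With one of the two spellings marked as an alias, the feature bundle has a single canonical grapheme; without the marker, the check reports both spellings for the same bundle.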

These things have a reason, and it is crucial to understand the algorithm to account for them.

If you are not happy with the order: the order is automated, so you need to modify the order again in the code.

LinguList commented 3 years ago

It will be more sustainable if we all talk about the files, not about the system as some abstract thingy here. Everything has a reason, the reason is in those files.

cormacanderson commented 3 years ago

First off, I'm fine following the procedure and doing the mapping in source/phoible/graphemes.tsv. I will download that file then, play around with the mappings and put in a PR then. Does that work?

I don't see BIPA as magic and am trying to get a handle on why certain things happen. That's not always easy though. You're quite right with those dental approximants, and now that I look at it, I suppose fair enough. After all, we are not dealing with rhotic as a feature.

Other cases remain problematic though. Why for example is ⁿt̪s̪ʰ replaced from ⁿts̪ʰ to ⁿtθʰ? After all, in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv we have ts̪ | voiceless | dental | affricate |   | airstream:sibilant Why is this replaced by the non-sibilant dental?

A separate issue. At https://lingpy.org/clts/ I find kɬ, but it's not in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv. Should I add this manually and put in a PR, or how should I proceed here?

I will post new comments now with problems that are not ones that can simply be resolved by mapping to source/phoible/graphemes.tsv. Does this work?

LinguList commented 3 years ago

@cormacanderson: the WHY is not BIPA but what I did mostly manually, and which @tresoldi confirmed. If you see this as wrong, you should definitely make a PR. @tresoldi will then CHECK that PR, by running through and checking if your suggestions are correct with respect to BIPA and discuss those parts. I did my best, but I may have easily made some errors there, it was a long week, after all...

cormacanderson commented 3 years ago

@LinguList I'm not criticising anything you or anyone else has done. As you pointed out yourself, I made a mistake above too. As you say, it's a long week and we're both human. This is a tricky business and I'm not one to throw stones, just trying to get to the bottom of issues.

I'm just looking for the correct way to proceed that will get this done as well as possible, with a minimum of effort and repeated work. However, I don't necessarily know the best way to proceed here and am asking for guidance.

If I download source/phoible/graphemes.tsv and https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv and add the necessary lines to them and then put up a new PR, does that work?

If not, then please advise. If so, do we merge this PR first or not?

LinguList commented 3 years ago

@cormacanderson, all fine, I think we are on the same page now. I'd prefer you to start ONLY with the file sources/phoible/graphemes.tsv, and see WHAT cases you think are wrongly mapped there. If you correct those, you can then send the file via email to @tresoldi, who can now do the whole workflow, and who'd check that this is all working correctly. After this has been done, any sounds that are NOT available in BIPA and which would have to be added can be added there, but I'd suggest this as a second step.

BTW: I'll update https://lingpy.org/clts/ now, so you have an idea of the most recent version of bipa, with Aliases, which is probably also useful (!). It also lists which sounds are treated the same in phoible, and you can click on symbols to see their unicode.

cormacanderson commented 3 years ago

Okay. Brief amendment to what you suggest there. I will first list the sounds here that I believe need to be: 1) added at source/phoible/graphemes.tsv, as a list for myself to work through 2) updated in BIPA, e.g. kɬ 3) not covered by us.

Then we have a provisional list (2) for after I amend source/phoible/graphemes.tsv. Seeing as I have already identified 3) here, it makes no sense to repeat this and possibly miss things.

@tresoldi are you good with this workflow?

LinguList commented 3 years ago

Yep, for kɬ, you could make an issue, and I'll update and check at the same time? If you provide it in the table format, this would be great, as it is easier to add this, be it that I add it or that @tresoldi does it.

LinguList commented 3 years ago

BTW, @cormacanderson, kɬ IS in clts, and it is listed as Alias for kʟ̥, please check with https://lingpy.org/clts/

cormacanderson commented 3 years ago

Yes, I know it is in CLTS (as I said above "At https://lingpy.org/clts/ I find kɬ, but it's not in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv."). So I should deal with this by changing the mapping in source/phoible/graphemes.tsv to kʟ̥?

cormacanderson commented 3 years ago

Again then, with the PHOIBLE replacements file. We have to check that the following does not recur after I change source/phoible/graphemes.tsv: Raw grapheme ⁿt̪s̪ʰ mapping was replaced from ⁿts̪ʰ to ⁿtθʰ. Raw grapheme n̪t̪s̪ʰ mapping was replaced from ⁿts̪ʰ to ⁿtθʰ.

The diacritic order is still wrong, according to what was agreed in https://github.com/cldf-clts/clts/issues/54. I will reopen this issue.

I have opened https://github.com/cldf-clts/clts/issues/74 to discuss diacritic readability in a number of combinations.

cormacanderson commented 3 years ago

There are four sounds that need to be added from PHOIBLE. I set up the issue here https://github.com/cldf-clts/clts/issues/75.

I don't understand why the trill-release and sibilant-release segments had the correct BIPA form in the .tsv file here but also came up as not supported above? Why is this happening?

I will email you the .tsv file now @tresoldi

LAPSyD and JIPA will follow (and update any issues accordingly).

cormacanderson commented 3 years ago

From the JIPA files there is very little, but I don't understand why we find Raw grapheme ʔnd mapping was replaced from ˀⁿd to ˀd. Raw grapheme ʔŋg mapping was replaced from ˀⁿg to ˀg. These are both in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv.

I wonder, also looking at the LAPSyD graphemes.tsv if the above is based off the latest version of the data? Most problems seem resolved there. There is even less for LAPSyD.

I'll send you these files now @tresoldi.

cormacanderson commented 3 years ago

Okay, that is all three graphemes.tsv files sent off. One reopened issue for the diacritics https://github.com/cldf-clts/clts/issues/54, and one new one for the four new sounds https://github.com/cldf-clts/clts/issues/75.

LinguList commented 3 years ago

Sorry if this was not clear: yes, you SHOULD change the mapping in phoible/graphemes.tsv

LinguList commented 3 years ago

The JIPA cases you mention, @cormacanderson, are indeed a wrong mapping. Probably human error, as the prenasalization plus glottalization is currently handled as a single feature.

cormacanderson commented 3 years ago

I did so re the mapping in phoible/graphemes.tsv. Re the JIPA sounds, yes, maybe, but these are correct in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv.

ˀⁿg | voiced | velar | stop |   | preceding:pre-glottalized-and-nasalized

As I say, I'm not sure if this is all based off the most recent version of the data, given that there are replacements of things that are in consonants.tsv and that PHOIBLE mappings to sounds that are in consonants.tsv are not being supported. Perhaps we can try again with the new .tsv files and the most recent state of https://github.com/cldf-clts/clts/tree/master/pkg/transcriptionsystems/bipa.

LinguList commented 3 years ago

As I say, I'm not sure if this is all based off the most recent version of the data, given that there are replacements of things that are in consonants.tsv and that PHOIBLE mappings to sounds that are in consonants.tsv are not being supported.

What you also need to understand, @cormacanderson, is that the mapping procedure is ONLY automated if the spot is not already filled in the column BIPA. So if the mapping is not re-done on all data, it will be left as is. It would be too much work to go and manually remap 400-500 sounds each time. However, contrasting a clean run with the manual run is a possibility.
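Contrasting a clean run with the manual run could look roughly like this. The sketch assumes both runs are available as plain dictionaries from raw grapheme to BIPA value; `replacement_report` and the sample data are hypothetical, and the output wording simply mirrors the replacement lists attached to this PR.

```python
def replacement_report(before, after):
    """List mappings that differ between two {raw grapheme: BIPA} tables."""
    lines = []
    # Only graphemes present in both tables can be compared.
    for raw in sorted(before.keys() & after.keys()):
        old, new = before[raw], after[raw]
        if old != new:
            lines.append(
                f"Raw grapheme {raw} mapping was replaced from {old} to {new}."
            )
    return lines


# Hypothetical data: the manually curated file versus a fresh automatic run.
manual = {"pʲʰ": "pʰʲ", "xk": "<NA>", "t": "t"}
clean = {"pʲʰ": "pʲʰ", "xk": "<NA>", "t": "t"}
report = replacement_report(manual, clean)
```

Each line of such a report points at a cell where the expert-provided value and the automatic value disagree, so only those cells need a second look instead of all 400-500 sounds.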

tresoldi commented 3 years ago

Most important part: I will refine the PR with the comments here and the mappings by @cormacanderson , starting tomorrow.


For the discussion, it is important to stress the difference between the internal representations of sounds and the graphemes. And, on that matter, between the graphemes as given by an expert and their normalized versions.

The lists I uploaded indicate graphemes that are not accepted and graphemes that are changed due to normalization. The fact that a grapheme is not accepted does not necessarily mean that the model is failing: for example, while we had the "frication" feature for vowels, vowels with frication were not listed in vowels.tsv, so that parsing their graphemes was failing (this is now fixed). As for normalization, it explains some of the changes as well, for matters like order of diacritics or sounds that need to be represented with a different "base" sound plus some advanced, raised, etc. diacritic.

An example is the case of "bˢ" above. CLTS (the library) has no problem understanding it as a "with-sibilant-release voiced bilabial stop consonant", and that is the grapheme it returns for such a description. What happens is that the parser does not understand something like "bz" or "bs", so we need to provide that exact grapheme in the mapping. As @LinguList mentioned, the "BIPA" column is left untouched by the mapper (i.e., "clts map") when provided, but the mapper does not check if that expert-provided grapheme is parsed as intended or if it is the canonical form of the grapheme in CLTS. This is partly due to the mapper using a different algorithm for interpreting the grapheme (it is not BIPA, as it happens in a, so to speak, "pre-bipa stage"), and it is up to us (in this case, me, refining these mappings) to take care of that.

While the lists seem long, there is not so much work left for finalizing them. We can get it sorted out soon, so that we can finally regenerate the data for the inventory study.

LinguList commented 3 years ago

As to the mapper, one important thing to mention here is that everything that is mapped, be it automatically written into the BIPA column or assigned there by experts, is checked to be compatible with BIPA/CLTS. That, yes, is checked.

tresoldi commented 3 years ago

I will wait to merge, I am incorporating the last updates that @cormacanderson has sent me. Will ask for review again later.

tresoldi commented 3 years ago

My latest commit https://github.com/cldf-clts/clts/pull/73/commits/c589a6aab394a12ad6b714e63c506909dd9bf28c adds entries to consonants.tsv and now all entries manually corrected by @cormacanderson for Phoible, JIPA, and LAPSyD are accepted (down from the 129 for phoible, 5 for jipa, and 19 for lapsyd of the previous attempt).

All entries that are not accepted are explicitly marked with <NA>, and those are almost entirely stuff we are not going to add, such as triphthongs and creaky nasal tones.

I am asking for a new review so I can merge.

LinguList commented 3 years ago

Looks fine to me. Once we have this one merged, @tresoldi, I'd suggest making another update where we add the symbols column that is now routinely added with pyclts (the one-by-one listing of symbols in a grapheme, as discussed via email). I can start by providing one example. Will do so once you have merged this one.
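As a rough idea of what a one-by-one listing of symbols could contain, here is a sketch using only the Python standard library. The exact format of the symbols column in pyclts is not given in this thread, so the `U+XXXX NAME` layout and the function name `symbols` are assumptions.

```python
import unicodedata


def symbols(grapheme):
    """Return one 'U+XXXX NAME' entry per code point in the grapheme."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in grapheme
    ]


listing = symbols("t\u02b0")  # the grapheme 'tʰ'
```

Listing code points this way makes visually similar graphemes, or the same diacritics in a different order, easy to tell apart.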

tresoldi commented 3 years ago

@cormacanderson maybe you missed my message? You are commenting on the files on the left side, which are the old versions, the current files are the ones listed on the right.

tresoldi commented 3 years ago

GitHub interface can be confusing, I know, sorry for that...

cormacanderson commented 3 years ago

Sorry, I missed the messages folks, just seeing them now. I figured it out myself after a little while anyways, when I got down to the green lines without red counterparts. I have deleted the messages that weren't relevant. The issues above are however relevant.

tresoldi commented 3 years ago

I will fix /nm/ in LAPSyD and merge, then, so that we can proceed.

As for the order of the diacritics, the mappings are giving the "canonical" representation by pyclts. It is no problem to keep them as they are -- once/if the order is changed in pyclts, we can just run it over all graphemes and correct the order.
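Running a reordering over all graphemes might be sketched as below. The order in `ORDER` is purely hypothetical (the real order is defined inside pyclts and may differ), and the function only handles a trailing run of spacing modifier letters, not combining diacritics.

```python
# Hypothetical target order for trailing modifier letters: ʷ ʲ ˠ ˤ ʰ.
# The real order is defined inside pyclts and may differ.
ORDER = ["\u02b7", "\u02b2", "\u02e0", "\u02e4", "\u02b0"]
RANK = {symbol: index for index, symbol in enumerate(ORDER)}


def reorder_modifiers(grapheme):
    """Sort the trailing run of known modifier letters according to ORDER."""
    cut = len(grapheme)
    # Walk back from the end while the symbols are known modifiers.
    while cut > 0 and grapheme[cut - 1] in RANK:
        cut -= 1
    base, tail = grapheme[:cut], grapheme[cut:]
    return base + "".join(sorted(tail, key=RANK.__getitem__))


# 'pʰʲ', 'pʷʰˠ', and a grapheme without modifiers.
fixed = [reorder_modifiers(g) for g in ["p\u02b0\u02b2", "p\u02b7\u02b0\u02e0", "t"]]
```

Because the base symbol is left alone and only the trailing modifiers are sorted, applying the function twice gives the same result as applying it once.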

tresoldi commented 3 years ago

@LinguList I have corrected the issue in lapsyd, regenerated all three packages, and merged. Feel free to proceed.

I will be working on the other datasets meanwhile, in alphabetical order.

tresoldi commented 3 years ago

Thank you, @cormacanderson ! We raised the mapping a lot!