Closed tresoldi closed 4 years ago
First the PHOIBLE replacements.
OK
Raw grapheme ð̺̞
mapping was replaced from ɹ̪̺
to ð̞̺
.
Raw grapheme d̙ˤ
mapping was replaced from d̙
to d̙ˤ
.
Raw grapheme xʀ̥
mapping was replaced from <NA>
to xʀ̥
.
Raw grapheme ᴅ̪̰
mapping was replaced from <NA>
to ɾ̪̰
.
Raw grapheme ᴅ
mapping was replaced from ɾ
to ɾ̪
.
OK frictionalised diacritic:
Raw grapheme r̪͓|r͓
mapping was replaced from <NA>
to r͓
.
Raw grapheme u͓
mapping was replaced from ʮ
to u͓
.
Raw grapheme ʟ͓̥
mapping was replaced from ʟ̥͓
to ʟ͓̥
.
Raw grapheme ɯ͓
mapping was replaced from ɿ
to ɯ͓
.
Raw grapheme kǁ͓ʰ
mapping was replaced from ǁʰ͓
to ǁ͓ʰ
.
OK with acceptable loss of information:
Raw grapheme ɹ̪̩ˠ
mapping was replaced from ɹ̪̩ˠ
to ɹ̪ˠ
.
Raw grapheme ɾ̪̊
mapping was replaced from ɾ̪̥
to ɾ̪̊
.
Raw grapheme ɾ̪̊ʰ
mapping was replaced from ɾ̪̥ʰ
to ɾ̪̊ʰ
.
OK but possibly diagnostic of problems (where do ⁿdɹ̠˔ʷ
and ɹ̪
come from in the first place?):
Raw grapheme n̠d̠ʒʷ
mapping was replaced from ⁿdɹ̠˔ʷ
to ⁿdʒʷ
.
Raw grapheme ð̞
mapping was replaced from ɹ̪
to ð̞
.
NOT OK and probably diagnostic of problems:
Raw grapheme ⁿt̪s̪ʰ
mapping was replaced from ⁿts̪ʰ
to ⁿtθʰ
.
Raw grapheme n̪t̪s̪ʰ
mapping was replaced from ⁿts̪ʰ
to ⁿtθʰ
.
OK diacritic order (according to https://github.com/cldf-clts/clts/issues/54):
Raw grapheme pʲʰ
mapping was replaced from pʰʲ
to pʲʰ
.
Raw grapheme t̪ʲʰ
mapping was replaced from t̪ʰʲ
to t̪ʲʰ
.
NOT OK diacritic order (according to https://github.com/cldf-clts/clts/issues/54):
Raw grapheme pʷˠʰ
mapping was replaced from pʷˠʰ
to pʷʰˠ
.
Raw grapheme pˠʰ
mapping was replaced from pˠʰ
to pʰˠ
.
Raw grapheme pˤʰ
mapping was replaced from pˤʰ
to pʰˤ
.
Raw grapheme qʷˤʰ
mapping was replaced from qʷˤʰ
to qʷʰˤ
.
Raw grapheme qˤʰ
mapping was replaced from qˤʰ
to qʰˤ
.
Raw grapheme tsˤʰ
mapping was replaced from tsˤʰ
to tsʰˤ
.
Raw grapheme t̠ʃˤʰ
mapping was replaced from tʃˤʰ
to tʃʰˤ
.
Raw grapheme t̪ˠʰ
mapping was replaced from t̪ˠʰ
to t̪ʰˠ
.
Raw grapheme t̪ˠʰ|tˠʰ
mapping was replaced from <NA>
to tʰˠ
.
Raw grapheme t̪ˤʰ
mapping was replaced from t̪ˤʰ
to t̪ʰˤ
.
Raw grapheme ŋ̥ǃˠˀ
mapping was replaced from ŋǃˠˀ
to ŋǃˀˠ
.
ISSUE? Poor readability of diacritics (non-release diacritics on length):
Raw grapheme p͉ʲ
mapping was replaced from p̚ʲ
to pʲ̚
.
Raw grapheme t͉ʲ
mapping was replaced from t̚ʲ
to tʲ̚
.
Raw grapheme k͉ʷ
mapping was replaced from k̚ʷ
to kʷ̚
.
Raw grapheme ä̠ː
mapping was replaced from <NA>
to aː̈
.
Raw grapheme äː
mapping was replaced from äː
to aː̈
.
Raw grapheme a̟ː
mapping was replaced from a̟ː
to aː̟
.
ISSUE? Poor readability of diacritics (and maybe here the stop should have the devoicing diacritic?):
Raw grapheme d̥ʒ̥
mapping was replaced from dʒ̊
to dʒ̥
.
Raw grapheme d̥ʒ̊
mapping was replaced from dʒ̊
to dʒ̥
.
Now the PHOIBLE-not-BIPA list; first, the graphemes that are correctly flagged as PHOIBLE but not BIPA.
TRIPHTHONGS
For grapheme aʊɪ
, BIPA aʊɪ
is not supported.
For grapheme eəɪ
, BIPA eəɪ
is not supported.
For grapheme iau
, BIPA iau
is not supported.
For grapheme iäu̽
, BIPA iäu̽
is not supported.
For grapheme iou
, BIPA iou
is not supported.
For grapheme iɑu
, BIPA iɑu
is not supported.
For grapheme iəɪ
, BIPA iəɪ
is not supported.
For grapheme iɛi̯
, BIPA iɛi̯
is not supported.
For grapheme i̯ai
, BIPA i̯ai
is not supported.
For grapheme i̯ai̯
, BIPA i̯ai̯
is not supported.
For grapheme i̯au
, BIPA i̯au
is not supported.
For grapheme i̯au̯
, BIPA i̯au̯
is not supported.
For grapheme i̯ei̯
, BIPA i̯ei̯
is not supported.
For grapheme i̯eu̯
, BIPA i̯eu̯
is not supported.
For grapheme i̯oi
, BIPA i̯oi
is not supported.
For grapheme i̯uaː
, BIPA i̯uaː
is not supported.
For grapheme i̯ui
, BIPA i̯ui
is not supported.
For grapheme i̯uo
, BIPA i̯uo
is not supported.
For grapheme i̯ũo
, BIPA i̯ũo
is not supported.
For grapheme i̯æi
, BIPA i̯æi
is not supported.
For grapheme i̯ɛi
, BIPA i̯ɛi
is not supported.
For grapheme i̯ɛi̯
, BIPA i̯ɛi̯
is not supported.
For grapheme oəɪ
, BIPA oəɪ
is not supported.
For grapheme uai
, BIPA uai
is not supported.
For grapheme uei
, BIPA uei
is not supported.
For grapheme uei̯
, BIPA uei̯
is not supported.
For grapheme uie
, BIPA uie
is not supported.
For grapheme uiɛ
, BIPA uiɛ
is not supported.
For grapheme uɑi
, BIPA uɑi
is not supported.
For grapheme uəi
, BIPA uəi
is not supported.
For grapheme u̯ai
, BIPA u̯ai
is not supported.
For grapheme u̯ai̯
, BIPA u̯ai̯
is not supported.
For grapheme u̯ei̯
, BIPA u̯ei̯
is not supported.
For grapheme u̯eu̯
, BIPA u̯eu̯
is not supported.
For grapheme u̯æi
, BIPA u̯æi
is not supported.
For grapheme u̯ɛi
, BIPA u̯ɛi
is not supported.
For grapheme u̯ɛi̯
, BIPA u̯ɛi̯
is not supported.
For grapheme yia
, BIPA yia
is not supported.
For grapheme əɪa
, BIPA əɪa
is not supported.
For grapheme əʊɪ
, BIPA əʊɪ
is not supported.
For grapheme ʊ̯aɪ̯
, BIPA ʊ̯aɪ̯
is not supported.
ILLICIT "CLUSTERS"
For grapheme d̪l̪
, BIPA d̪l̪
is not supported.
For grapheme st
, BIPA st
is not supported.
For grapheme s̙ˤ
, BIPA s̙ˤ
is not supported.
For grapheme s̻θ
, BIPA s͇θ
is not supported.
For grapheme t̪ʙ
, BIPA t̪ʙ
is not supported.
For grapheme r̠̙
, BIPA r̠̙
is not supported.
For grapheme fʃ
, BIPA fʃ
is not supported.
For grapheme kf
, BIPA kf
is not supported.
For grapheme kl
, BIPA kl
is not supported.
For grapheme ld
, BIPA ld
is not supported.
For grapheme l̠˞
, BIPA l̠˞
is not supported.
For grapheme m̥m
, BIPA m̥m
is not supported.
For grapheme n̥n
, BIPA n̥n
is not supported.
For grapheme ŋ̊ŋ
, BIPA ŋ̊ŋ
is not supported.
For grapheme xh
, BIPA xh
is not supported.
For grapheme xk
, BIPA xk
is not supported.
For grapheme ɬl
, BIPA ɬl
is not supported.
For grapheme ɬʟ͓̥
, BIPA ʟ̝̊ɬ
is not supported.
For grapheme ŋ̥ǂxˀ
, BIPA ŋ̊ǂxˀ
is not supported.
For grapheme ɡv
, BIPA gv
is not supported.
For grapheme ɡ̰ǂx
, BIPA ŋǂx
is not supported.
For grapheme ɡ̰ǃx
, BIPA ŋǃx
is not supported.
For grapheme ɣv
, BIPA ɣv
is not supported.
For grapheme ʃt
, BIPA ʃt
is not supported.
For grapheme ʍw
, BIPA ʍw
is not supported.
For grapheme ɗʒ
, BIPA ɗʒ
is not supported.
For grapheme ʀʁ
, BIPA ʀʁ
is not supported.
With my understanding of the changes we have made, the following should be parsed:
SHOULD BE PARSED
For grapheme kɬ
, BIPA kɬ
is not supported.
SHOULD BE PARSED AS ALIAS ACCORDING TO https://github.com/cldf-clts/clts/issues/61
For grapheme ð͇ˠ
, BIPA ð͇ˠ
is not supported.
SHOULD BE PARSED ACCORDING TO https://github.com/cldf-clts/clts/issues/44
For grapheme ŋ̥m̥
, BIPA ŋ̥m̥
is not supported.
SHOULD BE PARSED ACCORDING TO https://github.com/cldf-clts/clts/issues/62
For grapheme bz
, BIPA bz
is not supported.
For grapheme bzʷ
, BIPA bzʷ
is not supported.
For grapheme b̤z̤
, BIPA bzʱ
is not supported.
For grapheme mbz
, BIPA ⁿbz
is not supported.
For grapheme ps
, BIPA ps
is not supported.
For grapheme psʰ
, BIPA psʰ
is not supported.
For grapheme psʷ
, BIPA psʷ
is not supported.
For grapheme psʷʰ
, BIPA psʷʰ
is not supported.
For grapheme pʃʰ
, BIPA pʃʰ
is not supported.
For grapheme pʃʼ
, BIPA pʃʼ
is not supported.
SHOULD BE PARSED ACCORDING TO https://github.com/cldf-clts/clts/issues/45
For grapheme dr
, BIPA dr
is not supported.
For grapheme dɾ
, BIPA dɾ
is not supported.
For grapheme ndr
, BIPA ⁿdr
is not supported.
For grapheme ndɾ
, BIPA ⁿdɾ
is not supported.
For grapheme n̠t̠ʃɾ
, BIPA ⁿtʃɾ̥
is not supported.
For grapheme tr
, BIPA tr
is not supported.
For grapheme tɾ
, BIPA tɾ
is not supported.
For grapheme ɖr
, BIPA ɖɽ
is not supported.
For grapheme ɖr̠
, BIPA ɖr̠
is not supported.
For grapheme ɖr̠͓
, BIPA ɖr̠͓
is not supported.
For grapheme ɖɽ
, BIPA ɖɽ
is not supported.
For grapheme ɳɖr
, BIPA ⁿɖɽ
is not supported.
For grapheme ɳɖr̠
, BIPA ⁿɖɽ
is not supported.
For grapheme ɳɖɽ
, BIPA ⁿɖɽ
is not supported.
For grapheme ɳʈr̠̥
, BIPA ⁿʈr̠̥
is not supported.
For grapheme ⁿɖɽ
, BIPA ⁿɖɽ
is not supported.
For grapheme ⁿʈɽʰ
, BIPA ⁿʈɽʰ
is not supported.
For grapheme ʈr
, BIPA ʈɽ
is not supported.
For grapheme ʈr̠̥
, BIPA ʈr̠̥
is not supported.
For grapheme ʈɹ̠̥
, BIPA ʈɹ̠̥
is not supported.
For grapheme ʈɽ
, BIPA ʈɽ
is not supported.
For grapheme ʈɽʰ
, BIPA ʈɽʰ
is not supported.
Specific mappings:
UNRECOGNISED AFFRICATES (SHOULD BE PARSED?)
For grapheme d̠ʒː
, BIPA dːʒ
is not supported.
For grapheme d̠ːʒ
, BIPA dːʒ
is not supported.
For grapheme d̪ʒ
, BIPA d̪ʒ
is not supported.
BIPA | GRAPHEME |
---|---|
d̠ʒː | d̠ʒː |
d̠ʒː | d̠ːʒ |
dʒ | d̪ʒ |
A SPECIFIC MAPPING
For grapheme ˀy
, BIPA ˀy
is not supported.
BIPA | GRAPHEME |
---|---|
ˀj | ˀy |
There is a failure to recognise the laminal diacritic. Should I put up an issue?
For grapheme d̪ð̪
, BIPA dð̪
is not supported.
For grapheme t̪θ̪
, BIPA tθ̪
is not supported.
For grapheme ð̪̺
, BIPA ð̪̺
is not supported.
For grapheme ð̪̙ˤ
, BIPA ð̙ˤ
is not supported. (For this one, the RTR diacritic should be supported anyway.)
For grapheme ɫ̪
, BIPA ɫ̪
is not supported.
For grapheme ɹ̪̹̩
, BIPA ɹ̪̹̩
is not supported.
For grapheme ˀt̪ɬ
, BIPA ˀt̪ɬ
is not supported.
The following are dealt with according to https://github.com/cldf-clts/clts/issues/51 and https://github.com/cldf-clts/clts/issues/61. I have provided specific mappings below.
For grapheme s͇
, BIPA s͇
is not supported.
For grapheme z͇
, BIPA z͇
is not supported.
For grapheme z̪͇|z͇
, BIPA z͇
is not supported.
For grapheme ts͇
, BIPA ts͇
is not supported.
For grapheme t͇s͇
, BIPA t͇s͇
is not supported.
For grapheme ts͇ʰ
, BIPA ts͇ʰ
is not supported.
For grapheme t͇s͇ʰ
, BIPA t͇s͇ʰ
is not supported.
For grapheme ⁿt͇s͇ʰ
, BIPA ⁿt͇s͇ʰ
is not supported.
For grapheme d͇z͇
, BIPA dz͇
is not supported.
For grapheme ⁿd͇z͇
, BIPA ⁿd͇z͇
is not supported.
For grapheme ʃ͇
, BIPA ʃ͇
is not supported.
For grapheme ʒ͇
, BIPA ʒ͇
is not supported.
For grapheme ʂ͇
, BIPA ʂ͇
is not supported.
For grapheme ʐ͇
, BIPA ʐ͇
is not supported.
For grapheme ʈʂ͇
, BIPA ʈʂ͇
is not supported.
BIPA | GRAPHEME |
---|---|
θ̠ | s͇ |
ð̠ | z͇ |
ð̠ | z̪͇\|z͇ |
tθ̠ | ts͇ |
tθ̠ | t͇s͇ |
tθ̠ʰ | ts͇ʰ |
tθ̠ʰ | t͇s͇ʰ |
ⁿtθ̠ʰ | ⁿt͇s͇ʰ |
dð̠ | d͇z͇ |
ⁿdð̠ | ⁿd͇z͇ |
ɹ̠̊˔ | ʃ͇ |
ɹ̠˔ | ʒ͇ |
ɻ̝̊ | ʂ͇ |
ɻ̝ | ʐ͇ |
ʈɻ̝̊ | ʈʂ͇ |
I think at least some of those might be cases of valid sounds (i.e., accepted by the model) which are not listed among vowels/consonants. I will check each one of them.
@cormacanderson, I really do not understand what you mean here. All of these cases are now mapped; I mapped them manually, e.g. dr, which is represented as d + superscript r, and bz, which is b + superscript s.
E.g., for your bz
cases, you find them all here.
BIPA | GRAPHEME |
---|---|
bˢ | bz |
bˢʷ | bzʷ |
ⁿbˢ | mbz |
So this is all fine for me and can be merged now.
Or am I missing something here? I thought we had all these covered now by explicit mappings.
Yes, but these are all in the phoible.not_bipa.tsv above, where it says " For grapheme bz, BIPA bz is not supported.". Also they are not in the replacements file. So I flag them here.
The lists mean that the graphemes, as in the previously suggested lists, are not parsed (i.e. clts.bipa[grapheme] fails). Either we add them as aliases, or we replace the mapping. I will go through them.
Ah, now I understand. We replace the mapping then. More efficient here, it seems to me, would be for me to go through the phoible.tsv file I provided earlier and make the mappings again. Many of them I made before we had decided on certain issues, such as this one.
Note that not all of the issues flagged above are problems of this nature. Still problems, for example, are the write order inconsistencies, or the following: OK but possibly diagnostic of problems (where do ⁿdɹ̠˔ʷ and ɹ̪ come from in the first place?): Raw grapheme n̠d̠ʒʷ mapping was replaced from ⁿdɹ̠˔ʷ to ⁿdʒʷ. Raw grapheme ð̞ mapping was replaced from ɹ̪ to ð̞.
NOT OK and probably diagnostic of problems: Raw grapheme ⁿt̪s̪ʰ mapping was replaced from ⁿts̪ʰ to ⁿtθʰ. Raw grapheme n̪t̪s̪ʰ mapping was replaced from ⁿts̪ʰ to ⁿtθʰ.
Guys, you are giving me a hard time here. I repeat once more the mapping procedure:
We will never accept nt in CLTS as a valid sound in bipa, but our extended algorithm for automated mapping of transcription DATA can guess it, and we can make an explicit mapping.
So please, please, please, let us really pay attention to this procedure, and not dream of making the BIPA acceptance rate higher for things that would blow up the system. BIPA should stay as it is for the most part; only individual mappings may be modified at this stage!
And by individual mappings I mean this: you go to the file source/phoible/graphemes.tsv
and correct a supposedly erroneous mapping there, or replace a <NA>
by something else. If you do so, the new suggestion has to be checked a second time to see whether it can be correctly parsed by our BIPA algorithm. If that IS the case, we accept it, and we run clts make_dataset phoible
to copy the file from sources/phoible/graphemes.tsv
into pkg/transcriptiondata/phoible.tsv
. And that is then the authoritative mapping for a given version.
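The second check described above can be sketched roughly as follows. This is a minimal illustration, not the actual clts code: the column names GRAPHEME and BIPA are assumed from the tables in this thread, and `parses` is a hypothetical stand-in for the real BIPA parser.

```python
import csv
import io

NA = "<NA>"


def check_mappings(tsv_text, parses):
    """Split rows of a graphemes.tsv-like table into accepted and rejected.

    `parses` is a hypothetical stand-in for the real BIPA check: a callable
    returning True when a proposed BIPA grapheme is a known sound.
    """
    accepted, rejected = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        bipa = row["BIPA"]
        if bipa == NA:
            continue  # explicitly unmapped, nothing to verify
        target = accepted if parses(bipa) else rejected
        target.append((row["GRAPHEME"], bipa))
    return accepted, rejected


# Toy data: KNOWN is an illustrative stand-in for what BIPA accepts.
KNOWN = {"bˢ", "ⁿbˢ"}
SAMPLE = "GRAPHEME\tBIPA\nbz\tbˢ\nmbz\tⁿbˢ\nst\tst\naʊɪ\t<NA>\n"

accepted, rejected = check_mappings(SAMPLE, lambda g: g in KNOWN)
```

Only rows that end up in `accepted` would be safe to copy into the package file; anything in `rejected` needs either a different mapping or a <NA>.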
@cormacanderson, you NEED to pay more attention to the core of what BIPA does, and stop seeing it as some magic thing. So if you ask yourself about
OK but possibly diagnostic of problems (where do ⁿdɹ̠˔ʷ and ɹ̪ come from in the first place?):
Raw grapheme ð̞ mapping was replaced from ɹ̪ to ð̞.
Please search for the elements in our file consonants.tsv.
In this file, you FIND that you gave me exactly the same definition for both sounds, namely:
GRAPHEME | PHONATION | PLACE | MANNER | ALIAS | EXTRA | NOTE |
---|---|---|---|---|---|---|
ɹ̪ | voiced | dental | approximant | | | |
ð̞ | voiced | dental | approximant | + | | |
If you tell me what the difference between the two is, we can delete the Alias for one, but after you insisted that I include the first one, I had to correct this, since the code tells me there are two sounds with different spellings but the same features (which is why we have the code checking).
These things have a reason, and it is crucial to understand the algorithm to account for them.
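The kind of consistency check mentioned above (two spellings, one feature bundle) might look like the sketch below. It is a simplification under stated assumptions: only three feature columns are compared, whereas the real consonants.tsv has more, and the function name is hypothetical.

```python
from collections import defaultdict


def find_feature_clashes(rows):
    """Report feature bundles that have more than one non-alias spelling."""
    by_features = defaultdict(list)
    for row in rows:
        if row.get("ALIAS") == "+":
            continue  # an alias may share features with a canonical grapheme
        key = (row["PHONATION"], row["PLACE"], row["MANNER"])
        by_features[key].append(row["GRAPHEME"])
    return {k: v for k, v in by_features.items() if len(v) > 1}


rows = [
    {"GRAPHEME": "ɹ̪", "PHONATION": "voiced", "PLACE": "dental",
     "MANNER": "approximant", "ALIAS": ""},
    {"GRAPHEME": "ð̞", "PHONATION": "voiced", "PLACE": "dental",
     "MANNER": "approximant", "ALIAS": ""},
]

clashes_before = find_feature_clashes(rows)  # both spellings clash

rows[1]["ALIAS"] = "+"  # mark the second spelling as an alias
clashes_after = find_feature_clashes(rows)   # clash resolved
```

Marking one of the two graphemes as an alias is what makes the clash disappear, which is the correction described in the comment.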
If you are not happy with the order: the order is automated, so you need to modify the order again in the code.
It will be more sustainable if we all talk about the files, not about the system as some abstract thingy here. Everything has a reason, the reason is in those files.
First off, I'm fine following the procedure and doing the mapping in source/phoible/graphemes.tsv
. I will download that file then, play around with the mappings and put in a PR then. Does that work?
I don't see BIPA as magic and am trying to get a handle on why certain things happen. That's not always easy though. You're quite right with those dental approximants, and now that I look at it, I suppose fair enough. After all, we are not dealing with rhotic as a feature.
Other cases remain problematic though. Why for example is ⁿt̪s̪ʰ
replaced from ⁿts̪ʰ
to ⁿtθʰ
? After all, in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv
we have
ts̪ | voiceless | dental | affricate | | airstream:sibilant
Why is this replaced by the non-sibilant dental?
A separate issue. At https://lingpy.org/clts/
I find kɬ
, but it's not in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv
. Should I add this manually and put in a PR, or how should I proceed here?
I will post new comments now with problems that are not ones that can simply be resolved by mapping to source/phoible/graphemes.tsv
. Does this work?
@cormacanderson: the WHY is not BIPA but what I did mostly manually, and which @tresoldi confirmed. If you see this as wrong, you should definitely make a PR. @tresoldi will then CHECK that PR, by running through and checking if your suggestions are correct with respect to BIPA and discuss those parts. I did my best, but I may have easily made some errors there, it was a long week, after all...
@LinguList I'm not criticising anything you or anyone else has done. As you pointed out yourself, I made a mistake above too. As you say, it's a long week and we're both human. This is a tricky business and I'm not one to throw stones, just trying to get to the bottom of issues.
I'm just looking for the correct way to proceed that will get this done as well as possible, with a minimum of effort and repeated work. However, I don't necessarily know the best way to proceed here and am asking for guidance.
If I download source/phoible/graphemes.tsv
and https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv
and add the necessary lines to them and then put up a new PR, does that work?
If not, then please advise. If so, do we merge this PR first or not?
@cormacanderson, all fine, I think we are on the same page now. I'd prefer you to start ONLY with the file sources/phoible/graphemes.tsv, and see WHICH cases you think are wrongly mapped there. If you correct those, you can then send the file via email to @tresoldi, who can now do the whole workflow, and who'd check that this is all working correctly. After this has been done, any sounds that are NOT available in BIPA and which would have to be added can be added there, but I'd suggest this as a second step.
BTW: I'll update https://lingpy.org/clts/ now, so you have an idea of the most recent version of bipa, with Aliases, which is probably also useful (!). It also lists which sounds are treated the same in phoible, and you can click on symbols to see their unicode.
Okay. Brief amendment to what you suggest there. I will first list the sounds here that I believe need to be: 1) added at source/phoible/graphemes.tsv, as a list for myself to work through 2) updated in BIPA, e.g. kɬ 3) not covered by us.
Then we have a provisional list (2) for after I amend source/phoible/graphemes.tsv. Seeing as I have already identified 3) here, it makes no sense to repeat this and possibly miss things.
@tresoldi are you good with this workflow?
Yep, for kɬ, you could make an issue, and I'll update and check at the same time? If you provide it in table format, this would be great, as it is easier to add, be it that I add it or that @tresoldi does it.
BTW, @cormacanderson, kɬ IS in clts, and it is listed as Alias for kʟ̥, please check with https://lingpy.org/clts/
Yes, I know it is in CLTS (as I said above "At https://lingpy.org/clts/ I find kɬ, but it's not in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv."). So I should deal with this by changing the mapping in source/phoible/graphemes.tsv to kʟ̥?
Again then, with the PHOIBLE replacements file. We have to check that the following does not recur after I change source/phoible/graphemes.tsv: Raw grapheme ⁿt̪s̪ʰ mapping was replaced from ⁿts̪ʰ to ⁿtθʰ. Raw grapheme n̪t̪s̪ʰ mapping was replaced from ⁿts̪ʰ to ⁿtθʰ.
The diacritic order is still wrong, according to what was agreed in https://github.com/cldf-clts/clts/issues/54. I will reopen this issue.
I have opened https://github.com/cldf-clts/clts/issues/74 to discuss diacritic readability in a number of combinations.
There are four sounds that need to be added from PHOIBLE. I set up the issue here https://github.com/cldf-clts/clts/issues/75.
I don't understand why the trill-release and sibilant-release segments had the correct BIPA form in the .tsv file here but also came up as not supported above? Why is this happening?
I will email you the .tsv file now @tresoldi
LAPSyD and JIPA will follow (and update any issues accordingly).
From the JIPA files there is very little, but I don't understand why we find
Raw grapheme ʔnd
mapping was replaced from ˀⁿd
to ˀd
.
Raw grapheme ʔŋg
mapping was replaced from ˀⁿg
to ˀg
.
These are both in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv.
I wonder, also looking at the LAPSyD graphemes.tsv if the above is based off the latest version of the data? Most problems seem resolved there. There is even less for LAPSyD.
I'll send you these files now @tresoldi.
Okay, that is all three graphemes.tsv files sent off. One reopened issue for the diacritics https://github.com/cldf-clts/clts/issues/54, and one new one for the four new sounds https://github.com/cldf-clts/clts/issues/75.
Sorry if this was not clear: yes, you SHOULD change the mapping in phoible/graphemes.tsv
The JIPA cases you mention, @cormacanderson, are indeed a wrong mapping. Probably human error, as the prenasalization plus glottalization is currently handled as a single feature.
I did so re the mapping in phoible/graphemes.tsv. Re the JIPA sounds, yes, maybe, but these are correct in https://github.com/cldf-clts/clts/blob/master/pkg/transcriptionsystems/bipa/consonants.tsv.
ˀⁿg | voiced | velar | stop | | preceding:pre-glottalized-and-nasalized
As I say, I'm not sure if this is all based off the most recent version of the data, given that there are replacements of things that are in consonants.tsv and that PHOIBLE mappings to sounds that are in consonants.tsv are not being supported. Perhaps we can try again with the new .tsv files and the most recent state of https://github.com/cldf-clts/clts/tree/master/pkg/transcriptionsystems/bipa.
> As I say, I'm not sure if this is all based off the most recent version of the data, given that there are replacements of things that are in consonants.tsv and that PHOIBLE mappings to sounds that are in consonants.tsv are not being supported.
What you also need to understand, @cormacanderson, is that the mapping procedure is ONLY automated if the spot is not already filled in the column BIPA. So if the mapping is not re-done on all data, it will be left as is. It would be too much work to go and manually remap 400-500 sounds each time. However, contrasting a clean run with the manual run is a possibility.
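To make the behaviour described above concrete, here is a small sketch of a mapper that only fills empty BIPA cells, plus a comparison of that "manual" run with a "clean" run where everything is guessed. All names (`fill_missing`, `compare_runs`, `toy_guess`) and the toy guesses are hypothetical; this is not the clts mapper.

```python
NA = "<NA>"


def fill_missing(mapping, guess):
    """Return a copy in which only unmapped graphemes receive a guess."""
    return {g: (guess(g) if b == NA else b) for g, b in mapping.items()}


def compare_runs(manual, clean):
    """Collect graphemes whose manual mapping differs from the clean run."""
    return {g: (manual[g], clean[g]) for g in manual if manual[g] != clean[g]}


def toy_guess(grapheme):
    """Hypothetical stand-in for the automated mapping step."""
    table = {"bz": "bˢ", "ps": "pˢ"}
    return table.get(grapheme, NA)


manual = {"bz": "bz", "ps": NA}            # "bz" was filled in by hand earlier
filled = fill_missing(manual, toy_guess)   # the hand-made "bz" is left alone
clean = {g: toy_guess(g) for g in manual}  # everything guessed from scratch
diff = compare_runs(filled, clean)
```

The entries in `diff` are exactly the stale manual mappings that a partial rerun would never touch, which is why contrasting the two runs is useful.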
Most important part: I will refine the PR with the comments here and the mappings by @cormacanderson , starting tomorrow.
For the discussion, it is important to stress the difference between the internal representations of sounds and the graphemes. And, on that matter, between the graphemes as given by an expert and their normalized versions.
The lists I uploaded indicate graphemes that are not accepted and graphemes that are changed due to normalization. The fact that a grapheme is not accepted does not necessarily mean that the model is failing: for example, while we had the "frication" feature for vowels, vowels with frication were not listed in vowels.tsv
, so that parsing their graphemes was failing (this is now fixed). As for normalization, it explains some of the changes as well, for matters like order of diacritics or sounds that need to be represented with a different "base" sound plus some advanced, raised, etc. diacritic.
An example is the case of "bˢ" above. CLTS (the library) has no problem understanding it as a "with-sibilant-release voiced bilabial stop consonant", and bˢ
is the grapheme it returns for such a description. What happens is that the parser does not understand something like "bz" or "bs", so we need to provide that exact grapheme in the mapping. As @LinguList mentioned, the "BIPA" column is left untouched by the mapper (i.e., "clts map") when provided, but the mapper does not check whether that expert-provided grapheme is parsed as intended or whether it is the canonical form of the grapheme in CLTS. This is partly due to the mapper using a different algorithm for interpreting the grapheme (it is not BIPA, as it happens in a, so to speak, "pre-bipa stage"), and it is up to us (in this case, me, refining these mappings) to take care of that.
While the lists seem long, there is not so much work left for finalizing them. We can get it sorted out soon, so that we can finally regenerate the data for the inventory study.
As to the mapper, one important thing to mention here is that everything that is mapped, be it automatically written into the BIPA column or assigned there by experts, is checked for compatibility with BIPA/CLTS. That, yes, is checked.
I will wait to merge, I am incorporating the last updates that @cormacanderson has sent me. Will ask for review again later.
My latest commit https://github.com/cldf-clts/clts/pull/73/commits/c589a6aab394a12ad6b714e63c506909dd9bf28c adds entries to consonants.tsv
and now all entries manually corrected by @cormacanderson for Phoible, JIPA, and LAPSyD are accepted (down from the 129 for phoible, 5 for jipa, and 19 for lapsyd of the previous attempt).
All entries that are not accepted are explicitly marked with <NA>
, and those are almost entirely stuff we are not going to add, such as triphthongs and creaky nasal tones.
I am asking for a new review so I can merge.
Looks fine to me. Once we have this one merged, @tresoldi, I'd suggest making another update where we add the symbols column that is now routinely added with pyclts (the one-by-one listing of symbols in a grapheme, as discussed via email). I can start by providing one example. Will do so once you have merged this one.
@cormacanderson maybe you missed my message? You are commenting on the files on the left side, which are the old versions, the current files are the ones listed on the right.
GitHub interface can be confusing, I know, sorry for that...
Sorry, I missed the messages folks, just seeing them now. I figured it out myself after a little while anyways, when I got down to the green lines without red counterparts. I have deleted the messages that weren't relevant. The issues above are however relevant.
I will fix /nm/
in LAPSyD and merge, then, so that we can proceed.
As for the order of the diacritics, the mappings are giving the "canonical" representation by pyclts
. It is no problem to keep them as they are -- once/if the order is changed in pyclts
, we can just run it over all graphemes and correct the order.
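Rerunning an order correction over all graphemes could look roughly like this. It is only a sketch: the ranking in `RANK` is a hypothetical order chosen so that secondary articulations precede aspiration, and it is not taken from pyclts or from the decision in issue 54.

```python
# Hypothetical ranking of trailing modifier letters (lower sorts first).
RANK = {"ʷ": 0, "ʲ": 1, "ˠ": 2, "ˤ": 3, "ʰ": 4}


def reorder_trailing(grapheme):
    """Sort the trailing run of ranked modifiers, leaving the rest untouched."""
    i = len(grapheme)
    while i > 0 and grapheme[i - 1] in RANK:
        i -= 1
    tail = sorted(grapheme[i:], key=RANK.__getitem__)
    return grapheme[:i] + "".join(tail)


# Apply the correction to a whole (toy) mapping in one pass.
mapping = {"pʲʰ": "pʰʲ", "qʷˤʰ": "qʷʰˤ", "t": "t"}
corrected = {g: reorder_trailing(b) for g, b in mapping.items()}
```

Because only the trailing run of ranked modifiers is sorted, base characters and any combining marks attached to them are not moved.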
@LinguList I have corrected the issue in lapsyd
, regenerated all three packages, and merged. Feel free to proceed.
I will be working on the other datasets meanwhile, in alphabetical order.
Thank you, @cormacanderson ! We raised the mapping a lot!
This PR incorporates, whenever possible, the previous mapping corrections by @cormacanderson to the phoible, jipa, and lapsyd sources. This is related to his comments here, following the list I had previously prepared here.
Note that, in order to make the diff shorter and review easier, I am only updating the related graphemes.tsv files. After the changes are accepted I will regenerate the corresponding packages. Also note that, because of this, I am making a PR into the sources branch, and not into master.
Not all mappings provided in those files could be used; most are related to things we are not supporting, such as triphthongs, but a handful are graphemes that are not parsed by pyclts as it is. The numbers of invalid graphemes are 129 for phoible, 5 for jipa, and 19 for lapsyd. I am attaching here a detailed list of all such graphemes (phoible.not_bipa.txt, jipa.not_bipa.txt, and lapsyd.not_bipa.txt).
In total, this PR corrects/changes/fixes 38 mappings for phoible, 13 for jipa, and 11 for lapsyd. I am attaching lists of replacements as well: phoible.replaced.txt, jipa.replaced.txt, and lapsyd.replaced.txt.