Closed cormacanderson closed 5 years ago
These updates should all be done by the middle of next week, so that we have some more time to actually check other things.
The ligature tie, btw, is regularly deleted by clts:
In [9]: from pyclts.transcriptionsystem import *
In [11]: bipa = TranscriptionSystem('bipa')
In [12]: bipa['ʐ͡ɣ']
Out[12]: UnknownSound(ts=<pyclts.transcriptionsystem.TranscriptionSystem object at 0x7f38c733c7f0>, grapheme='ʐɣ', source='ʐ͡ɣ', generated=False, note=None)
The reason that sound is NOT accepted is that we restrict double-articulation to a certain number of sounds.
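To make the difference between `source` and `grapheme` in the output above visible on the character level, here is a minimal sketch using only the Python standard library. It is not the code CLTS runs; `strip_ties` is a hypothetical helper that simply removes the two Unicode tie bars (U+0361 above, U+035C below).

```python
# Hypothetical helper, not part of pyclts: drop tie bars from a grapheme.
TIES = {"\u0361", "\u035c"}  # tie bar above, tie bar below


def strip_ties(grapheme):
    """Return the grapheme with all tie bars removed."""
    return "".join(ch for ch in grapheme if ch not in TIES)


source = "\u0290\u0361\u0263"  # z with retroflex hook + tie bar + gamma
print(strip_ties(source))
print(len(source), len(strip_ties(source)))  # the stripped form is one code point shorter
```

Comparing the two lengths is a quick way to tell whether a tie bar was present in the input at all.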
Re the ligature tie, in this case CLTS did not delete it, but rather threw up the entire combination as an error, so the problem is actually a question of whether to add the grapheme. Still editing here.
The syllabicity is a problem of using the wrong diacritic:
In [14]: c = bipa['ŋ̣']
In [15]: c.uname
Out[15]: 'LATIN SMALL LETTER ENG / COMBINING DOT BELOW'
In [23]: c2 = bipa['syllabic voiced alveolar nasal consonant']
In [24]: c2.s
Out[24]: 'n̩'
In [25]: c2.uname
Out[25]: 'LATIN SMALL LETTER N / COMBINING VERTICAL LINE BELOW'
So you are using the wrong character here. We can add it to our list of aliases, but I would not like to do that, as the dot is used to mean other things.
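If you do not have clts installed, the same check can be approximated with the standard library alone. The following is a sketch, not the pyclts implementation: `uname` here is a stand-alone function that I assume mirrors the `/`-joined format of the output shown above.

```python
import unicodedata


def uname(grapheme):
    """Join the Unicode names of all code points in a grapheme with ' / '."""
    return " / ".join(unicodedata.name(ch, "UNKNOWN") for ch in grapheme)


print(uname("\u014b\u0323"))  # eng + dot below
print(uname("n\u0329"))       # n + vertical line below
```

The two diacritics can look very similar in some fonts, so comparing the names is more reliable than comparing the rendered glyphs.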
As I said, and as you can see from my code example above, this is what CLTS does: we restrict double articulation to stops and nasals and don't allow all combinations, for what are in principle good reasons.
Actually, and this surprised me, the dot is IPA standard for syllabicity. If we follow the IPA strictly, it is the combining vertical line below that should be the alias.
So, @tresoldi, as you'll look into this: please have a look at the way in which I just evaluated the problems, and decide afterwards in your PR. We may have to add new features, I can point to where to do so in the code, but we also need to check with the input, and see if it's an actual CLTS error, or why clts does not accept a certain sound, etc.
And @cormacanderson, following the installation instructions, you may want to test my code examples yourself, as they will give you a deeper impression of how these things are handled.
I'm checking with @cormacanderson , we'll decide later on a per-item basis (for example, the affricate thing might actually just be an error when ɡɣ is intended).
@LinguList we are still adding things here. I haven't even finished the description of the issue. This involves copy and pasting for various sources. Once I have done the copy and paste, I will edit.
I am not sure. Check here please: http://www.lfsag.unito.it/ipa/index_en.html
And otherwise, point me to the IPA chart where the dot is intended for this.
Re syllabicity: The dot used to be official IPA (as of 1996, see https://linguistlist.org/unicode/ipa.html ) but was replaced by the vertical line in subsequent versions (e.g. https://upload.wikimedia.org/wikipedia/commons/8/8e/IPA_chart_2018.pdf ). So having them as aliases would make sense from our perspective, unless the dot is reserved for some other purpose.
So if you start classifying the problems into these categories, e.g., as alias etc., this would already be very helpful.
@LinguList please be patient. As I said above, I am still setting up this issue. First I copy and paste, then I start sorting it out.
@LuPaschen okay I thought you said it was current IPA. Good to know. In that case, given the number of different ways in which it is used in the literature, e.g. for emphatics or retroflexes, I don't think it wise to set up the dot as an alias of the combining vertical line below. I will add this to the relevant SndCmp issue instead and it can be dealt with through an orthography profile or just as a search and replace.
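As a sketch of what such a search and replace could look like in Python (standard library only): the function name and the set of base letters are assumptions made for illustration, and the set would have to be adapted per source, since the dot marks emphatics or retroflexes elsewhere.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
SYLLABIC = "\u0329"   # COMBINING VERTICAL LINE BELOW

# Assumption: in the source at hand, the dot marks syllabicity on these letters only.
SYLLABIC_BASES = set("mn\u014blr")


def dot_to_syllabic(text):
    """Replace a dot below with the syllabicity mark after the listed base letters."""
    # Decompose first so precomposed letters (e.g. U+1E47, n with dot below) are covered.
    # Note that the result stays in decomposed (NFD) form.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for i, ch in enumerate(decomposed):
        if ch == DOT_BELOW and i > 0 and decomposed[i - 1] in SYLLABIC_BASES:
            out.append(SYLLABIC)
        else:
            out.append(ch)
    return "".join(out)


print(dot_to_syllabic("\u014b\u0323"))  # eng: dot is replaced
print(dot_to_syllabic("t\u0323"))       # t: dot is left alone
```

Restricting the replacement to a list of base letters keeps dots on other consonants untouched, which is the reason for not making the dot a general alias.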
I'm not finding ʐ͡ɣ there, is it the right link?
My guess of ɡɣ (with no bar) is due both to articulatory problems (moving from palatal to velar and from sibilant to non-sibilant in a single affricate seems too much, especially for distinctiveness from a pure palatal affricate) and to the fact that it is attested in some Germanic dialects (like good in English Cockney, as per wiki). ʐ͡ɣ has almost no Google hits (even without the bar, i.e. "ʐɣ"), almost none related to phonology besides some uncited papers, and apparently is not used in any other language (not even Phoible has it). An additional point here is that a sibilant velar fricative, which would be expected in the case of such an affricate, is considered impossible in the IPA.
Given that it is not in the IPA and that there is no scholarly reference on the segment itself, I would reject the inclusion of this segment and keep the suggestion to check whether the intended sound isn't actually /ɡɣ/ (unless some supporting material is provided, of course).
I have added ʐ͡ɣ to https://github.com/lingdb/Sound-Comparisons/issues/484
@LuPaschen I can't add you to that issue for some reason. Can you assign yourself?
I just figured it was important to inform you that a structured investigation is needed, and I thought it was nice to share some code examples. I'll ignore all of your discussions now and check with @tresoldi later, when you have your systematized proposals.
@cormacanderson I remember now that I did say that the dot was current IPA -- sorry for the confusion, the correct story is in my previous comment. I cannot assign myself to the #484 but I am watching so I get notifications.
@LuPaschen okay no problem. Your name doesn't pop up when I type @ either. Maybe you're not part of the lingdb consortium or the cldf one?
@LinguList finished editing the issue now, having added everything. I think it's systematic now but comments of course welcome. I will look through some of the other transcription datasets and add to this, but I think that already these changes will substantially improve our coverage. @tresoldi and I will meet again in the first half of next week to advance this.
2 things to add here:
Not sure if 2. is due to CLTS or if it has sth to do with the script, but it needs checking.
@LuPaschen thanks for the comments, although better in future to add such suggestions to https://github.com/cldf/clts/issues/120 (editing the description, with checkboxes, ideally). I will move them in here if necessary afterwards. Responding specifically:
@LuPaschen you know we have a web-interface that lists more or less the last version of clts? We can also update (in fact, we will soon): http://calc.digling.org/clts
The idea is that you manually segment your sounds and paste them to the field, so you see what clts makes out of it.
If it does not work, this means we have not found that sound segment in any existing database so far, so even if you'd say it is missing, it is potentially extremely rare (but we find also that with now 8000!!! different segments, it is not surprising that there are more things we cannot cover for now).
More importantly also: clicking on the sound will give you the unicode character, so this is a convenient way to check -- even if a sound is not yet accepted in clts -- how it is composed.
Centralisation is potentially only defined for vowels: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv
You can easily search here, mentioning a "feature", and you'll see that "centralized" is accepted, but only for vowels.
@LinguList Thanks for the link to the web-interface, this is really helpful. I checked the centralized and mid-centralized vowels from the examples above ("ä", "ɛ̽") and they are indeed not recognized by clts, which is unexpected given that they are listed under the diacritics in BIPA.
Ah, in fact, what I said is important: this is just what we HAVE in all datasets we added so far, as CLTS involves a rather complex algorithm that wouldn't run easily on a server. So this clts-app is just showing you what IS there, as reflected in the datasets, while CLTS itself may be able to generate more. You can check this in an interactive Python session, if you have clts installed:
>>> from pyclts.transcriptionsystem import TranscriptionSystem as TS
>>> bipa = TS('bipa')
>>> bipa['ä'].name
'unrounded open centralized vowel'
So this is accepted by CLTS, but it does apparently not occur in the data we considered to load that database.
Ok, thanks for the clarification. Since the disappearing diacritics are not due to CLTS then, I'll add this to tresoldi/soundcomparisons/issues .
@LuPaschen, let's be clear about "disappearing diacritics": by this you mean what the automatic segmentation produces with the question marks, right? This happens because you have not specified the sounds in the orthography profile. Do you have Python running on your computer, and could you start an interactive Python session? If so, I can give you instructions on how to test all of this quickly on your computer, so that it will be easier for you to understand where your orthography profile fails and where clts is missing something...
@LuPaschen, just confirmed that: the orthography profile is missing this. So it is on your side, not even on the side of the code that @bibiko writes there.
Who wrote that profile? We didn't.
https://github.com/tresoldi/soundcomparisons/blob/master/profile.tsv
this is an automatically generated profile, and your task is to update it if it fails, since the segmentation is a process independent of evaluation by clts (obviously). This also explains why you do not get the pre-aspirated stuff.
So the workflow is:
So just to make sure I understand this correctly. In the profile, row 19, it says
ä a
Replacing this with
ä ä
should get rid of the problem of the disappearing trema for this particular vowel, and in order to get rid of the problem as a whole, we would have to search and replace all wrong mappings of this kind?
This does not answer the question where the mapping V+trema -> V without trema comes from in the first place. It seems awfully random, with all the other diacritics being preserved faithfully.
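Finding all mappings of this kind does not have to be done by eye. Below is a sketch that lists profile rows whose target value has lost a combining diacritic. The column names `Grapheme` and `IPA` and the three sample rows are assumptions for illustration, not copied from the actual profile.tsv.

```python
import csv
import io
import unicodedata


def dropped_marks(grapheme, ipa):
    """Combining marks present in the grapheme but missing from the IPA value."""
    src = unicodedata.normalize("NFD", grapheme)
    tgt = unicodedata.normalize("NFD", ipa)
    return [ch for ch in src if unicodedata.combining(ch) and ch not in tgt]


def suspicious_rows(profile_text):
    """Yield (grapheme, ipa, names of dropped marks) for rows that lose a diacritic."""
    reader = csv.DictReader(
        io.StringIO(profile_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        marks = dropped_marks(row["Grapheme"], row["IPA"])
        if marks:
            yield row["Grapheme"], row["IPA"], [unicodedata.name(m) for m in marks]


# Invented sample: a with diaeresis mapped to plain a, two faithful rows.
sample = "Grapheme\tIPA\n\u00e4\ta\n\u025b\u033d\t\u025b\u033d\nn\tn\n"
for hit in suspicious_rows(sample):
    print(hit)
```

A hit is only a candidate for review: a profile may drop a diacritic on purpose when the orthography uses it for something that is not part of the intended transcription.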
@LuPaschen, the profile was created by somebody at some point, and I do not know who created it, so don't look at the history of the profile, but just take the profile as the main place where you need to make improvements. This is the computer-assisted workflow we are talking about: we do the first steps, the expert does the fine-tuning. It seems that in the course of past discussions, nobody has actually explained what the profiles are made for: they are made for you guys to adjust the automatic results. If you open the file carefully in LibreOffice (with tab as separator, no quotes), you can edit it, save it, rerun the analysis, and you'll see that you have fewer question marks. In fact, all cases with question marks should first be checked by you; they have nothing to do with clts, only with the orthography profile.
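To illustrate why a missing profile entry ends up as a question mark, here is a toy sketch of profile-based segmentation. It is not the code of the actual segmentation script: `segment` is a hypothetical function doing greedy longest-match lookup, and the small profile dictionary is invented.

```python
def segment(word, profile, missing="?"):
    """Greedy longest-match segmentation; unmatched characters become `missing`."""
    longest = max((len(g) for g in profile), default=1)
    out = []
    i = 0
    while i < len(word):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                out.append(profile[chunk])
                i += size
                break
        else:
            # No profile entry covers this character.
            out.append(missing)
            i += 1
    return out


# Invented profile: no entry for a with a combining diaeresis (U+0308).
profile = {"t": "t", "a": "a", "n\u0329": "n\u0329"}
print(segment("tan\u0329", profile))
print(segment("ta\u0308", profile))
```

In the second call the diaeresis has no entry, so it comes out as a question mark; adding a row for the vowel plus diaeresis to the profile is what removes it.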
@LinguList, I'm also now curious about the origin of that profile, because it might help us to diagnose the problem we have with the centralisation diacritics, which are specified here https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv, but which don't seem to be recognised in the datasets I am checking, e.g. lines 120 and 122 here https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/multimedia.tsv. Have you any idea what might be going on here?
see my update on the links. The feature was ill-defined: it is "centralization", not centrality.
In [5]: bipa['ë'].s
Out[5]: 'ë'
In [6]: bipa['ë'].name
Out[6]: 'centralized unrounded close-mid front vowel'
I'm closing this issue, we can go back to it once CLTS is released (anytime soon) and then list all the issues individually.
Diacritics to add:
Diacritic aliases:
Consonants to add:
Consonant aliases:
Consonant feature names to change:
To check:
roundedness and rounding are listed as features in the bipa diacritics
To discuss:
To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/bdpa.tsv:
To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/diachronica.tsv:
To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/eurasian.tsv:
To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/lapsyd.tsv:
To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/multimedia.tsv:
To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/nidaba.tsv:
To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/pbase.tsv:
To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/phoible.tsv:
To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/powoco.tsv:
All okay: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/apics.tsv https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/beijingdaxue.tsv https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/chomsky.tsv
To check: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/panphon.tsv (can't download this .tsv)
Consider abandoning: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/ruhlen.tsv