Updates to CLTS transcriptions

cormacanderson commented 5 years ago

Diacritics to add:

[x] Raising: lowered, e.g. v̞ and raised, e.g. ɹ̝ ɾ̝ type: consonant, feature: raising, value: lowered (U031E https://www.charbase.com/031e-unicode-combining-down-tack-below) type: consonant, feature: raising, value: raised (U031D https://www.charbase.com/031d-unicode-combining-up-tack-below)
[x] Retraction: retracted, e.g. s̠ θ̠ ç̠ and advanced, e.g. k̟
[x] Half length, e.g. nˑ mˑ
[x] syllabicity 1, e.g. ŋ̣ ɫ̣ and syllabicity 2, e.g. r̝̩
[x] laminal, e.g. s̻ (now okay? see https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/eurasian.tsv rows 197ff, but z̻ not recognised)
[x] linguolabial, combining seagull below
[x] alveolar, e.g. θ͇ (U0347)
[x] uvularization, e.g. ʌʶ (questionable)

Diacritic aliases:

[x] U032B combining inverted double arch below should be an alias of U033C combining seagull below
[x] NOT underdot (U0323 combining dot below) as an alias of combining vertical line below, given widespread use for retroflex consonants in S Asia and emphatic consonants in W Asia – decision based on pragmatism not principle
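
The two alias decisions above can be pictured as a small translation table. This is a minimal sketch using only the Python standard library, not the way CLTS itself stores aliases; the names `DIACRITIC_ALIASES` and `normalize_aliases` are hypothetical.

```python
import unicodedata

# Hypothetical alias table: alias code point -> preferred code point.
# U+032B (inverted double arch below) is folded into U+033C (seagull below).
# U+0323 (dot below) is deliberately absent, so it is left untouched.
DIACRITIC_ALIASES = {0x032B: 0x033C}


def normalize_aliases(text):
    """Decompose the text, then replace alias diacritics with the preferred ones."""
    return unicodedata.normalize("NFD", text).translate(DIACRITIC_ALIASES)


# Show the Unicode name of the diacritic after normalization.
print(unicodedata.name(normalize_aliases("s\u032b")[1]))
```

Keeping the dot below out of the table reflects the pragmatic decision in the second item: a dataset that uses the dot for syllabicity would need its own replacement step.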

Consonants to add:

[x] U1D91 ᶑ, voiced retroflex implosive consonant (in Unicode although not explicitly accepted as IPA)
[x] b̪ voiced labio-dental stop consonant (combination of U0062 with diacritic U032A combining bridge below)
[x] p̪ voiceless labio-dental stop consonant (combination of U0070 with U032A combining bridge below)

Consonant aliases:

[x] voiced dental implosive stop, alias of https://clts.clld.org/parameters/voiced_dental_implosive_consonant (should be alveolar?)

Consonant feature names to change:

[x] labialized-velar to labio-velar and labialized-palatal to labio-palatal

To check:

[x] why are centralisation diacritics not being recognised, e.g. https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/diachronica.tsv row 494, https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/multimedia.tsv rows 120 (centralised) and 122 (mid-centralised)
[x] rounding diacritic ɜ̹ (in CLTS but error in SndCmp Germanic report)
[x] both roundedness and rounding are listed as features in the bipa diacritics
[x] all instances of the devoiced velar nasal https://clts.clld.org/parameters/devoiced_voiced_velar_nasal_consonant with diacritics have the combining ring below not above, whereas the basic character has it above (in CLTS)
[x] see also click combinations beginning with an alias of https://clts.clld.org/parameters/devoiced_voiced_velar_nasal_consonant

To discuss:

[x] U0239 https://www.charbase.com/0239-unicode-latin-small-letter-qp-digraph, i.e. a voiceless labio-dental stop onset to the affricate and voiced version of above, i.e. beginning https://www.charbase.com/0238-unicode-latin-small-letter-db-digraph. These are labio-dental affricates, formally distinct from the labial to labio-dental affricates with that name in CLTS as simply pf https://clts.clld.org/parameters/voiceless_labio-dental_affricate_consonant. Theoretically there are three possibilities for voiceless affricates here: bilabial pɸ, bilabial to labiodental pf, and labio-dental p̪f, for which the qp grapheme could be used. There are obviously also voiced versions of these. @afehn have you any comment here?

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/bdpa.tsv:

[x] row 830 o̜̞ˑ not recognised – lowered, less rounded, half long mid rounded vowel
[x] row 1366 ø̈ and 1314 aɔ̈ – what is the problem here?
[x] row 1035 t͡θʰ aspirated non-sibilant affricate – what is the problem here?
[x] row 766 ł U0141 (Polish) add as alias of U026B?
[x] row 856 ı dotless i U0131 (Turkish) add as alias of ɯ U026F?
[x] row 1236 small capital E alias of U025B? also 920, 1174
[x] row 503 triphthong ɛiə, where ɛi and iə are independently recognised, also 529, 687, 713, 789, 818, 819, 848 etc.
[ ] rows 72, 74, 78, 104, 138, 149, 154, 158 etc. consonants with combining breve
[ ] row 181, 195 etc. vowels with combining circumflex – tone?
[ ] row 875 vowel with combining caron – tone?
[x] row 568 ɔʰ, also 805, 1031, 1033, 1201
[x] row 754 w̹ rounded [w]?
[ ] row 809 tʰ̠ with retracted diacritic under superscript h, row 1035 tʰ̄ with macron?
[x] row 961 ə̩ syllabic schwa?
[x] row 981, 984 e.g. ⁰² tone contours starting with zero?
[ ] row 1292 uˡ not sure what this is
[x] row 1392 ɛʲ – alias of diphthong?

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/diachronica.tsv:

[x] row 12, 47 etc. combining seagull below
[x] row 494 ë, why is this not being parsed?
[x] row 77 ḱ, row 189 ɡ́, aliases of palatal stops (source PIE palatal)
[ ] row 446, 447, acute and circumflex accent on vowels, also grave row 488, macron row 493
[x] row 487 eːʲ and row 506 oːʷ as diphthong aliases?
[x] row 383 ŕ, palatal rhotic? contrasting in Proto-Elamo-Dravidian (sic.) with r̀, row 382
[x] row 183 ɡ̌, Proto-Turkic
[ ] row 8, 22, 29, 42 etc. ʼm etc. not clear what this is in source – preglottalised?
[ ] row 220 ś and row 221 š as aliases of palatal fricative and postalveolar fricative?
[ ] row 348 ʕ̞, a voiced pharyngeal approximant, Egyptian Arabic
[ ] rows 564-595, 599-600, 602-608 are just diacritics, suspect this a problem of notation such as tʰ in source corpus, or possibly as class labels, e.g. ʰ for all aspirated segments
[ ] rows 614-642, 647-653 ordinary characters, not phonetic data at all
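
Rows that consist only of diacritics, like those suspected in rows 564-595 above, can be flagged automatically. The following is a minimal sketch that relies on Unicode general categories; the choice of categories and the name `is_bare_diacritic` are assumptions made for illustration, not part of CLTS.

```python
import unicodedata

# General categories treated as "diacritic-like" in this sketch (an assumption):
# Mn = nonspacing mark, Lm = modifier letter, Sk = modifier symbol.
DIACRITIC_CATEGORIES = {"Mn", "Lm", "Sk"}


def is_bare_diacritic(segment):
    """Return True if the segment is non-empty and contains no base character."""
    decomposed = unicodedata.normalize("NFD", segment)
    return bool(decomposed) and all(
        unicodedata.category(ch) in DIACRITIC_CATEGORIES for ch in decomposed
    )


# Aspiration mark alone, t plus aspiration mark, seagull below alone.
for seg in ["\u02b0", "t\u02b0", "\u033c"]:
    print(repr(seg), is_bare_diacritic(seg))
```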

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/eurasian.tsv:

[x] row 1516 t, the browser gets this...
[x] row 298 pʲʰ this should be fine!
[ ] row 720 dʒː this should be fine!
[ ] row 908 n̥̪ this should be fine?!
[ ] row 1108 d̥ʒ̥ this should be fine!
[ ] row 480 ɫ̪ and row 481 ɾ̪ to add (question θ̪)
[x] row 197 s̻ laminal s is recognised, but not z̻ laminal z in row 198. Weird...
[ ] row 936, 937 vowel + semivowel combinations, e.g. aj, as diphthong aliases?
[ ] rows 951, 952 etc. opposite, e.g. ja
[ ] row 987 ff. e.g. ʼɡ preglottalised stops?
[ ] rows 1296 ff. e.g. b˞ etc. unclear what these are
[ ] rows 1374 ff. e.g. t͇s͇ etc. unclear what these are (unnecessary alveolar diacritic)
[ ] rows 88, 90 etc. brackets
[ ] rows 170, 171 etc. looks like U0311 combining inverted breve, with consonants. No idea as to function here
[ ] rows 388, 389 etc. ʱb etc. preaspirated voiced stops!?

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/lapsyd.tsv:

[x] multiple rows, e.g. 131, 133-137, 164, 168, 172, 173 etc. are simply vowels enclosed in single quotation marks – not our problem
[x] row 12 ? Not sure what this character is? tab?
[x] there are a number of four-character "segments"
[ ] row 39-51 click combinations beginning with an alias of https://clts.clld.org/parameters/devoiced_voiced_velar_nasal_consonant (see above)
[ ] combinations of coronal consonant plus r, e.g. tr, nr, rr
[ ] some vowels with single quotes either side of them, e.g. 'e'

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/multimedia.tsv:

[x] row 100, 101 mistakes for combining seagull below – add as aliases, cf first note under diacritic aliases
[x] rows 120, 122 failure of centralised and midcentralised diacritic
[x] row 125 what is this?
[ ] row 133-139 conventional diacritics (acute etc.) for tone

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/nidaba.tsv:

[ ] row 13 ɝ rhotacised reverse open e with hook, to be added
[ ] row 297 should be in as alias of https://clts.clld.org/parameters/devoiced_voiced_retroflex_lateral_approximant_consonant, row 298 as alias of https://clts.clld.org/parameters/devoiced_voiced_retroflex_nasal_consonant, row 462 as alias of https://clts.clld.org/parameters/syllabic_voiced_retroflex_nasal_consonant, row 468 too
[ ] row 62 ɮ̪ dental lateral fricative should be added, also row 66 ɺ̪, row 67 ɹ̪, row 89 ɾ̪
[ ] row 73, 75 etc. whistled sibilants
[ ] row 161, what is this? row 173 ɝ̯
[ ] row 878 t̪͡θʰ should be okay
[ ] row 1117 ɖ͡ʐː not recognised, but ɖ͡ʐ is fine. Problem with length and affricates?

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/pbase.tsv:

[ ] rows 10, 11 not rendering properly but prenasalised affricates, also 107, curiously not 108
[ ] row 19 ts̀ "tense" affricate, also 118, also 318; row 142 "lax", also 151, 399, 426, 446, 447
[x] row 20 tsjʰ and row 21 tswʰ input errors for labialised and palatalised aspirated alveolar sibilant affricates
[ ] row 24 tɕɥ should be labio-palatal superscript, secondary articulation, also row 42, also 313, also 330; row 153 wɦ should be superscript too
[x] row 59 t̪s̪ɦ input error – should be voiced not voiceless, also 110
[ ] rows 129-138 clicks, 178-180, 182, 184, 193-199, 207-209, 211, 214-215, 220-222, 224, 232-236, etc. to 272
[ ] row 159 ʔj preglottalised palatal approximant
[ ] row 165 ĵ "lenis"
[ ] row 169 qɰ supposedly voiced uvular approximant ʁ̞
[ ] row 319 š trilled sibilant
[x] row 326-327 z͡ɣw again!
[ ] row 347-348 dental to labiodental fricatives, also 384 labiodental to postalveolar
[ ] rows 385 and 386 co-articulated velar-labial fricatives ɣ͡β and x͡ɸ , also 556, 557 for nasals

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/phoible.tsv:

[x] row 197 *R
[ ] row 403 alveolar diacritic

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/powoco.tsv:

[x] row 309 ɡ̥ʰ, I don't see the problem here – why is this not being parsed
[x] row 74 š, post-alveolar stop alias?
[x] rows 372, 375 tones beginning in 0
[ ] row 262 ʔj
[ ] rows 30, 49, 73 not transcription data

All okay: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/apics.tsv https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/beijingdaxue.tsv https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/chomsky.tsv

To check: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/panphon.tsv (can't download this .tsv)

Consider abandoning: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/ruhlen.tsv

LinguList commented 5 years ago

These updates should all be done by the middle of next week, so that we have some more time to actually check for other things etc.

The ligature tie, btw, is regularly deleted by clts:

In [9]: from pyclts.transcriptionsystem import TranscriptionSystem

In [11]: bipa = TranscriptionSystem('bipa')

In [12]: bipa['ʐ͡ɣ']
Out[12]: UnknownSound(ts=<pyclts.transcriptionsystem.TranscriptionSystem object at 0x7f38c733c7f0>, grapheme='ʐɣ', source='ʐ͡ɣ', generated=False, note=None)

The reason that sound is NOT accepted is that we restrict double-articulation to a certain number of sounds.
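
In the output above, the `grapheme` field holds the input without its tie bar. A minimal sketch of that kind of tie removal is given here, using only the standard library; it is an illustration of the effect and not the pyclts implementation, and `TIE_BARS` and `strip_tie` are hypothetical names.

```python
# Tie bars that join two symbols: U+0361 (above) and U+035C (below).
# Mapping a code point to None makes str.translate delete it.
TIE_BARS = {0x0361: None, 0x035C: None}


def strip_tie(grapheme):
    """Return the grapheme with any tie bar removed."""
    return grapheme.translate(TIE_BARS)


# U+0290 + tie bar + U+0263, the sequence queried in the session above.
print(strip_tie("\u0290\u0361\u0263") == "\u0290\u0263")
```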

cormacanderson commented 5 years ago

Re the ligature tie, in this case CLTS did not delete it, but rather threw up the entire combination as an error, so the problem is actually a question of whether to add the grapheme. Still editing here.

LinguList commented 5 years ago

The syllabicity is a problem of using the wrong diacritic:

In [14]: c = bipa['ŋ̣']

In [15]: c.uname
Out[15]: 'LATIN SMALL LETTER ENG / COMBINING DOT BELOW'

In [23]: c2 = bipa['syllabic voiced alveolar nasal consonant']

In [24]: c2.s
Out[24]: 'n̩'

In [25]: c2.uname
Out[25]: 'LATIN SMALL LETTER N / COMBINING VERTICAL LINE BELOW'

So you are using the wrong character here. We can add it to our list of aliases, but I would not like to do that, as the dot is used for meaning other things.
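
The same comparison can be made without pyclts installed. This is a minimal sketch that rebuilds a `uname`-style description from the standard library's `unicodedata` module; the function name `describe` is hypothetical, and the output format only imitates the one shown above.

```python
import unicodedata


def describe(grapheme):
    """Join the Unicode names of all code points in the decomposed grapheme."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    return " / ".join(unicodedata.name(ch, "UNKNOWN") for ch in decomposed)


print(describe("\u014b\u0323"))  # eng + combining dot below
print(describe("n\u0329"))       # n + combining vertical line below
```

Because the two diacritics look alike in many fonts, comparing their names is a quick way to tell U+0323 from U+0329 in a dataset.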

LinguList commented 5 years ago

Re the ligature tie, in this case CLTS did not delete it, but rather threw up the entire combination as an error, so the problem is actually a question of whether to add the grapheme. Still editing here.

As I said, and if you see my code example above, you will see what CLTS does: we restrict double-articulation things to stops and nasals and don't allow for all combinations, for in principle good reasons.

cormacanderson commented 5 years ago

Actually, and this surprised me, but the dot is actually IPA standard for syllabicity. If we follow the IPA strictly, it is the combining vertical line below that should be the alias.

LinguList commented 5 years ago

So, @tresoldi, as you'll look into this: please have a look at the way in which I just evaluated the problems, and decide afterwards in your PR. We may have to add new features, I can point to where to do so in the code, but we also need to check with the input, and see if it's an actual CLTS error, or why clts does not accept a certain sound, etc.

LinguList commented 5 years ago

And @cormacanderson, following the installation instructions, you may want to test my code examples yourself, as they will give you a deeper impression of how these things are handled.

tresoldi commented 5 years ago

I'm checking with @cormacanderson , we'll decide later on a per-item basis (for example, the affricate thing might actually just be an error when ɡɣ is intended).

cormacanderson commented 5 years ago

@LinguList we are still adding things here. I haven't even finished the description of the issue. This involves copy and pasting for various sources. Once I have done the copy and paste, I will edit.

LinguList commented 5 years ago

I'm checking with @cormacanderson , we'll decide later on a per-item basis (for example, the affricate thing might actually just be an error when ɡɣ is intended).

I am not sure. Check here please: http://www.lfsag.unito.it/ipa/index_en.html

and otherwise, point me to the IPA where the dot is intended for this.

LuPaschen commented 5 years ago

Re syllabicity: The dot used to be official IPA (as of 1996, see https://linguistlist.org/unicode/ipa.html ) but was replaced by the vertical line in subsequent versions (e.g. https://upload.wikimedia.org/wikipedia/commons/8/8e/IPA_chart_2018.pdf ). So having them as aliases would make sense from our perspective, unless the dot is reserved for some other purpose.

LinguList commented 5 years ago

So if you start classifying the problems in these categories, e.g., as alias etc., this would already be very helpful.

cormacanderson commented 5 years ago

@LinguList please be patient. As I said above, I am still setting up this issue. First I copy and paste, then I start sorting it out.

cormacanderson commented 5 years ago

@LuPaschen okay I thought you said it was current IPA. Good to know. In that case, given the number of different ways in which it is used in the literature, e.g. for emphatics or retroflexes, I don't think it wise to set up the dot as an alias of the combining vertical line below. I will add this to the relevant SndCmp issue instead and it can be dealt with through an orthography profile or just as a search and replace.

tresoldi commented 5 years ago

I'm checking with @cormacanderson , we'll decide later on a per-item basis (for example, the affricate thing might actually just be an error when ɡɣ is intended).

I am not sure. Check here please: http://www.lfsag.unito.it/ipa/index_en.html

I'm not finding ʐ͡ɣ there, is it the right link?

My guess of ɡɣ (with no bar) is due to both articulatory problems (moving from palatal to velar and from sibilant to non-sibilant in a single affricate seems too much, especially for distinctivity from a pure palatal affricate) and to the fact that it is attested in some Germanic dialects (like good in English Cockney, as per wiki). ʐ͡ɣ has almost no Google hits (even without the bar, i.e. "ʐɣ"), almost none related to phonology besides some uncited papers, and apparently is not used in any other language (not even Phoible has it). An additional point here is that a sibilant velar fricative, which would be expected in case of such an affricate, is considered impossible in the IPA.

Given that it is not in IPA and that there is no scholarly reference on the segment itself, I would reject the inclusion of such a segment and keep the suggestion to check if the intended sound isn't actually /ɡɣ/ (unless some supporting material is provided, of course).

cormacanderson commented 5 years ago

I have added ʐ͡ɣ to https://github.com/lingdb/Sound-Comparisons/issues/484

cormacanderson commented 5 years ago

@LuPaschen I can't add you to that issue for some reason. Can you assign yourself?

LinguList commented 5 years ago

I just figured it was important to inform you that a structured investigation is needed, and I thought it was nice to share some code examples. I'll ignore all of your discussions now and check with @tresoldi later, when you have your systematized proposals.

LuPaschen commented 5 years ago

@cormacanderson I remember now that I did say that the dot was current IPA -- sorry for the confusion, the correct story is in my previous comment. I cannot assign myself to the #484 but I am watching so I get notifications.

cormacanderson commented 5 years ago

@LuPaschen okay no problem. Your name doesn't pop up when I type @ either. Maybe you're not part of the lingdb consortium or the cldf one?

cormacanderson commented 5 years ago

@LinguList finished editing the issue now, having added everything. I think it's systematic now but comments of course welcome. I will look through some of the other transcription datasets and add to this, but I think that already these changes will substantially improve our coverage. @tresoldi and I will meet again in the first half of next week to advance this.

LuPaschen commented 5 years ago

2 things to add here:

1. The linguolabial diacritic ◌̼ U+033C, ̼ (e.g. "s̼", "z̼")
2. Centralisation and mid-centralisation diacritics are deleted by check_export.py:
ˈfʊ̆iɹ̝ä -> f ʊ̆i � a
ˈbrɪɒ̟ˑdɛ̽ɹ̞ -> b r ɪɒ̟ˑ d ɛ �

Not sure if 2. is due to CLTS or if it has sth to do with the script, but it needs checking.

cormacanderson commented 5 years ago

@LuPaschen thanks for the comments, although better in future to add such suggestions to https://github.com/cldf/clts/issues/120 (editing the description, with checkboxes, ideally). I will move them in here if necessary afterwards. Responding specifically:

I actually have U+033C mentioned here already under the diacritic aliases, but I checked and it does indeed need added (wasn't sure when I put up the issue, but checked there now).
The centralisation diacritics are actually already specified in BIPA. I'm not sure why they are being deleted and this is something to be checked. One to ask Bibiko about.

LinguList commented 5 years ago

@LuPaschen you know we have a web-interface that lists more or less the last version of clts? We can also update (in fact, we will soon): http://calc.digling.org/clts

The idea is that you manually segment your sounds and paste them to the field, so you see what clts makes out of it.

LinguList commented 5 years ago

If it does not work, this means we have not found that sound segment in any existing database so far, so even if you'd say it is missing, it is potentially extremely rare (but we find also that with now 8000!!! different segments, it is not surprising that there are more things we cannot cover for now).

LinguList commented 5 years ago

More importantly also: clicking on the sound will give you the unicode character, so this is a convenient way to check -- even if a sound is not yet accepted in clts -- how it is composed.

LinguList commented 5 years ago

Centralisation is potentially only defined for vowels: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv

You can easily search here, mentioning a "feature", and you'll see that "centralized" is accepted, but only for vowels.
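
A search of this kind can also be scripted. The following is a minimal sketch that filters a tab-separated diacritics table by value and reports which sound types it is defined for. The column names and the sample rows are assumptions made for illustration and may not match the real `diacritics.tsv`.

```python
import csv
import io

# Made-up rows for illustration only; the real file may use different
# column names and contains many more entries.
SAMPLE_TSV = (
    "GRAPHEME\tTYPE\tFEATURE\tVALUE\n"
    "\u25cc\u0308\tvowel\tcentralization\tcentralized\n"
    "\u25cc\u033d\tvowel\tcentralization\tmid-centralized\n"
    "\u25cc\u031e\tconsonant\traising\tlowered\n"
)


def rows_with_value(tsv_text, value):
    """Return all rows whose VALUE column equals the given value."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["VALUE"] == value]


def types_for_value(tsv_text, value):
    """Return the sorted sound types for which the value is defined."""
    return sorted({row["TYPE"] for row in rows_with_value(tsv_text, value)})


print(types_for_value(SAMPLE_TSV, "centralized"))
```

With a local copy of the file, the sample string would be replaced by the file contents read with UTF-8 encoding.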

LuPaschen commented 5 years ago

@LinguList Thanks for the link to the web-interface, this is really helpful. I checked the centralized and mid-centralized vowels from the examples above ("ä", "ɛ̽") and they are indeed not recognized by clts, which is unexpected given that they are listed under the diacritics in BIPA.

LinguList commented 5 years ago

Ah, in fact, what I said is important: this is just what we HAVE in all datasets we added so far, as CLTS involves a rather complex algorithm that wouldn't run easily on a server. So this clts-app is just showing you what IS there, as reflected in the datasets, while CLTS itself may be able to generate more sounds. You can check this in an interactive Python session, if you have clts installed:

>>> from pyclts.transcriptionsystem import TranscriptionSystem as TS
>>> bipa = TS('bipa')
>>> bipa['ä'].name
'unrounded open centralized vowel'

So this is accepted by CLTS, but it does apparently not occur in the data we considered to load that database.

LuPaschen commented 5 years ago

Ok, thanks for the clarification. Since the disappearing diacritics are not due to CLTS then, I'll add this to tresoldi/soundcomparisons/issues .

LinguList commented 5 years ago

@LuPaschen, let's be clear about "disappearing diacritics": what you mean is what is produced by the automatic segmentation with the question marks, right? This is a process that is due to you having not specified the sounds in the orthography profile. Do you have Python running on your computer and could make an interactive python session? If so, I can give you instructions on how to test all of this quickly on your computer, so you'll have it easier to understand where your orthography profile fails and where clts is missing something...

LinguList commented 5 years ago

@LuPaschen, just confirmed that: the orthography profile is missing this. So it is on your side, not even on the side of the code that @bibiko writes there.

LuPaschen commented 5 years ago

Who wrote that profile? We didn't.

LinguList commented 5 years ago

https://github.com/tresoldi/soundcomparisons/blob/master/profile.tsv

this is an automatically generated profile, and your task is to update it if it fails, since the segmentation is a process independent of evaluation by clts (obviously). This also explains why you do not get the pre-aspirated stuff.

So the workflow is:

1. segmentation (using the profile, that you have to edit) in the script
2. checking with clts
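
The segmentation step of this workflow can be sketched as follows. This is a minimal illustration of greedy longest-match segmentation against a profile, under the assumption that a profile maps input graphemes to output segments; it is not the script used in the soundcomparisons repository, and the function name `segment` and the sample profiles are hypothetical.

```python
import unicodedata


def segment(word, profile, missing="\ufffd"):
    """Greedy longest-match segmentation of a word against a profile.

    `profile` maps input graphemes to output segments; characters that no
    profile entry covers are emitted as `missing`.
    """
    word = unicodedata.normalize("NFD", word)
    table = {unicodedata.normalize("NFD", key): value for key, value in profile.items()}
    longest = max(map(len, table), default=1)
    segments, i = [], 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                segments.append(table[chunk])
                i += size
                break
        else:
            segments.append(missing)  # nothing in the profile matches here
            i += 1
    return segments


lossy = {"f": "f", "a\u0308": "a"}           # maps a + diaeresis to plain a
faithful = {"f": "f", "a\u0308": "a\u0308"}  # keeps the diaeresis
print(segment("fa\u0308", lossy))
print(segment("fa\u0308", faithful))
```

In this sketch a diacritic can be lost in two ways: the profile maps the grapheme to an output without it, or the profile has no entry covering it and the placeholder is emitted. Both are fixed by editing the profile, before any check against clts takes place.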

LuPaschen commented 5 years ago

So just to make sure I understand this correctly: in the profile, row 19, it says ä a. Replacing this with ä ä should get rid of the problem of the disappearing trema for this particular vowel, and in order to get rid of the problem as a whole, we would have to search and replace all wrong mappings of this kind?

This does not answer the question where the mapping V+trema -> V without trema comes from in the first place. It seems awfully random, with all the other diacritics being preserved faithfully.

LinguList commented 5 years ago

@LuPaschen, the profile was created by somebody some time ago, and I do not know who created it, so don't look at the history of the profile, but just take the profile as the major point where you need to improve. This is the pure computer-assisted workflow we are talking about: we do the first things, the expert does the fine-tuning, and it seems that in the course of discussions in the past, nobody has actually explained what the profiles are made for: they are made for you guys to adjust the automatic results. If you open the file carefully in libreoffice (with tabstop as separator, no quotes), you can edit it, save it, rerun the analysis, and you'll see that you have fewer question marks. In fact, all cases with question marks should first be checked by you, they have nothing to do with clts, only with the orthography profile.

cormacanderson commented 5 years ago

@LinguList, I'm also now curious about the origin of that profile, because it might help us to diagnose the problem we have with the centralisation diacritics, which are specified here https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv, but which don't seem to be recognised in the datasets I am checking, e.g. lines 120 and 122 here https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/multimedia.tsv. Have you any idea what might be going on here?

LinguList commented 5 years ago

see my update on the links. The feature was ill-defined: it is "centralization", not centrality.

In [5]: bipa['ë'].s
Out[5]: 'ë'

In [6]: bipa['ë'].name
Out[6]: 'centralized unrounded close-mid front vowel'

tresoldi commented 5 years ago

I'm closing this issue, we can go back to it once CLTS is released (anytime soon) and then list all the issues individually.

cldf-clts / clts-legacy

Updates to CLTS transcriptions #121