cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

Updates to CLTS transcriptions #121

Closed cormacanderson closed 5 years ago

cormacanderson commented 5 years ago

Diacritics to add:

Diacritic aliases:

Consonants to add:

Consonant aliases:

Consonant feature names to change:

To check:

To discuss:

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/bdpa.tsv:

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/diachronica.tsv:

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/eurasian.tsv:

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/lapsyd.tsv:

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/multimedia.tsv:

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/nidaba.tsv:

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/pbase.tsv:

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/phoible.tsv:

To check in https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/powoco.tsv:

All okay: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/apics.tsv https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/beijingdaxue.tsv https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/chomsky.tsv

To check: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/panphon.tsv (can't download this .tsv)

Consider abandoning: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/ruhlen.tsv

LinguList commented 5 years ago

These updates should all be done by the middle of next week, to give us some more time to actually check for other things etc.

The ligature tie, btw, is regularly deleted by clts:

In [9]: from pyclts.transcriptionsystem import *

In [11]: bipa = TranscriptionSystem('bipa')

In [12]: bipa['ʐ͡ɣ']
Out[12]: UnknownSound(ts=<pyclts.transcriptionsystem.TranscriptionSystem object at 0x7f38c733c7f0>, grapheme='ʐɣ', source='ʐ͡ɣ', generated=False, note=None)

The reason that sound is NOT accepted is that we restrict double-articulation to a certain number of sounds.
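
The tie-bar removal visible in the session above (source 'ʐ͡ɣ', grapheme 'ʐɣ') can be imitated with the standard library alone. This is only a sketch of the observed behaviour, not the pyclts implementation; the helper name `strip_tie_bars` and the choice of the two tie-bar code points are assumptions made for illustration.

```python
# Hypothetical helper, not part of pyclts: removes the two Unicode tie bars.
TIE_BARS = {
    "\u0361",  # COMBINING DOUBLE INVERTED BREVE (tie above)
    "\u035c",  # COMBINING DOUBLE BREVE BELOW (tie below)
}


def strip_tie_bars(grapheme):
    """Return the grapheme with any tie bar removed, leaving all else intact."""
    return "".join(ch for ch in grapheme if ch not in TIE_BARS)


source = "\u0290\u0361\u0263"  # ʐ + tie bar + ɣ
print(strip_tie_bars(source))
```

Whether the resulting sequence is then accepted as a sound is a separate question, which is where the restriction on double articulation comes in.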

cormacanderson commented 5 years ago

Re the ligature tie, in this case CLTS did not delete it, but rather threw up the entire combination as an error, so the problem is actually a question of whether to add the grapheme. Still editing here.

LinguList commented 5 years ago

The syllabicity is a problem of using the wrong diacritic:

In [14]: c = bipa['ŋ̣']

In [15]: c.uname
Out[15]: 'LATIN SMALL LETTER ENG / COMBINING DOT BELOW'

In [23]: c2 = bipa['syllabic voiced alveolar nasal consonant']

In [24]: c2.s
Out[24]: 'n̩'

In [25]: c2.uname
Out[25]: 'LATIN SMALL LETTER N / COMBINING VERTICAL LINE BELOW'

So you are using the wrong character here. We can add it to our list of aliases, but I would not like to do that, as the dot is used to mean other things.
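
The difference between the two diacritics can also be checked without CLTS, using Python's `unicodedata` module. The function below is a sketch that imitates the ` / `-joined format of the `uname` output shown above; it is an assumption about that format, not the pyclts code.

```python
import unicodedata


def uname(grapheme):
    """Join the Unicode names of all code points, after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    return " / ".join(unicodedata.name(ch, "UNKNOWN") for ch in decomposed)


print(uname("\u014b\u0323"))  # ŋ with the dot below
print(uname("n\u0329"))       # n with the vertical line below
```

Decomposing first matters because some letters with a dot below also exist as single precomposed code points, which would otherwise hide the combining mark.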

LinguList commented 5 years ago

> Re the ligature tie, in this case CLTS did not delete it, but rather threw up the entire combination as an error, so the problem is actually a question of whether to add the grapheme. Still editing here.

As I said, and as you can see from my code example above, this is what CLTS does: we restrict double articulation to stops and nasals and don't allow all combinations, for reasons that are in principle good.

cormacanderson commented 5 years ago

Actually, and this surprised me, the dot is IPA standard for syllabicity. If we follow the IPA strictly, it is the combining vertical line below that should be the alias.

LinguList commented 5 years ago

So, @tresoldi, as you'll look into this: please have a look at the way in which I just evaluated the problems, and decide afterwards in your PR. We may have to add new features, I can point to where to do so in the code, but we also need to check with the input, and see if it's an actual CLTS error, or why clts does not accept a certain sound, etc.

LinguList commented 5 years ago

And @cormacanderson, following the installation instructions, you may want to test my code examples yourself, as they will give you a deeper impression of how these things are handled.

tresoldi commented 5 years ago

I'm checking with @cormacanderson, we'll decide later on a per-item basis (for example, the affricate thing might actually just be an error where ɡɣ was intended).

cormacanderson commented 5 years ago

@LinguList we are still adding things here. I haven't even finished the description of the issue. This involves copying and pasting from various sources. Once I have done the copy and paste, I will edit.

LinguList commented 5 years ago

> I'm checking with @cormacanderson, we'll decide later on a per-item basis (for example, the affricate thing might actually just be an error when ɡɣ is intended).

I am not sure. Please check here: http://www.lfsag.unito.it/ipa/index_en.html

And otherwise, point me to where in the IPA the dot is intended for this.

LuPaschen commented 5 years ago

Re syllabicity: The dot used to be official IPA (as of 1996, see https://linguistlist.org/unicode/ipa.html ) but was replaced by the vertical line in subsequent versions (e.g. https://upload.wikimedia.org/wikipedia/commons/8/8e/IPA_chart_2018.pdf ). So having them as aliases would make sense from our perspective, unless the dot is reserved for some other purpose.

LinguList commented 5 years ago

So if you start classifying the problems into these categories, e.g., as alias etc., this would already be very helpful.

cormacanderson commented 5 years ago

@LinguList please be patient. As I said above, I am still setting up this issue. First I copy and paste, then I start sorting it out.

cormacanderson commented 5 years ago

@LuPaschen okay I thought you said it was current IPA. Good to know. In that case, given the number of different ways in which it is used in the literature, e.g. for emphatics or retroflexes, I don't think it wise to set up the dot as an alias of the combining vertical line below. I will add this to the relevant SndCmp issue instead and it can be dealt with through an orthography profile or just as a search and replace.
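
A search and replace of this kind could look roughly like the sketch below. It rests on assumptions that would need checking per source: the set of base letters in `SONORANTS` is purely illustrative, and a dot below a nasal may still mean retroflexion in some transcription traditions, so this is not safe to run blindly.

```python
import re
import unicodedata

DOT_BELOW = "\u0323"      # COMBINING DOT BELOW
VERTICAL_LINE = "\u0329"  # COMBINING VERTICAL LINE BELOW (IPA syllabicity)

# Hypothetical set of base letters on which the dot is read as syllabicity.
SONORANTS = "mn\u014b\u0272\u0273lr"
PATTERN = re.compile("([" + SONORANTS + "])" + DOT_BELOW)


def dot_to_syllabic(text):
    """Replace the dot below with the vertical line below on sonorants only."""
    decomposed = unicodedata.normalize("NFD", text)
    replaced = PATTERN.sub(lambda m: m.group(1) + VERTICAL_LINE, decomposed)
    return unicodedata.normalize("NFC", replaced)


print(dot_to_syllabic("\u014b\u0323"))  # ŋ with dot below
print(dot_to_syllabic("\u1e6d"))        # ṭ is left alone
```

Restricting the replacement to a list of base letters keeps dotted obstruents, as used for emphatics or retroflexes, untouched.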

tresoldi commented 5 years ago

> > I'm checking with @cormacanderson, we'll decide later on a per-item basis (for example, the affricate thing might actually just be an error when ɡɣ is intended).
>
> I am not sure. Please check here: http://www.lfsag.unito.it/ipa/index_en.html

I'm not finding ʐ͡ɣ there, is it the right link?

My guess of ɡɣ (with no bar) is due both to articulatory problems (moving from palatal to velar and from sibilant to non-sibilant in a single affricate seems too much, especially for distinctiveness from a pure palatal affricate) and to the fact that it is attested in some Germanic dialects (like "good" in English Cockney, as per wiki). ʐ͡ɣ has almost no Google hits (even without the bar, i.e. "ʐɣ"), almost none related to phonology besides some uncited papers, and apparently is not used in any other language (not even Phoible has it). An additional point here is that a sibilant velar fricative, which would be expected in the case of such an affricate, is considered impossible in the IPA.

Given that it is not in the IPA and that there is no scholarly reference on the segment itself, I would reject the inclusion of this segment and keep the suggestion to check whether the intended sound isn't actually /ɡɣ/ (unless some supporting material is provided, of course).

cormacanderson commented 5 years ago

I have added ʐ͡ɣ to https://github.com/lingdb/Sound-Comparisons/issues/484

cormacanderson commented 5 years ago

@LuPaschen I can't add you to that issue for some reason. Can you assign yourself?

LinguList commented 5 years ago

I just figured it was important to inform you that a structured investigation is needed, and I thought it was nice to share some code examples. I'll ignore all of your discussions now and check with @tresoldi later, when you have your systematized proposals.

LuPaschen commented 5 years ago

@cormacanderson I remember now that I did say that the dot was current IPA -- sorry for the confusion, the correct story is in my previous comment. I cannot assign myself to #484 but I am watching, so I get notifications.

cormacanderson commented 5 years ago

@LuPaschen okay no problem. Your name doesn't pop up when I type @ either. Maybe you're not part of the lingdb consortium or the cldf one?

cormacanderson commented 5 years ago

@LinguList finished editing the issue now, having added everything. I think it's systematic now but comments of course welcome. I will look through some of the other transcription datasets and add to this, but I think that already these changes will substantially improve our coverage. @tresoldi and I will meet again in the first half of next week to advance this.

LuPaschen commented 5 years ago

2 things to add here:

  1. The linguolabial diacritic ◌̼ U+033C, ̼ (e.g. "s̼", "z̼")
  2. Centralisation and mid-centralisation diacritics are deleted by check_export.py:
     ˈfʊ̆iɹ̝ä -> f ʊ̆i � a
     ˈbrɪɒ̟ˑdɛ̽ɹ̞ -> b r ɪɒ̟ˑ d ɛ �

Not sure if 2. is due to CLTS or if it has sth to do with the script, but it needs checking.

cormacanderson commented 5 years ago

@LuPaschen thanks for the comments, although better in future to add such suggestions to https://github.com/cldf/clts/issues/120 (editing the description, with checkboxes, ideally). I will move them in here if necessary afterwards. Responding specifically:

  1. I actually have U+033C mentioned here already under the diacritic aliases, but I have now checked and it does indeed need to be added (I wasn't sure when I put up the issue).
  2. The centralisation diacritics are actually already specified in BIPA. I'm not sure why they are being deleted and this is something to be checked. One to ask Bibiko about.

LinguList commented 5 years ago

@LuPaschen you know we have a web-interface that lists more or less the last version of clts? We can also update (in fact, we will soon): http://calc.digling.org/clts

The idea is that you manually segment your sounds and paste them to the field, so you see what clts makes out of it.

LinguList commented 5 years ago

If it does not work, this means we have not found that sound segment in any existing database so far, so even if you'd say it is missing, it is potentially extremely rare (but we find also that with now 8000!!! different segments, it is not surprising that there are more things we cannot cover for now).

LinguList commented 5 years ago

More importantly also: clicking on the sound will give you the unicode character, so this is a convenient way to check -- even if a sound is not yet accepted in clts -- how it is composed.

LinguList commented 5 years ago

Centralisation is potentially only defined for vowels: https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv

You can easily search here, mentioning a "feature", and you'll see that "centralized" is accepted, but only for vowels.

LuPaschen commented 5 years ago

@LinguList Thanks for the link to the web-interface, this is really helpful. I checked the centralized and mid-centralized vowels from the examples above ("ä", "ɛ̽") and they are indeed not recognized by clts, which is unexpected given that they are listed under the diacritics in BIPA.

LinguList commented 5 years ago

Ah, in fact, what I said is important: the web app only shows what we HAVE in all datasets we added so far, as CLTS involves a rather complex algorithm that wouldn't run easily on a server. So this clts-app is just showing you what IS there, reflected in datasets, while CLTS itself may be able to generate more. You can check this in an interactive Python session, if you have clts installed:

>>> from pyclts.transcriptionsystem import TranscriptionSystem as TS
>>> bipa = TS('bipa')
>>> bipa['ä'].name
'unrounded open centralized vowel'

So this is accepted by CLTS, but it does apparently not occur in the data we considered to load that database.

LuPaschen commented 5 years ago

Ok, thanks for the clarification. Since the disappearing diacritics are not due to CLTS then, I'll add this to tresoldi/soundcomparisons/issues .

LinguList commented 5 years ago

@LuPaschen, let's be clear about "disappearing diacritics": what you mean is the output of the automatic segmentation with the question marks, right? This happens because you have not specified the sounds in the orthography profile. Do you have Python running on your computer, and could you start an interactive Python session? If so, I can give you instructions on how to test all of this quickly on your computer, so it will be easier for you to understand where your orthography profile fails and where clts is missing something...

LinguList commented 5 years ago

@LuPaschen, just confirmed that: the orthography profile is missing this. So it is on your side, not even on the side of the code that @bibiko writes there.

LuPaschen commented 5 years ago

Who wrote that profile? We didn't.

LinguList commented 5 years ago

https://github.com/tresoldi/soundcomparisons/blob/master/profile.tsv

this is an automatically generated profile, and your task is to update it if it fails, since the segmentation is a process independent of evaluation by clts (obviously). This also explains why you do not get the pre-aspirated stuff.

So the workflow is:

  1. segmentation (using the profile, that you have to edit) in the script
  2. checking with clts
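
Step 1 of this workflow can be illustrated with a small greedy, longest-match segmenter. This is a sketch under stated assumptions, not the script used in the project: the function `segment`, the dictionary-based profile, and the use of U+FFFD for unmatched characters are all hypothetical choices made to show how a profile entry decides what survives segmentation.

```python
import unicodedata

MISSING = "\ufffd"  # replacement character for anything the profile lacks


def segment(word, profile):
    """Greedy longest-match segmentation of a word with a grapheme mapping."""
    word = unicodedata.normalize("NFC", word)
    keys = sorted(profile, key=len, reverse=True)  # try longest graphemes first
    segments = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                segments.append(profile[key])
                i += len(key)
                break
        else:
            # No profile entry matches at this position.
            segments.append(MISSING)
            i += 1
    return segments


lossy = {"f": "f", "a": "a", "\u00e4": "a"}        # ä mapped to plain a
fixed = {"f": "f", "a": "a", "\u00e4": "\u00e4"}   # ä mapped to itself

print(segment("f\u00e4", lossy))        # diaeresis is dropped by the profile
print(segment("f\u00e4", fixed))        # diaeresis is preserved
print(segment("f\u025b\u033d", fixed))  # ɛ̽ is not in the profile at all
```

In this picture, a dropped diacritic or a replacement character comes from the profile, before anything is handed to clts in step 2.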

LuPaschen commented 5 years ago

So just to make sure I understand this correctly: in the profile, row 19, it says "ä a". Replacing this with "ä ä" should get rid of the problem of the disappearing trema for this particular vowel, and in order to get rid of the problem as a whole, we would have to search and replace all wrong mappings of this kind?

This does not answer the question where the mapping V+trema -> V without trema comes from in the first place. It seems awfully random, with all the other diacritics being preserved faithfully.

LinguList commented 5 years ago

@LuPaschen, the profile was created by somebody at some point, and I do not know who created it, so don't look at the history of the profile, but just take the profile as the major point where you need to improve. This is the pure computer-assisted workflow we are talking about: we do the first things, the expert does the fine-tuning, and it seems that in the course of discussions in the past, nobody has actually explained what the profiles are made for: they are made for you guys to adjust the automatic results. If you open the file carefully in LibreOffice (with tab as separator, no quotes), you can edit it, save it, rerun the analysis, and you'll see that you have fewer question marks. In fact, all cases with question marks should first be checked by you; they have nothing to do with clts, only with the orthography profile.
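
Instead of scanning the profile by eye, rows that silently drop a diacritic can be listed with a few lines of Python. This is only a sketch: the column names `Grapheme` and `IPA` are assumptions about the profile's header, and the sample data below is made up for illustration rather than taken from the real profile.tsv.

```python
import csv
import io
import unicodedata


def combining_marks(text):
    """Return the set of combining marks in the decomposed form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return {ch for ch in decomposed if unicodedata.combining(ch)}


def lossy_rows(tsv_text, source_col="Grapheme", target_col="IPA"):
    """List rows whose target lacks a combining mark present in the source."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    found = []
    for row in reader:
        dropped = combining_marks(row[source_col]) - combining_marks(row[target_col])
        if dropped:
            found.append((row[source_col], row[target_col], sorted(dropped)))
    return found


# Made-up sample in the assumed two-column layout.
SAMPLE = "Grapheme\tIPA\n\u00e4\ta\n\u025b\u033d\t\u025b\u033d\nf\tf\n"

for source, target, dropped in lossy_rows(SAMPLE):
    names = ", ".join(unicodedata.name(ch) for ch in dropped)
    print(source, "->", target, "drops", names)
```

Reading with a tab delimiter and `csv.QUOTE_NONE` mirrors the "tab as separator, no quotes" setting described above, so quote characters in the data are kept as they are.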

cormacanderson commented 5 years ago

@LinguList, I'm also now curious about the origin of that profile, because it might help us to diagnose the problem we have with the centralisation diacritics, which are specified here https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv, but which don't seem to be recognised in the datasets I am checking, e.g. lines 120 and 122 here https://github.com/cldf/clts/blob/master/src/pyclts/transcriptiondata/multimedia.tsv. Have you any idea what might be going on here?

LinguList commented 5 years ago

see my update on the links. The feature was ill-defined: it is "centralization", not centrality.

In [5]: bipa['ë'].s
Out[5]: 'ë'

In [6]: bipa['ë'].name
Out[6]: 'centralized unrounded close-mid front vowel'

tresoldi commented 5 years ago

I'm closing this issue, we can go back to it once CLTS is released (anytime soon) and then list all the issues individually.