Closed tresoldi closed 3 years ago
Pinging @lingulist @xrotwang @cormacanderson
I think I agree with all your recommendations. We might want to have a small document about policies for additions - ideally small enough to fit in CONTRIBUTING.md. What I'd add there is that
while CLTS also lists phonemes, it has a focus on transcriptions, i.e. things actually encountered in texts or wordlists - as opposed to listings of phoneme inventories which may be encountered in grammars. In this sense CLTS is "raw" data driven.
I suspect that this rule can be applied to resolve a couple of the issues listed above, too.
You see, @tresoldi, the largest number of cases are CLUSTERs that have slipped in because they are captured by the algorithm. These went unnoticed, since nobody wanted to check the segments one by one. This is what I meant when I said that including triphthongs would make this problematic. It would give the algorithm too much power and allow too many graphemes to be accepted, even though they are only aliases.
The cases you list there can be actively handled. I suggest you split them and make a PR for them step by step, so we do not have too much to review at once.
Triphthongs are out of the question for now.
In addition, may I point you, @tresoldi, to the new issues in https://github.com/cldf-clts/cldf ? As the data should be actively expanded, I would welcome it if this could also be done along with handling the issues.
Yes, as there are no changes to pyclts
we can move the issues there. I'll keep this open for the time being and split into different issues in the data repository.
In fact, @tresoldi, I thought these were clts issues, but these are phoible issues, so all you have to do is handle them from the phoible transcription data, in the first instance, right? This should then probably also be done with the other transcription datasets, like ruhlen, etc.
Not all are strictly Phoible, I suppose. The aliases for clicks came up in the past (I remember Anne-Maria was not entirely happy about the current status).
But one more reason to split the issue into multiple ones in clts-data.
> (I remember Anne-Maria was not entirely happy about the current status).
There was time to change things a long time ago. I wonder why this did not happen before. But I suppose that when you open an issue on click sounds there now and propose a solution, you can get back to Anne and have her comment on it directly.
BTW: all new tones proposed are combinations with features such as creakiness, which should be displayed on the vowel, as they are not a characteristic of the tone itself, so they will be left unlinked and do not need to be handled here.
Sorry, just saw you agree to discard them.
In fact, if there really is something like the idea to map all of phoible, for example, it would be good to add a comment to all those items which one leaves unlinked... in this way we make sure they are explicitly dealt with by colleagues.
One more note on click consonants: my annotation stems from a set of recommendations provided by Anne, so the problem is rather the phoible representation, I guess, as she was recommending to me how to handle them. So the most important thing is to get in touch with her if you want to add new click sounds.
Sure, I'll do it.
I agree with your suggestions @tresoldi. I would be inclined though to add the less rounded etc. diacritic, which is fair use of the IPA, and would also like to include the alveolar one, which isn't strictly correct IPA usage, but is frequent. As for triphthongs, I see the obstacle to adding them as being practical, rather than anything else. There is nothing in the IPA that precludes them and linguists do use triphthongs, after all, so a priori, including them here is a desideratum. However, I acknowledge the technical impediments, so perhaps we can come up with a workaround for transcription data in which we frequently encounter them.
As to triple vowels, I say: we do them very last, and only if substantial evidence is brought up that it is worth mapping them. For 30 sounds in phoible alone, I'd not do it. But even in that case, one can make a list of those encountered and store it somewhere until we have a critical mass to then address the problem. So in short: we do that later, not first, and we can discuss it in due time.
@xrotwang while I agree with that piece of text in principle, as a should, I don't think that's how it has been in practice. For phonemic transcriptions, the phoneme inventories listed in individual sources should be identical to the character set used in the transcription data in that source. In these cases, segmentation is established a priori and is intrinsic to the presentation of the data. I would expect the character set in such cases to be in two figures for most languages, three figures only for outliers. For phonetic transcriptions, such as Sound Comparisons, segmentation is not intrinsic to the analysis, but is rather a necessary evil contingent on the structure of the IPA. The speech signal is by nature contiguous and the IPA forces us to represent it using discrete means. Here, I would expect the character set per language to be considerably larger, frequently three figures for individual languages.
> For phonemic transcriptions, the phoneme inventories listed in individual sources should be identical to the character set used in the transcription data in that source.
See our paper on IGT in this regard: https://hcommons.org/deposits/item/hc:27765/
This shows that things are in fact not always consistent. The data-driven aspect is thus still crucial. The phoneme inventory is often not the same as the one you find being used in the glossary, at least that's my personal impression from the sources with which I worked. Why is that so? Because the correspondence between the parts of a book is not checked computationally.
That's why I said "should be" ;) Agreed on the data-driven approach and the usefulness of checking computationally. However, it's also the case that similar issues to those encountered above are likely to arise working off the transcription data, with the additional problem of segmentation thrown into the mix...
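The computational check discussed here, comparing the phoneme inventory a source states with the segments actually attested in its wordlist, could be sketched roughly as follows. This is only an illustration and not part of pyclts: the function name `compare_inventory` and the toy data are hypothetical, and the sketch assumes the forms are already segmented with spaces, which sidesteps the segmentation problem mentioned above.

```python
def compare_inventory(inventory, segmented_forms):
    """Compare a stated phoneme inventory with the segments attested in a wordlist.

    `inventory` is an iterable of graphemes; `segmented_forms` is an iterable of
    space-segmented transcriptions (an assumption of this sketch).
    """
    stated = set(inventory)
    attested = {seg for form in segmented_forms for seg in form.split()}
    return {
        # listed in the inventory, never used in the forms
        "unattested": sorted(stated - attested),
        # used in the forms, missing from the inventory
        "unlisted": sorted(attested - stated),
    }


# Toy data, invented for illustration only.
inventory = ["p", "t", "k", "s", "a", "i", "u"]
forms = ["p a t a", "k i t u", "t a ŋ i"]
report = compare_inventory(inventory, forms)
print(report)
```

On real data, the two lists would point to the places where inventory and glossary disagree and need a manual look.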
@cormacanderson and I reviewed the current coverage of the PHOIBLE inventories in order to identify potential problems/improvements, preparing a new release. I am attaching the full list (with a .txt extension, as GitHub does not allow .tsv): clts_changes.txt
The issues are:
[ ] There are 39 graphemes that should be listed as aliases, related to features such as breathy voice, glottalization, pre-glottalization, dental/alveolar, palatalization, etc. They are listed at the top of the TSV file.
[ ] Among those, there are two which theoretically relate to sound classes (Rʲ and R̪). Looking at the inventories, it seems they should have been a plain /r/ (i.e., alveolar trill), but I have no problem in disregarding them.
[ ] There are about 100 click consonants either written in Africanist notation (instead of strict IPA) or which are not supported due to diacritics (such as ŋ̥ǃ). The large majority are cases of glottalization, devoicing, and nasalization. While some might be disputed depending on analytical or even theoretical matters, I'd go for supporting all of them when it is clear what they are supposed to indicate (it is definitely not our job, and not the place, to discuss the articulation of click consonants).
[ ] There are some unsupported two-segment clusters. I would only add a handful, cases where it is clear what is referred to (e.g., ɲ̥ɲ̥, which is just a devoiced version of a geminated consonant already in BIPA).
[ ] There is a tautological grapheme, r̠̙. While I would have no problem in adding a single redundant entry if necessary, I'd prefer to skip over it (don't want to "open the gates" to all possible redundant notations). Let people sanitize their data.
[ ] There are 47 entries for which it is unclear what they are supposed to represent. I would review a couple in the original literature to see if they make sense and are worthy of inclusion, but wouldn't spend much time on it otherwise. Same thing for 55 entries with an unclear rhotic component that might be just artefacts of notations either trying to preserve alignments or considering invariable co-occurrence in minimal pairs.
[ ] There are 100 entries where the Phoible grapheme is not supported due to a vertical bar separating allophones. Not our problem.
[ ] We are missing some graphemes due to non-standard alveolar/dental diacritics (e.g., s͇). These are not standard but frequent; maybe we should add as aliases only the cases which are found (I mean, not adding them as diacritics, otherwise we can end up adding all the Unicode code points that kinda look like a line, for example; for this reason, I am happy to skip over them as well).
[ ] Phoible also uses a handful of frictionalized sounds marked by a "combining x below" diacritic (see here). I suppose we can disregard these as well (for CLTS/BIPA).
[ ] There are some unsupported tones, but as they are only used in a couple of inventories I think we can disregard them.
[ ] There are many triphthongs which will be unsupported, as per https://github.com/cldf-clts/pyclts/pull/5
[ ] Other issues involve diphthong notation with superscript, diacritics for less rounded, and tongue root retraction. As there are at most 2 instances of each, I suppose we can disregard these as well.
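
Some of the categories in the list above can be pre-sorted mechanically before manual review. The following is a minimal sketch, not anything from pyclts: the helper names `describe` and `triage` and the category labels are hypothetical, and it assumes that the allophone separator is the ASCII vertical bar, that the non-standard alveolar mark is U+0347 (combining equals sign below), and that the frictionalization mark is U+0353 (combining x below).

```python
import unicodedata

ALLOPHONE_BAR = "|"      # assumption: ASCII vertical bar separates allophones
EQUALS_BELOW = "\u0347"  # COMBINING EQUALS SIGN BELOW (assumed alveolar mark)
X_BELOW = "\u0353"       # COMBINING X BELOW (assumed frictionalization mark)


def describe(grapheme):
    """List the code point and Unicode name of every character in a grapheme."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in grapheme]


def triage(grapheme):
    """Assign an unsupported grapheme to a rough category for later review."""
    if ALLOPHONE_BAR in grapheme:
        return "allophone-bar"
    if EQUALS_BELOW in grapheme:
        return "nonstandard-alveolar"
    if X_BELOW in grapheme:
        return "frictionalized"
    return "review"


# Example graphemes written with escapes so the code points are unambiguous.
for g in ["s\u0347", "x\u0353", "a|b", "\u014b\u0325\u01c3"]:
    print(triage(g), describe(g))
```

Printing the code points also makes look-alike diacritics visible, which helps when deciding whether a grapheme should become an alias or be left unlinked.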