cldf-clts / pyclts

Apache License 2.0

Changes to pyclts/BIPA #6

Closed tresoldi closed 3 years ago

tresoldi commented 4 years ago

@cormacanderson and I reviewed the current coverage of the PHOIBLE inventories in order to identify potential problems/improvements, preparing a new release. I am attaching the full list (.txt extension, GitHub does not allow .tsv): clts_changes.txt

The issues are:

tresoldi commented 4 years ago

Pinging @lingulist @xrotwang @cormacanderson

xrotwang commented 4 years ago

I think I agree with all your recommendations. We might want to have a small document about policies for additions - ideally small enough to fit in CONTRIBUTING.md. What I'd add there is that

while CLTS also lists phonemes, it has a focus on transcriptions, i.e. things actually encountered in texts or wordlists - as opposed to listings of phoneme inventories which may be encountered in grammars. In this sense CLTS is "raw" data driven.

I suspect that this rule can be applied to resolve a couple of the issues listed above, too.

LinguList commented 4 years ago

You see, @tresoldi, the largest number of cases are CLUSTERs that have slipped in because they are captured by the algorithm. They went unnoticed, since nobody wanted to check the segments one by one. This is what I meant when I said that including triphthongs would make this problematic: it would give the algorithm too much power, and too many graphemes would be accepted even though they are merely aliases.

The cases you list there can be actively handled. I suggest you split them and make a PR for them step by step, so we do not have too much to review at once.

Triphthongs are out of the question for now.
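The cluster-slippage problem described above can be illustrated with a minimal heuristic. This is a sketch only: the vowel set and the logic are illustrative assumptions, not the actual pyclts/BIPA feature system, but it shows how an algorithm that accepts any sequence of base letters will happily flag (or accept) graphemes nobody checked one by one:

```python
import unicodedata

# Illustrative, incomplete set of IPA vowel letters; the real BIPA
# feature system in pyclts is far more comprehensive.
VOWELS = set("aeiouyɛɔəɪʊæɑɒʌɐɤøœɯɨʉ")

def base_chars(grapheme):
    """Strip combining diacritics and modifier letters, keeping base letters."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    return [c for c in decomposed
            if not unicodedata.combining(c)
            and unicodedata.category(c) != "Lm"]  # drop modifier letters like ʰ

def looks_like_cluster(grapheme):
    """Heuristic: two or more consonant base letters suggest a cluster."""
    consonants = [c for c in base_chars(grapheme)
                  if c.isalpha() and c not in VOWELS]
    return len(consonants) >= 2

print(looks_like_cluster("kp"))  # True: two consonant letters
print(looks_like_cluster("tʰ"))  # False: aspiration is a modifier letter
print(looks_like_cluster("ts"))  # True: flagged, though it may be a plain affricate
```

As the last example shows, a purely mechanical rule over-generates: it cannot tell a genuine cluster from an affricate written without a tie bar, which is exactly why such segments need human review before being linked.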

LinguList commented 4 years ago

In addition, may I point you, @tresoldi, to the new issues in https://github.com/cldf-clts/cldf ? As the data should be actively expanded, I would welcome it if this could also be done along with handling the issues.

tresoldi commented 4 years ago

Yes, as there are no changes to pyclts we can move the issues there. I'll keep this open for the time being and split into different issues in the data repository.

LinguList commented 4 years ago

In fact, @tresoldi, I thought these were clts issues, but they are phoible issues, so all you have to do, in the first instance, is handle them from the phoible transcription data, right? This should then probably also be done with the other transcription datasets, like ruhlen, etc.

tresoldi commented 4 years ago

Not all are strictly Phoible, I suppose. The aliases for clicks came up in the past (I remember Anne-Maria was not entirely happy about the current status).

But one more reason to split the issue into multiple ones in clts-data.

LinguList commented 4 years ago

(I remember Anne-Maria was not entirely happy about the current status).

There was time to change things long ago; I wonder why this did not happen before. But I suggest that when you open an issue on click sounds there and propose a solution, you get back to Anne and have her comment on it.

LinguList commented 4 years ago

BTW: all the newly proposed tones are combinations that should be marked on the vowel, since features such as creakiness are not characteristics of the tone itself; they will therefore be left unlinked and do not need to be handled here.

LinguList commented 4 years ago

Sorry, just saw you agree to discard them.

LinguList commented 4 years ago

In fact, if the plan really is to map all of phoible, for example, it would be good to add a comment to all those items which one leaves unlinked... this way we make sure they are explicitly dealt with by colleagues.

LinguList commented 4 years ago

One more note on click consonants: my annotation stems from a set of recommendations provided by Anne, so the problem, I guess, is rather the phoible representation, as she recommended to me how to handle these sounds. So the most important thing is to get in touch with her if you want to add new click sounds.

tresoldi commented 4 years ago

Sure, I'll do it.

cormacanderson commented 4 years ago

I agree with your suggestions @tresoldi. I would be inclined, though, to add the less rounded etc. diacritic, which is fair use of the IPA, and would also like to include the alveolar one, which isn't strictly correct IPA usage but is frequent. As for triphthongs, I see the obstacle to adding them as practical rather than anything else. There is nothing in the IPA that precludes them and linguists do use triphthongs, after all, so a priori, including them here is a desideratum. However, I acknowledge the technical impediments, so perhaps we can come up with a workaround for transcription data in which we frequently encounter them.

LinguList commented 4 years ago

As to triple vowels, I say: we do them last, and only if substantial evidence is brought up that mapping them is worthwhile. For 30 sounds in phoible alone, I'd not do it. But even in that case, one can make a list of the ones encountered and store them somewhere until we have a critical mass to address the problem. So in short: we do that later, not first, and we can discuss it in due time.
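The "make a list of them and store them until we have a critical mass" idea could look like the following sketch. It assumes whitespace-segmented forms and an illustrative (deliberately incomplete) vowel set; none of these names come from pyclts itself:

```python
from collections import Counter
import unicodedata

# Illustrative subset of IPA vowel letters, not the full BIPA vowel system.
VOWELS = set("aeiouyɛɔəɪʊæɑɒʌɐ")

def vowel_count(segment):
    """Count vowel base letters after stripping combining diacritics."""
    return sum(1 for c in unicodedata.normalize("NFD", segment)
               if not unicodedata.combining(c) and c in VOWELS)

def collect_triphthongs(segmented_forms):
    """Tally segments containing three or more vowel letters."""
    tally = Counter()
    for form in segmented_forms:
        for segment in form.split():
            if vowel_count(segment) >= 3:
                tally[segment] += 1
    return tally

forms = ["m a uai", "t uai n", "k ia u"]
print(collect_triphthongs(forms))  # Counter({'uai': 2})
```

A tally like this makes it easy to see whether candidate triphthongs recur often enough across datasets to justify the effort of mapping them.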

cormacanderson commented 4 years ago

@xrotwang while I agree with that piece of text in principle, as a "should", I don't think that's how it has been in practice. For phonemic transcriptions, the phoneme inventories listed in individual sources should be identical to the character set used in the transcription data in that source. In these cases, segmentation is established a priori and is intrinsic to the presentation of the data. I would expect the character set in such cases to be in two figures for most languages, and in three figures only for outliers. For phonetic transcriptions, such as Sound Comparisons, segmentation is not intrinsic to the analysis, but is rather a necessary evil contingent on the structure of the IPA. The speech signal is by nature contiguous and the IPA forces us to represent it using discrete means. Here, I would expect the character set per language to be considerably larger, frequently three figures for individual languages.

LinguList commented 4 years ago

For phonemic transcriptions, the phoneme inventories listed in individual sources should be identical to the character set used in the transcription data in that source.

See our paper on IGT in this regard: https://hcommons.org/deposits/item/hc:27765/

This shows that it is in fact not always the case that everything is consistent. The data-driven aspect is thus still crucial. The phoneme inventory is often not the same as the one you find used in the glossary; at least that's my personal impression from the sources with which I worked. Why is that so? Because the correspondence between different parts of a book is not checked computationally.
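The computational check described here can be as simple as a set comparison between a source's declared phoneme inventory and the segments actually observed in its transcription data. A minimal sketch, assuming whitespace-segmented forms (the function and key names are illustrative, not part of pyclts):

```python
def compare_inventory(declared, segmented_forms):
    """Compare a declared phoneme inventory against the segments
    actually observed in the transcription data."""
    observed = {seg for form in segmented_forms for seg in form.split()}
    declared = set(declared)
    return {
        "undocumented": observed - declared,  # used in the data, not in the inventory
        "unattested": declared - observed,    # listed in the inventory, never used
    }

inventory = ["p", "t", "k", "a", "i", "u"]
forms = ["p a t", "k i ts", "t u"]
result = compare_inventory(inventory, forms)
print(result["undocumented"])  # {'ts'}
print(result["unattested"])    # set()
```

Either non-empty set is a signal that the grammar's inventory and the glossary's transcriptions were never cross-checked, which is precisely the kind of inconsistency a data-driven approach surfaces.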

cormacanderson commented 4 years ago

That's why I said "should be" ;) Agreed on the data-driven approach and the usefulness of computational checking. However, it's also the case that issues similar to those encountered above are likely to arise when working off the transcription data, with the additional problem of segmentation thrown into the mix...