cldf-clts / clts

Cross-Linguistic Transcription Systems
https://clts.clld.org

Pike phonetic symbols #9

Open thiagochacon opened 6 years ago

thiagochacon commented 6 years ago

The attached book has an index of all sounds used in the Pikean tradition of transcriptions, phonemic analyses and practical orthographies. The sounds are accompanied by articulatory descriptions. Nice to have it in CLTS for the future.

Smalley_Pike_Tradition.pdf

tresoldi commented 6 years ago

Doesn't this coincide with, or more properly, isn't it superseded by https://github.com/cldf/clts/issues/80?

LinguList commented 6 years ago

Good point. In fact, this exposes an important question we have been avoiding so far: what kind of data do we want to model, and how far do we want to go? For concepts in Concepticon, it makes sense to include even the smallest list, as such lists can later show what has been used in which research. For TS and TD, however, it is far less obvious: first, the mere fact that a TS has a sound has no great value unless we can find that sound in a real dataset of lexemes (better than inventories), and the same holds for TD (although here, when specializing on some language family, we could theoretically profit, as it will help us avoid adding new sounds). Second, since we're expanding BIPA to a level (see the palatalized aliases I added yesterday) that goes well beyond IPA or anything comparable, we could also argue that the whole enterprise should rather focus on having a robust system that makes a plain IPA-dialect out of what users write as fuzzy IPA.
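
A minimal sketch of that normalization step, assuming a hand-made alias table (the graphemes and mappings below are purely illustrative, not the actual BIPA alias list):

```python
# Map "fuzzy IPA" variants onto one canonical grapheme before any further lookup.
# All aliases here are invented for illustration.
ALIASES = {
    "g": "ɡ",    # ASCII g -> IPA script g
    "ʦ": "ts",   # precomposed affricate -> plain sequence
    "tʸ": "tʲ",  # superscript y for palatalization -> IPA modifier j
    ":": "ː",    # ASCII colon for length -> IPA length mark
}

def normalize(segment):
    """Rewrite a fuzzily transcribed segment into the canonical dialect."""
    for variant, canonical in ALIASES.items():
        segment = segment.replace(variant, canonical)
    return segment

assert normalize("tʸa:") == "tʲaː"
```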

However, as far as TD is concerned, there are virtually no limits to annotation. There are smallish datasets like cldf-clts/clts-legacy#26 which provide interesting feature data, but one could also think of adding Chomsky and Halle in this way (just sticking to their features for English), not to speak of data like cldf-clts/clts-legacy#26, where they have similarity or distance matrices among features. So here, we could go historical.

And for a bigger vision: imagine we have added more different TS to the data; we could potentially even use them to infer which TS a dataset has been written in, right? I mean, the idiosyncrasies of Ruhlen's data are extremely striking, especially the superscript affricates. But this would mean that in the end, we'd also need a plain IPA in addition to our broad IPA, where we discard most aliases. And such a system could then probably also help to evaluate data (e.g., whether it is written in strict IPA).
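
A rough sketch of how such an inference could work, scoring candidate systems by how much of a dataset's segment inventory they cover (the candidate inventories here are toy stand-ins, not the actual CLTS data):

```python
from collections import Counter

# Toy candidate inventories; real ones would come from the registered TS data.
CANDIDATE_SYSTEMS = {
    "plain-ipa": {"t", "s", "ts", "p", "a"},
    "ruhlen-like": {"t", "s", "tˢ", "p", "a"},  # superscript affricate as the tell-tale sign
}

def guess_system(segments):
    """Return (system, coverage) for the candidate covering most of the tokens."""
    counts = Counter(segments)
    total = sum(counts.values())
    scores = {
        name: sum(n for seg, n in counts.items() if seg in inventory) / total
        for name, inventory in CANDIDATE_SYSTEMS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

print(guess_system(["t", "a", "tˢ", "a", "p"]))  # ('ruhlen-like', 1.0)
```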

tresoldi commented 6 years ago

An idea might be to only include TS that have been used in a published dataset, ideally one that is in lexibank. A field "dataset example" with a BibTeX source, perhaps?
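
Purely as illustration, such an entry could look roughly like this (all field names and values are invented, not an actual clts record):

```python
# Hypothetical metadata record for a registered TS, pointing to a published
# dataset (via a BibTeX key) in which the system is actually used.
transcription_system = {
    "id": "ruhlen",
    "name": "Ruhlen's transcription system",
    "dataset_example": "ruhlen2008",  # BibTeX key of a lexibank dataset using it
}
```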

LinguList commented 6 years ago

An idea might be to only include TS that have been used in a published dataset, ideally one that is in lexibank. A field "dataset example" with a BibTeX source, perhaps?

Or, alternatively, we include those we deem important: e.g., the UPA could help us deal with Uralex data from collaborators, and NAPA is still frequently used. Some Sinitic TS may also help (although here, ortho-profiles usually do).

But if somebody wants to turn this into a historical account, that would of course also be possible, and one could register all kinds of older TS no longer in use, but I'd myself prefer to work along other directions ;)

tresoldi commented 6 years ago

Yes, that's the spirit. Maybe, in line with the Glottobank projects, there could be a short guide on how to add transcription data/systems, so that everything is handled in the spirit of open access and open source. It could be a project for a post-1.0 milestone.

thiagochacon commented 6 years ago

yes, when talking with collaborators here in South America about what kinds of TS to include, we faced the same question about scope: what shouldn't we include, in fact? The practical answer we had was that the relevance of a TS will tell us when to include it. Traditions of TS that have spanned several languages, families and regions with just minor modifications will obviously attract our attention first; TS with more restricted distributions will need to wait. In fact, there is a subtle edge between a TS and an orthography profile that has to be defined in practice, rather than theoretically. This definitely goes hand in hand with the TD and Lexibank spirit.

tresoldi commented 6 years ago

There should be a soft threshold for inclusion.

Given that in most cases we have dataset-specific transcription systems that are based on a (perhaps never codified) super system, with a handful of new graphemes or small changes (such as using a simpler-to-type grapheme that is more commonly used for a different segment), I'd say only these super systems should be included.

If changes are really ad hoc (or, to abuse my Latin, sui generis), it is better to normalize them in Lexibank or in the equivalent parsing tool, mapping them to the appropriate supersystem (which we might need to formalize). As a rule of thumb, I'd say that if a system is found in only one dataset and we can map it to a system already in clts by means of a list of string replaces, the system already in clts is enough. A super Tupi or Khoisan system would thus likely merit inclusion, but not a modified version used only by a single author, perhaps one who didn't even carry out fieldwork but only gathered information online. Index Diachronica comes to mind...
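
A minimal sketch of that rule of thumb, assuming a hand-made replacement table (all graphemes invented for illustration); the dataset-specific variant is folded into its supersystem without registering a new TS:

```python
# Ordered list of string replacements mapping an author-specific variant
# onto its supersystem's graphemes.
REPLACES = [
    ("č", "tʃ"),  # hacek affricate -> supersystem grapheme
    ("š", "ʃ"),
    ("y", "j"),   # "y" used for the palatal glide
]

def to_supersystem(form):
    for source, target in REPLACES:
        form = form.replace(source, target)
    return form

assert to_supersystem("čašy") == "tʃaʃj"
```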

thiagochacon commented 6 years ago

yes, this sounds like the way things should go

tresoldi commented 6 years ago

Could you provide us with some concrete examples from SA languages, Thiago? It would be a good test, and it is always good to check actual cases from the beginning. Plus, I'm genuinely interested ;)

thiagochacon commented 6 years ago

sure, Tiago : )

the initial idea was to include some languages or TS traditions from SA in the paper we are currently working on for CLTS, but that would go beyond the scope of the paper. I think a separate study would be interesting, and it could also be of good service to the community if we could include the TS from older sources (travelers and anthropologists before the 1950s).

Maybe we can talk about that more specifically at some point in the near future?

LinguList commented 6 years ago

Yes, @thiagochacon, this is one potential way to develop this further. Now that we're all more or less on the same page regarding the dumb-ass features system, and given the ease of registering a TS by just defining the base symbols and generating as many sounds as possible, we can think of developing things further, to make this similar to Concepticon in some way. For the first release, we will, of course, stick to the basic assets, like UPA, NAPA, maybe the German dialect thing I have already started to prepare, but the idea of turning CLTS into a tool that could become the quasi-authority on how things have been interpreted across time and places is quite thrilling for me.
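
A toy sketch of what "defining the base symbols and generating as many sounds as possible" amounts to; the symbols and diacritics are illustrative, not the actual CLTS generation logic:

```python
from itertools import product

BASE = ["p", "t", "k", "s"]
DIACRITICS = ["", "ʰ", "ʲ", "ʷ"]  # plain, aspirated, palatalized, labialized

# Cross base graphemes with diacritics to enumerate candidate sounds for a new TS.
generated = [base + diacritic for base, diacritic in product(BASE, DIACRITICS)]
print(len(generated), generated[:6])  # 16 ['p', 'pʰ', 'pʲ', 'pʷ', 't', 'tʰ']
```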