Added preliminary work for Uralic Phonetic Alphabet.

tresoldi commented 6 years ago

A preliminary and rudimentary version, as proposed at https://github.com/cldf/clts/issues/79

I'm only adding a TSV file, without the necessary changes to dump.py, statistics.py, pyclts/transcriptiondata.py etc. These can be added later, if this mapping is accepted.

LinguList commented 6 years ago

This looks promising already. Very nice and many thanks!

What I'd do, however, is use the data that you have there to create the consonants.tsv and the vowels.tsv for the transcription systems folder. Diacritics can be empty, normalize.tsv can be slightly modified or just taken from bipa.

With the current clts, you can actually print the symbols in the TSV form needed by simply invoking bipa[sound].table. So, assuming your file is called "cons.tsv", I just modified it as follows:


In [1]: from lingpy import csv2list

In [2]: from pyclts import TranscriptionSystem

In [3]: bipa = TranscriptionSystem('bipa')

In [8]: out = []

In [9]: for line in csv2list('cons.tsv'):
   ...:     try:
   ...:         tbl = bipa[line[1]].table
   ...:         tbl[0] = line[2]
   ...:         out += [tbl]
   ...:     except Exception:  # sound not recognized by bipa: flag the row with '!'
   ...:         out += [['!']+line]

In [12]: with open('consonants.tsv.txt', 'w') as f:
    ...:     f.write('GRAPHEME\tPHONATION\tPLACE\tMANNER\tALIAS\tEXTRA\tNOTE\n')
    ...:     for line in out[1:]:
    ...:         f.write('\t'.join(line)+'\n')

The resulting file is here: consonants.tsv.txt

In fact, I just saw we should separate consonants and vowels ;)
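A minimal sketch of how that separation could be done, assuming each row starts with the feature name and that the name ends in "consonant" or "vowel". The sample rows, the column order and the helper name `split_by_type` are illustrative assumptions, not part of pyclts:

```python
import csv
import io

# Hypothetical sample rows (feature name, BIPA grapheme, UPA grapheme).
SAMPLE = (
    "voiced bilabial stop consonant\tb\tb\n"
    "close front unrounded vowel\ti\ti\n"
)


def split_by_type(rows):
    """Group rows by the last word of the feature name in column 0."""
    groups = {"consonant": [], "vowel": [], "other": []}
    for row in rows:
        if not row:
            continue  # skip empty lines
        words = row[0].split()
        kind = words[-1] if words else ""
        key = kind if kind in ("consonant", "vowel") else "other"
        groups[key].append(row)
    return groups


reader = csv.reader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
groups = split_by_type(list(reader))
```

Rows that end in neither word land in "other", so they can be checked by hand before consonants.tsv and vowels.tsv are written.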

Length markers would also be candidates for the diacritics, similar to the palatalization marker.

LinguList commented 6 years ago

Furthermore, some sounds are not recognized, but here is the proposed solution:

name	upa	note
devoiced voiced labio-dental stop consonant	ʙ͔	here, we lack the labio-dental stop in our data; we should consider adding it
palatalized voiced alveolar consonant	ď	rewriting as "palatalized voiced alveolar stop consonant" will do
palatalized voiceless alveolar lateral approximant consonant	ʟ́	we seem to lack the voiceless lateral approximant, or, if the accent denotes devoicing, we should rather call it "palatalized devoiced voiceless alveolar lateral approximant consonant"
palatalized voiceless alveolar nasal consonant	ɴ́	same as one up
palatalized voiceless alveolar trill consonant	ʀ́	ditto
voiced labio-dental stop consonant	b͔	again our problem with the labio-dental, which is missing in bipa
voiceless labio-dental stop consonant	p͔	ditto
voiceless uvular nasal consonant	<?><!>	ᴎ͔
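Names such as "palatalized voiced alveolar consonant" fail because the manner is missing. A rough sketch of a check that flags such names before they are looked up follows; the set of manner terms is an illustrative assumption, not the actual BIPA feature inventory, and `lacks_manner` is a hypothetical helper:

```python
# Illustrative manner terms only; not taken from the actual BIPA feature list.
MANNERS = {
    "stop", "fricative", "affricate", "nasal", "trill",
    "tap", "approximant", "click", "implosive",
}


def lacks_manner(name):
    """Return True if a consonant feature name contains no known manner term."""
    words = name.split()
    if not words or words[-1] != "consonant":
        return False  # only consonant names are checked
    return not MANNERS.intersection(words)


names = [
    "palatalized voiced alveolar consonant",
    "palatalized voiced alveolar stop consonant",
    "voiceless uvular nasal consonant",
]
flagged = [name for name in names if lacks_manner(name)]
```

Only the first name is flagged here; the rewritten form with "stop" passes the check.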

tresoldi commented 6 years ago

Sorry, not sure if I understood:

Should I use this rudimentary data to generate a transcriptionsystem/upa resource?
Should I wait for the inclusion of the missing segments in BIPA or should I try to include them?

LinguList commented 6 years ago

yes, a transcriptionsystem/upa would be excellent, and yes, please add them, if they are missing ;)

LinguList commented 6 years ago

Excellent work! May I ask you in addition to add the following information:

a short comment in https://github.com/cldf/clts/blob/master/data/datasets.tsv (similar to our concepticon structure)
the bibtex of the source you mention in references.bib

Furthermore, you may want to consider adding some diacritics (for devoicing and revoicing, for example), so that the system becomes more productive. Let me know if the format is not clear.

tresoldi commented 6 years ago

Finally able to get back. Diacritics, markers, normalization and tones are missing, as they either do not exist or I am not sure about them. The only really important ones are the diacritics; I'll work on them later.

In [1]: from pyclts import TranscriptionSystem

In [2]: upa = TranscriptionSystem('upa')

In [3]: upa['ʙ']
Out[3]: <pyclts.models.Consonant: devoiced voiced bilabial stop consonant>

tresoldi commented 6 years ago

Done with the references, will work on the diacritics using the ISO standard.

tresoldi commented 6 years ago

I've just committed a few diacritics, but I am in doubt about the others.

Turned and sideways letters should probably be added as graphemes (they are "reduced" versions, usually lax vowels).
Vowel palatalization and velarization, as well as retracted and advanced forms, are not identical to advanced and retracted tongue root, but we might compromise here.
I'm not sure if the UPA diacritics for raising and lowering are precisely raising and lowering in the IPA sense, as they seem to be used to represent sounds without their own graphemes, and not exactly raised or lowered versions of the base sound.
Coarticulation due to surrounding sounds should probably be treated directly in graphemes, too.

LinguList commented 6 years ago

Super. Feel free to merge and many thanks!

tresoldi commented 6 years ago

Thank you. I'll work on the turned and sideways vowels when I get back home, then merge; it shouldn't take much.

I'll move to X-SAMPA later. Your deadline for the article was next week, right?

2018-01-26 19:26 GMT-02:00 Johann-Mattis List notifications@github.com:

Super. Feel free to merge and many thanks!


LinguList commented 6 years ago

Thank you. I'll work on the turned and sideways vowels when I get back home, then merge; it shouldn't take much. I'll move to X-SAMPA later. Your deadline for the article was next week, right?

Yes, we'll submit on Wednesday, but we can also make do with UPA, without X-SAMPA, for the time being, I'd say. We'll submit the code anonymously via osf-framework and show a few screenshots of the CLLD app. We have 12 transcription datasets right now, several sound classes, and 5 transcription systems including UPA. I think this is impressive enough, even if there are still a few bugs to be resolved.

tresoldi commented 6 years ago

Great. I'm merging UPA, then, after adding near-close near-front vowels. Two main issues:

Some graphemes are missing, mostly turned and rotated glyphs; I need to check the Unicode data and actual usage to see how they are encoded. As far as I can tell, these missing graphemes are not that common.
Some clean-up is probably due in the vowel listing, now that diacritics have been implemented. Of course, one needs to take care with pre-composed glyphs.
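To make both points concrete, here is a small sketch using only Python's standard unicodedata module. It lists the code points behind a grapheme and checks whether a pre-composed glyph and a base-plus-diacritic sequence are canonically the same. The helper names and the example characters are illustrative assumptions, not taken from the UPA tables:

```python
import unicodedata


def describe(grapheme):
    """Return (code point, Unicode name) pairs for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in grapheme
    ]


def canonically_equal(a, b):
    """Compare two graphemes after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00e4"   # single code point: a with diaeresis
decomposed = "a\u0308"   # base letter followed by combining diaeresis

print(describe(precomposed))
print(describe(decomposed))
print(canonically_equal(precomposed, decomposed))
```

The two spellings differ as raw strings but compare equal after NFD, which is why a listing that mixes pre-composed vowels with diacritic rules needs a consistent normalization step.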

LinguList commented 6 years ago

Some clean-up is probably due in the vowel listing, now that diacritics have been implemented. Of course, one needs to take care with pre-composed glyphs.

If vowels with diacritics are listed redundantly, this is even better. The diacritics are only a shortcut guaranteeing a better "generation".

cldf-clts / clts-legacy

Added preliminary work for Uralic Phonetic Alphabet. #82