cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

adding state-of-the-art x-sampa conversion #28

Open LinguList opened 6 years ago

LinguList commented 6 years ago

Following up from this issue, we should have an orthography profile filled in with many X-SAMPA symbols that would be converted to our BIPA standard defined in CLTS.

bambooforest commented 6 years ago

How about something like this, but in the orthography profile specification (+ @xrotwang):

Each value lists: informal name, tipa, praat, xsampa

ipacharmap={'p':['vl bilabial plosive','p','p','p'], 'b':['vd bilabial plosive','b','b','b'], 't':['vl alveolar plosive','t','t','t'], 'd':['vd alveolar plosive','d','d','d'], 'ʈ':['vl retroflex plosive','\:t','\t.','t'], 'ɖ':['vd retroflex plosive','\:d','\d.','d'], 'ɟ':['vd palatal plosive','\textbardotlessj','\j^','j'], 'k':['ld velar plosive','k','k','k'], 'ɡ':['vd velar plosive','g','\gs','g'], 'q':['vl uvular plosive','q','q','q'], 'ɢ':['vd uvular plosive','\;G','\gc','G\\'], 'ʔ':['glottal plosive','P','\?g','?'], 'm':['bilabial nasal','m','m','m'], 'ɱ':['vl labiodental nasal','M','\mj','F'], 'n':['alveolar nasal','n','n','n'], 'ɳ':['vl retroflex nasal','\:n','\\n.','n'], 'ɲ':['vl palatal nasal','\textltailn','\nj','J'], 'ŋ':['vl velar nasal','N','\ng','N'], 'ɴ':['vl uvular nasal','\;N','\nc','N\'], 'ʙ':['vd bilabial trill','\;B','\bc','B\'], 'r':['vd alveolar trill','r','r','r'], 'ʀ':['vl uvular trill','\;R','\rc','R\'], '':['labiodental flap','none','none','none'], 'ɾ':['vl alveolar tap','R','\fh','4'], 'ɽ':['vl retroflex flap','\:r','\f.','r'], 'ɸ':['vl bilabial fricative','F','\\ff','p\\'], 'β':['vd bilabial fricative','B','\\bf','B'], 'f':['vl labiodental fricative','f','f','f'], 'v':['vd labiodental fricative','v','v','v'], 'θ':['vl dental fricative','T','\\tf','T'], 'ð':['vd dental fricative','D','\dh','D'], 's':['vl alveolar fricative','s','s','s'], 'z':['vd alveolar fricative','z','z','z'], 'ʃ':['vl postalveolar fricative','S','\sh','S'], 'ʒ':['vd postalveolar fricative','Z','\zh','Z'], 'ʂ':['vl retroflex fricative','\:s','\s.','s'], 'ʐ':['vd retroflex fricative','\:z','\z.','z'], 'ç':['vl palatal fricative','\c{c}','\c,','C'], 'ʝ':['vd palatal fricative','J','\jc','j\\'], 'x':['vl velar fricative','x','x','x'], 'ɣ':['vd velar fricative','G','\gf','G'], 'χ':['vl uvular fricative','X','\cf','X'], 'ʁ':['vd uvular fricative','K','\\ri','R\\'], 'ħ':['vl pharyngeal fricative','\\textcrh','\h-','X\\'], 'ʕ':['vd pharyngeal fricative','Q','\9e','?\\'], 
'h':['vl glottal fricative','h','h','h'], 'ɦ':['vd glottal fricative','H','\h^','h\\'], 'ɬ':['vl alveolar lateral fricative','\\textbeltl','\l-','K'], 'ɮ':['vd alveolar lateral fricative','\\textlyoghlig','\lz','K\\'], 'ʋ':['vd labiodental approximant','V','\\vs','P'], 'ɹ':['vd (post)alveolar approximant','\*r','\\rt','r\\'], 'ɻ':['vd retroflex approximant','\:R','\\r.','r\'], 'ɰ':['vd velar approximant','\textturnmrleg','\ml','M\'], 'l':['vd alveolar lateral approximant','l','l','l'], 'ɭ':['vd retroflex lateral approximant','\:l','\l.','l`'], 'ʎ':['vd palatal lateral approximant','L','\yt','L'], 'ʟ':['vd velar lateral approximant','\;L','\lc','L\'], 'ɕ':['vl alveolopalatal fricative','C','\cc','s\'], 'ʤ':['vd postalveolar affricate','\textdyoghlig','none','dZ'], 'ɧ':['vl multiple-place fricative','\texththeng','\hj','x\'], 'ɥ':['labial-palatal approximant','4','\ht','H'], 'ʜ':['vl epiglottal fricative','\;H','\hc','H\'], 'ɫ':['velarized vl alveolar lateral','\textltilde','\l~','5'], 'ɺ':['vl alveolar lateral flap','\textturnlonglegr','\rl','l\'], 'ʧ':['vl postalveolar affricate','\textteshlig','none','tS'], 'ʍ':['vl labial-velar fricative','*w','\wt','W'], 'ʑ':['vl alveolopalatal fricative','\textctz','\zc','z\'], 'ʡ':['vl epiglottal plosive','\textbarglotstop','\?-','>\'], 'ʢ':['vl epiglottal fricative','\textbarrevglotstop','\9-','<\'], 'w':['vd labio-velar approximant','w','w','w'], 'ʘ':['bilabial click','!o','\O.','O\'], 'ǀ':['dental click','\textpipe','|1','|\'], 'ǃ':['retroflex click','!','!','!\'], 'ǂ':['alveolar click','\textdoublebarpipe','|-','=\'], 'ǁ':['alveolar lateral click','\textdoublepipe','|2','||\'], 'ɓ':['vl bilabial implosive','!b','\b^','b<'], 'ɗ':['vl alveolar implosive','!d','\d^','d<'], 'ʄ':['vl palatal implosive','!j','\i-','J_<'], 'ɠ':['vl velar implosive','!g','\g^','g<'], 'ʛ':['vl uvular implosive','!G','\G^','G_<'], 'ʼ':['ejective','\'','none','_>'], 'ʴ':['rhotacized','UNKNOWN','none','none'], 
'ʰ':['aspirated','\textsuperscript{h}','H','_h'], 'ʱ':['breathy-voice-aspirated','\textsuperscript{H}','none','none'], 'ʲ':['palatalized','\textsuperscript{j}','J','\''], 'ʷ':['labialized','\textsuperscript{w}','W','_w'], 'ˠ':['velarized','\textsuperscript{\textgamma}','none','G'], 'ˤ':['pharyngealized','\textsuperscript{\textrevglotstop}','none','?\'], '˞':['rhotacized','\textrhoticity','none','@'], '̥':['voiceless','\\r*','\0v','_0'], '̊':['vl voiceless (use if character has descender)','UNKNOWN\\r{}','none','none'], '̤':['breathy voiced','\"*','\:v','_t'], '̪':['dental','\|[','\Nv','_d'], '̬':['voiced','\\v*','none','_v'], '̰':['creaky voiced','\~*','\~v','_k'], '̺':['apical','\|]','none','_a'], '̼':['linguolabial','UNKNOWN\|m{}','none','_N'], '̻':['laminal','UNKNOWN\\textsubsquare{}','none','_m'], '̚':['not audibly released','\\textcorner','none','_}'], '̹':['more rounded','\|)','\3v','_O'], '̃':['nasalized','\~','\~^\'','~'], '̜':['less rounded','\|(','\cv','_c'], '̟':['advanced','\|+','\+v','_+'], '̠':['retracted','\=*','\-v','_-'], '̈':['centralized','\"','\:^','_"'], '̴':['velarized or pharyngealized','none','none','_e'], '̽':['mid-centralized','UNKNOWN\|x{}','none','_x'], '̝':['raised','\|\'','\T^','_r'], '̩':['syllabic','UNKNOWN\s{}','\|v','='], '̞':['lowered','\|','\Tv','o'], '̯':['non-syllabic','UNKNOWN\textsubarch{}','none','^'], '̘':['advanced tongue root','|<','none','_A'], '̙':['retracted tongue root','|>','none','_q'], 'ˈ':['(primary) stress mark','"','\\'1','"'], 'ˌ':['secondary stress','""','\\'2','%'], 'ː':['length mark',':','\:f',':'], 'ˑ':['half-length',';','none',':\'], '̆':['extra-short','UNKNOWN\u{}','none','_X'], '͜':['tie bar below','UNKNOWN\t{}','none','-\'], '͡':['tie bar above','UNKNOWN\t{}','\li','none'], '|':['Minor (foot) group','\textvertline','|','|'], '‖':['Major (intonation) group','\textdoublevertline','||','||'], '.':['syllable break','.','.','none'], '̋':['extra high tone','UNKNOWN\H{}','none','_T'], '́':['high 
tone','HTONE','\\'^','_H'], '̄':['mid tone','MTONE','-^','_M'], '̀':['low tone','LTONE','`^','_L'], '̏':['extra low tone','\H','none','_B'], '̌':['rising tone','??\v{}','UNKNOWN','_L_H'], '̂':['falling tone','\^','none','_H_L'], '↓':['downstep','\textdownstep','none','<!>'], '↑':['upstep','\textupstep','none','<^>'], '↗':['global rise','\textglobrise','none','</>'], '↘':['global fall','\textglobfall','none','<>'], 'i':['close front unrounded','i','i','i'], 'y':['close front rounded','y','y','y'], 'e':['close-mid front unrounded','e','e','e'], 'a':['open front unrounded','a','a','a'], 'u':['close back rounded','u','u','u'], 'o':['close-mid back rounded','o','o','o'], 'ɑ':['open back unrounded','A','\as','A'], 'ɐ':['open-mid schwa','5','\at','6'], 'ɒ':['open back rounded','6','\ab','Q'], 'æ':['raised open front unrounded','\ae','\ae','{'], 'ɔ':['open-mid back rounded','O','\ct','O'], 'ə':['schwa','@','\sw','@'], 'ɘ':['close-mid schwa','9','\e-','@\'], 'ɚ':['rhotacized schwa','\textrhookschwa','\sr','@'], 'ɛ':['open-mid front unrounded','E','\ef','E'], 'ɜ':['open-mid central','3','\er','3'], 'ɝ':['rhotacized open-mid central','\\textrhookrevepsilon','none','3'], 'ɞ':['open-mid central rounded','\textcloserevepsilon','\kb','3\'], 'ɨ':['close central unrounded','1','\i-','1'], 'ɪ':['lax close front unrounded','I','\ic','I'], 'ɯ':['close back unrounded','W','\mt','M'], 'ø':['front close-mid rounded','\o','\o/','2'], 'ɵ':['rounded schwa','8','\o-','8'], 'œ':['front open-mid rounded','\oe','\oe','9'], 'ɶ':['front open rounded','\OE','\Oe','&'], 'ʉ':['close central rounded','\textbaru','\u-','}'], 'ʊ':['lax close back rounded','U','\hs','U'], 'ʌ':['open-mid back unrounded','\textturnv','\vt','V'], 'ɤ':['close-mid back unrounded','\textramshorns','\rh','7'], 'ʏ':['lax close front rounded','Y','\yc','Y'], }

From a very old project that I worked on:

https://github.com/sofarrar/e-linguistics/blob/master/eltk/eltk/utils/CharConverter.py

LinguList commented 6 years ago

No, I'd prefer something like our masterlist of sounds which we extracted from comparing transcription datasets, like phoible. Your list is giving the glyphs, not the graphemes, but what we need is all potential graphemes to have a useful orthography profile, and we happen to have a big masterlist for this (see here).

xrotwang commented 6 years ago

Wait, so is this X-SAMPA or SAMPA we're talking about? Isn't there a big difference in X-SAMPA allowing for all sorts of local/custom extensions?

bambooforest commented 6 years ago

(Shameless copy & paste from the Unicode Cookbook, page 43):

SAMPA (Wells et al. 1992) is an ASCII representation of the IPA symbols.

Two problems with SAMPA are that (i) it is only a partial encoding of the IPA and (ii) it encodes different languages in separate data tables, instead of using a universal alphabet, like IPA. SAMPA is essentially a hack to work around displaying IPA characters, but it provided speech technology and other fields a basis that has been widely adopted and often still used in code.

An extended version of SAMPA, called X-SAMPA, set out to include every symbol, including all diacritics, in the IPA chart (Wells 1995). X-SAMPA is considered more universally applicable because it consists of one table that encodes all characters in IPA. In line with the principles of the IPA, SAMPA and X-SAMPA include a repertoire of symbols. These symbols are intended to represent phonemes rather than all allophonic distinctions. Additionally, both ASCII-ifications of IPA – SAMPA and X-SAMPA – are (reportedly) uniquely parsable (Wells 1995). However, like the IPA, X-SAMPA has different notations for encoding the same phonetic phenomena (cf. Section 4.5).

LinguList commented 6 years ago

We just need one version of X-SAMPA. There are not many people using it, and I have my own version with a few extensions that works rather well. For a more or less frequently used version of X-SAMPA, starting from lingpy's conversion table is probably best, since it is taken from the Groningen folk, who have been using X-SAMPA in almost all of their projects. For this reason, one could consider their use of X-SAMPA as some kind of standard.

xrotwang commented 6 years ago

I see. Funny I had this backwards :)

tresoldi commented 6 years ago

The TranscriptionSystem for X-SAMPA I submitted was, in fact, a slightly altered subset of the VIM mapping by @lingulist (I probably should have mentioned it before).

xrotwang commented 6 years ago

@LinguList But is there a reason why an X-SAMPA orthography profile should live in the segments package rather than in CLTS, considering that it will typically be updated when CLTS is updated?

LinguList commented 6 years ago

Yes, we discussed this at length with @tresoldi: X-SAMPA translates IPA to ASCII; it is not a transcription system, so we would have to write new code in CLTS to accommodate it (the current CLTS code cannot deal with X-SAMPA, as X-SAMPA blurs the distinction between "diacritic" and "base character", and CLTS relies on this distinction for its regexes). Also, given that X-SAMPA normally reflects rather straightforward IPA scripts, it seems better accommodated in segments, as X-SAMPA is also about segmenting things properly (and X-SAMPA is normally NOT segmented when written). The same, btw, applies to the IPA I recommended, as it could be used to replace or add to lingpy's segmentation algo...

xrotwang commented 6 years ago

Ok, one more question: Is it the case (at least in theory) that X-SAMPA text can be converted to IPA text by naively replacing the glyphs (as listed in @bambooforest's mapping)? If so, X-SAMPA would be just a replacement list to be run before segmenting using an IPA orthography profile, right?

xrotwang commented 6 years ago

Btw. what I hint at above would be one of the things that become simple (or at least very transparent) with a metadata description file à la CSVW: An X-SAMPA orthography profile would reference the IPA orthography profile in the table description, but also include a reference to a replacement list, basically transliterating the text as a pre-processing step.

tresoldi commented 6 years ago

Yes, provided we are clear that a glyph might be more than one character -- i.e., just like IPA.

xrotwang commented 6 years ago

Yes, looks like a glyph can be made up of more than one char:

'ʄ':['vl palatal implosive','!j','\i-','J_<']

But this would still work in a scheme where longest glyphs are replaced first, right?

LinguList commented 6 years ago

Yes, it should; in fact, all SAMPA parsers are built on this system. What has long been missing is a proof of concept that we have good IPA segmenters and good SAMPA segmenters based on segments...
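To make the "longest glyphs first" idea concrete, here is a minimal sketch in plain Python. It does not use the segments API; the function name `xsampa_to_ipa` and the small table are assumptions for illustration, with the entries taken from the X-SAMPA column of the mapping pasted above.

```python
import re

# Small excerpt of X-SAMPA -> IPA pairs from the mapping pasted earlier.
# A real table would be much larger and curated (assumption).
XSAMPA_TO_IPA = {
    "s": "s",
    "X": "χ",
    "a": "a",
    "?": "ʔ",
    "q": "q",
    "_w": "ʷ",
    "J": "ɲ",
    "J_<": "ʄ",
    "G": "ɣ",
    "G_<": "ʛ",
}


def xsampa_to_ipa(text, table=XSAMPA_TO_IPA):
    """Replace X-SAMPA glyphs with IPA, trying the longest glyphs first."""
    # Python tries regex alternatives left to right, so sorting the keys by
    # descending length makes "J_<" win over "J" at the same position.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # Characters that are not in the table pass through unchanged.
    return pattern.sub(lambda match: table[match.group(0)], text)


print(xsampa_to_ipa("sXa?q_wa?"))  # sχaʔqʷaʔ
print(xsampa_to_ipa("J_<a"))       # ʄa
```

Without the length sort, "J_<" could be read as "J" followed by leftover characters, which is exactly the ambiguity the multi-character glyphs introduce.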

xrotwang commented 6 years ago

So with some tweaks this already works:

$ segments --profile=xsampa.tsv --mapping=IPA tokenize "sXa?q_wa?" | tr -d " " | segments --profile=../clts/data/sounds.tsv tokenize
s χ a ʔ qʷ a ʔ

So it appears that X-SAMPA segmentation could be done as concatenation of two orthography profiles. Maybe this is the functionality segments should support - a pipeline of profiles?
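A rough sketch of what such a pipeline could look like, written in plain Python rather than against the segments API. The helpers `tokenize` and `pipeline` and the two tiny dict-based profiles are hypothetical stand-ins for the real profile files; the joining step mirrors the `tr -d " "` in the shell command.

```python
def tokenize(text, profile):
    """Greedy longest-match segmentation; returns the mapped segments."""
    # Skip empty graphemes, which would otherwise never advance the cursor.
    graphemes = sorted((g for g in profile if g), key=len, reverse=True)
    segments, i = [], 0
    while i < len(text):
        for grapheme in graphemes:
            if text.startswith(grapheme, i):
                segments.append(profile[grapheme])
                i += len(grapheme)
                break
        else:
            # Unknown character: emit a placeholder and move on (assumption).
            segments.append("\ufffd")
            i += 1
    return segments


def pipeline(text, profiles):
    """Apply the profiles in order, feeding each output into the next."""
    segments = []
    for profile in profiles:
        segments = tokenize(text, profile)
        text = "".join(segments)  # same role as `tr -d " "` above
    return segments


# Hypothetical miniature profiles: grapheme -> output.
XSAMPA = {"s": "s", "X": "χ", "a": "a", "?": "ʔ", "q": "q", "_w": "ʷ"}
IPA = {"s": "s", "χ": "χ", "a": "a", "ʔ": "ʔ", "q": "q", "qʷ": "qʷ"}

print(" ".join(pipeline("sXa?q_wa?", [XSAMPA, IPA])))  # s χ a ʔ qʷ a ʔ
```

The first profile only transliterates; the second decides which characters belong together in one segment, which is why "q" and "ʷ" end up merged.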

xrotwang commented 6 years ago

In the above, xsampa.tsv is created from the dictionary @bambooforest pasted, as xsampa.txt
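For reference, a conversion along these lines could look roughly like the sketch below. It assumes the profile is a tab-separated file with a `Grapheme` column plus an `IPA` column (matching `--mapping=IPA` in the command above); the helper names `profile_rows` and `write_profile` are made up for illustration, and only a few entries of the dictionary are shown.

```python
import csv
import io

# Excerpt in the shape of the pasted dictionary:
# IPA glyph -> [informal name, tipa, praat, xsampa]
ipacharmap = {
    "χ": ["vl uvular fricative", "X", "\\cf", "X"],
    "ʔ": ["glottal plosive", "P", "\\?g", "?"],
    "ʷ": ["labialized", "\\textsuperscript{w}", "W", "_w"],
    "ʴ": ["rhotacized", "UNKNOWN", "none", "none"],
}


def profile_rows(charmap):
    """Turn the glyph dictionary into Grapheme/IPA rows keyed on X-SAMPA."""
    rows = []
    for ipa, (_name, _tipa, _praat, xsampa) in charmap.items():
        # Entries without an X-SAMPA equivalent are marked 'none'; skip them.
        if xsampa and xsampa != "none":
            rows.append({"Grapheme": xsampa, "IPA": ipa})
    return rows


def write_profile(rows, handle):
    """Write the rows as a tab-separated table with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=["Grapheme", "IPA"], delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_profile(profile_rows(ipacharmap), buffer)
print(buffer.getvalue())
```

Note that the pasted dictionary gives the same X-SAMPA string for several glyphs (for example '@' for 'ə', 'ɚ' and '˞'), so a real conversion would need to resolve such duplicates by hand.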

xrotwang commented 6 years ago

@LinguList @tresoldi Btw. a pipeline of two profiles could also be used in the lexibank context to model the blacklist/replacement list we have in lexemes.tsv at the moment.

tresoldi commented 6 years ago

It sounds interesting, and it is a good way to explain the difference between an IPA glyph and what it represents (especially if we come up with things that strict IPA can't do). Plus, piping sounds really Unix-like.

I didn't really understand, however, how this would replace the blacklist/replacements in lexemes.tsv.

xrotwang commented 6 years ago

lexemes.tsv would simply be the first orthography profile to apply - making the whole process conceptually simpler.

xrotwang commented 6 years ago

@tresoldi it seems I was a bit over-optimistic wrt replacing lexemes.tsv by an additional profile, because in lexemes.tsv we list possibly complete phrases, including whitespace etc.

tresoldi commented 6 years ago

Yes, that is what I had in mind. Nonetheless, the orthographic concatenation is still useful and elegant, and might make sure that people (including myself) only use lexemes.tsv as errata (which I believe was the intention in the first place). :)

xrotwang commented 6 years ago

@tresoldi see #34 for a proposal on how to merge lexemes.tsv into a profile after all.

tresoldi commented 6 years ago

Yes, it is clearer now. The question is: should exceptions that demand what is in fact an if-clause (like the "bischen" example you gave) be handled by the profile or by lexemes.tsv?

A good practice might be to put generalizations of exceptions (like words that behave similarly to "bischen", with such orthography due to the language's history) in the profile, reserving lexemes.tsv for true exceptions (like an Italian loanword into German where "sch" is /s k/, as in the musical term "scherzo"). What do you think?