dmort27 / epitran

A tool for transcribing orthographic text as IPA (International Phonetic Alphabet)
MIT License

Possibility of using regex for generating geminates #105

Open kudanai opened 2 years ago

kudanai commented 2 years ago

I'm trying to write post-processing rules for div-Thaa over on this fork.

The rules dictate that occurrences of certain graphemes (އް) cause the next consonant to become a geminate in some situations. I can't figure out whether this can be done with a single regex rule with a match group.

For the time being I've added the cases as individual rules here. The rules in question are the ones with <AS> in them.

TL;DR;

Apologies in advance if this is a redundant question and I missed something in the docs.

dmort27 commented 2 years ago

Here are some comments:

There are two ways of writing geminate consonants in the IPA:

  1. Doubling the consonant (unless it is an affricate, in which case the plosive is doubled)
  2. Using the long mark (ː).

For reasons of parseability with PanPhon, the second solution is the approved Epitran solution (so <އް> could simply be mapped to /ː/). If you need doubling instead, you can achieve this with a regular expression and named groups, e.g.:

(?P<seg>(p|t|k)): -> \g<seg>\g<seg> / _

will change p:, t:, and k: to pp, tt, and kk.
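
As a rough illustration of the named-group idea, here is a minimal sketch using Python's standard `re` module rather than Epitran's rule files. The function name `double_geminates` is hypothetical, and the length mark is written as an ASCII colon, as in the rule above.

```python
import re

# A consonant followed by a length mark (ASCII colon here, as in the rule above).
# The consonant is captured in the named group "seg".
GEMINATE = re.compile(r"(?P<seg>p|t|k):")


def double_geminates(ipa: str) -> str:
    """Rewrite p:, t:, and k: as pp, tt, and kk using a named group."""
    # \g<seg> in a string replacement refers back to the named group.
    return GEMINATE.sub(r"\g<seg>\g<seg>", ipa)


print(double_geminates("ap:a"))     # appa
print(double_geminates("at:ik:a"))  # attikka
```

Note that `\g<seg>` is only expanded here because the replacement is passed to `re.sub` as a string; whether a given rule engine expands it in the same way is a separate question.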

The prefixed \s? in your rules is not doing any good since it doesn't rule anything out—a substring either is or is not preceded by a space. In any case, you should be using Epitran with already tokenized text rather than passages with internal whitespace. Otherwise, your rules look fine.

kudanai commented 2 years ago

Thank you for the comments.

First, on the \s: spaces are a bit tricky in this script. The effect of the next consonant on އް can extend beyond the token boundary. This probably means I need a better tokeniser than the ones currently available. I will investigate this further and sort it out before I request a merge.

On the geminates, I first tried the rule below, which did not seem to work (I'm not sure whether I'm writing the rule wrong or the \g syntax just isn't working for me). So, taking your suggestion to use the long mark, I rewrote it using the swap groups:

## This did not work
<AS>(?P<seg>::consonant::) -> \g<seg>\g<seg> / _

## outputs
ނުގެންނަން މުވައްޒަފުންގެ
nuɡennammuʋa\g<seg>\g<seg>afuŋɡe

but

## This works
(?P<sw1><AS>)\s?(?P<sw2>::consonant::) -> 0 / _
<AS> -> : / (::consonant::) _ (::vowel::)

## outputs
ނުގެންނަން މުވައްޒަފުންގެ
nuɡennammuʋaz:afuŋɡe
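
For readers who want to trace the two working rules by hand, here is a minimal sketch of the same two steps in plain Python `re`. It is not Epitran's rule engine. The `CONSONANT` and `VOWEL` classes are hypothetical stand-ins for ::consonant:: and ::vowel::, and the function name `geminate` is made up for illustration.

```python
import re

# Hypothetical stand-ins for the ::consonant:: and ::vowel:: classes.
CONSONANT = "[zmnfʋɡŋ]"
VOWEL = "[aeiou]"

# Step 1: swap <AS> with the following consonant.
# In this sketch, the optional whitespace between them is dropped.
SWAP = re.compile(rf"(?P<sw1><AS>)\s?(?P<sw2>{CONSONANT})")

# Step 2: <AS> between a consonant and a vowel becomes the length mark.
LENGTHEN = re.compile(rf"(?<={CONSONANT})<AS>(?={VOWEL})")


def geminate(text: str) -> str:
    """Apply the swap, then turn the moved <AS> into ':'."""
    swapped = SWAP.sub(lambda m: m.group("sw2") + m.group("sw1"), text)
    return LENGTHEN.sub(":", swapped)


print(geminate("muʋa<AS>zafuŋɡe"))  # muʋaz:afuŋɡe
```

The intermediate string after step 1 is `muʋaz<AS>afuŋɡe`, so the marker ends up after the consonant it lengthens.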

Does this have an impact on affricates? We have two: /d͡ʒ/ and /t͡ʃ/.

Also, I'm hesitant to simply map އް to : because it would complicate the post-processing rules: the language uses a lot of long vowels, and އް can also cause pre-nasalisation or serve as a glottal stop depending on context.