Open kudanai opened 2 years ago
Here are some comments:
There are two ways of writing geminate consonants in the IPA: doubling the consonant symbol (e.g. /tt/) or adding the length mark after it (e.g. /tː/).
For reasons of parseability with PanPhon, the second solution is the approved Epitran solution (so <އް> could simply be mapped to /ː/). If you need doubling instead, you can achieve this with a regular expression and named groups, e.g.:
(?P<seg>(p|t|k)): -> \g<seg>\g<seg> / _
will change p:, t:, and k: to pp, tt, and kk.
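As a rough illustration of what that named-group substitution does, here is a minimal sketch in plain Python `re`, outside of Epitran. The helper name `double_geminates`, the default consonant set, and the use of the IPA length mark `ː` (rather than the ASCII colon in the rule above) are assumptions made for the example; it does not show how Epitran itself applies rules.

```python
import re

# Assumption: the IPA length mark (U+02D0) is used instead of an ASCII colon.
LENGTH_MARK = "ː"


def double_geminates(ipa, consonants="ptk"):
    """Hypothetical helper: rewrite consonant + length mark as a doubled consonant."""
    # The named group "seg" captures the consonant so it can be emitted twice.
    pattern = re.compile(rf"(?P<seg>[{re.escape(consonants)}]){LENGTH_MARK}")
    return pattern.sub(r"\g<seg>\g<seg>", ipa)


print(double_geminates("apːa atːa akːa"))  # → appa atta akka
```

Consonants outside the given set are left alone, so a long /sː/ would pass through unchanged unless `s` is added to `consonants`.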
The prefixed \s? in your rules is not doing any good since it doesn't rule anything out: a substring either is or is not preceded by a space. In any case, you should be using Epitran with already tokenized text rather than passages with internal whitespace. Otherwise, your rules look fine.
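The point about the optional prefix can be checked with a small sketch in plain Python `re` (not Epitran's rule engine). The helper `match_ends` and the sample string are made up for the illustration; it only shows that an optional leading `\s?` never excludes a match that the bare pattern would find.

```python
import re


def match_ends(pattern, text):
    """Hypothetical helper: end offset of every non-overlapping match."""
    return [m.end() for m in re.finditer(pattern, text)]


sample = "ak ka kk"
with_prefix = match_ends(r"\s?k", sample)
without_prefix = match_ends(r"k", sample)

# Every "k" is found either way; the optional space filters nothing out.
print(with_prefix == without_prefix)  # → True
```

The only difference is that, where a space is present, the prefixed pattern also consumes it as part of the match, which matters for what gets replaced but not for where the rule applies.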
Thank you for the comments.
First, on the \s: they are a bit tricky in this script. The effects of the next consonant on އް can go beyond the token boundary. What this probably means is that I need a better tokeniser than the currently available ones. I will investigate this further; it will be sorted out before I request a merge.
On the geminates, at first I gave this a try, which did not seem to work (I'm not sure if I'm writing that rule wrong or if the \g syntax just isn't working for me). So, taking your suggestion on using the long mark, I rewrote it using the swap groups:
## This did not work
<AS>(?P<seg>::consonant::) -> \g<seg>\g<seg> / _
## outputs
ނުގެންނަން މުވައްޒަފުންގެ
nuɡennammuʋa\g<seg>\g<seg>afuŋɡe
but
## This works
(?P<sw1><AS>)\s?(?P<sw2>::consonant::) -> 0 / _
<AS> -> : / (::consonant::) _ (::vowel::)
## outputs
ނުގެންނަން މުވައްޒަފުންގެ
nuɡennammuʋaz:afuŋɡe
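For readers following along, the two working rules can be approximated in plain Python `re` as a swap followed by a contextual replacement. This is only a sketch under assumptions: `<AS>` is the placeholder used in this thread, the small `CONSONANTS` and `VOWELS` sets stand in for `::consonant::` and `::vowel::`, dropping the optional whitespace during the swap is a choice made for the example, and the function name is hypothetical. It is not Epitran's implementation of swap groups.

```python
import re

# Assumed stand-ins for ::consonant:: and ::vowel:: (not the real rule classes).
CONSONANTS = "bdfhklmnprstzɡʋŋ"
VOWELS = "aeiou"
AS = "<AS>"  # placeholder used in the thread for އް


def geminate_with_length_mark(text):
    """Hypothetical helper mimicking the swap rule, then the length-mark rule."""
    # Step 1: move the placeholder to the far side of the next consonant,
    # dropping an optional space in between (an assumption of this sketch).
    swapped = re.sub(
        rf"{re.escape(AS)}\s?(?P<c>[{CONSONANTS}])",
        rf"\g<c>{AS}",
        text,
    )
    # Step 2: between a consonant and a vowel, the placeholder becomes ":".
    return re.sub(
        rf"(?<=[{CONSONANTS}]){re.escape(AS)}(?=[{VOWELS}])",
        ":",
        swapped,
    )


print(geminate_with_length_mark("muʋa<AS>zafuŋɡe"))  # → muʋaz:afuŋɡe
```

A placeholder that is not followed by a consonant, or that does not end up between a consonant and a vowel, is left in place for later rules to handle.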
Does this have an impact on affricates? We have two: /d͡ʒ/ and /t͡ʃ/.
Also, I'm hesitant to simply map އް to : since it would complicate the post-processing rules: the language uses a lot of long vowels, and އް can also cause pre-nasalisation or serve as a glottal stop depending on context.
I'm trying to write post-processing rules for div-Thaa over on this fork. The rules dictate that occurrences of certain graphemes (އް) have the effect of turning the next consonant into a geminate in some situations. I can't seem to figure out whether this can be done with a single regex rule with a match group.
For the time being I've added the cases as individual rules here. The rules in question are the ones with <AS> in them.
TL;DR;
Apologies in advance if this is a redundant question and I missed something in the docs.