dmort27 / epitran

A tool for transcribing orthographic text as IPA (International Phonetic Alphabet)
MIT License

Regex + rules difference between [] and (). #70

Closed: OrderAndCh4oS closed this issue 3 years ago

OrderAndCh4oS commented 3 years ago

In the fra-Latn.txt preprocessor rules, some matches use [] and others use ():

::vowel:: = a|á|â|æ|e|é|è|ê|ë|i|î|ï|o|ô|œ|u|ù|û|ü|A|Á|Â|Æ|E|É|È|Ê|Ë|I|Î|Ï|O|Ô|Œ|U|Ù|Û|Ü|ɛ
::front_vowel:: = e|é|è|ê|ë|i|î|ï|y|E|É|È|Ê|Ë|I|Î|Ï|Y|ɛ
::consonant:: = b|ç|c|ch|d|f|g|j|k|l|m|n|p|r|s|t|v|w|z|ʒ
% Treatment of <c> and <s>
sc -> s / _ [::front_vowel::]
c -> s / _ [::front_vowel::]

% High vowels become glides before vowels
ou -> w / _ (::vowel::)
u -> ɥ / _ (::vowel::)

Is there a difference in behaviour between the two?

Am I right in thinking the following?

From what I understand of regex, the two look as though they'd do the same thing, except that [::front_vowel::] would also match the | character.

I also don't think [] would work when an alternative is two or more characters long, for example ch in:

::consonant:: = b|ç|c|ch|d|f|g|j|k|l|m|n|p|r|s|t|v|w|z|ʒ

I'd guess that () also creates capturing groups, but I'm not sure whether that's being used.
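
To make the points above concrete, here is a minimal sketch using Python's standard `re` module. It assumes the `::name::` macros are expanded by plain text substitution before the pattern is compiled; that is an assumption made for illustration, not a statement about Epitran's internals. The macro strings are shortened stand-ins for the ones in fra-Latn.txt.

```python
import re

# Shortened, hypothetical stand-ins for the macros in fra-Latn.txt.
front_vowel = "e|é|i|y"
consonant = "b|c|ch|d"

# Assumed expansion: the macro text is pasted between the delimiters.
bracket_fv = re.compile("[" + front_vowel + "]")  # character class
paren_fv = re.compile("(" + front_vowel + ")")    # alternation group

# Inside [], "|" is a literal character, so the class also matches a pipe.
class_matches_pipe = bracket_fv.fullmatch("|") is not None
group_matches_pipe = paren_fv.fullmatch("|") is not None

# A character class matches exactly one character, so "ch" never matches as a unit.
bracket_c = re.compile("[" + consonant + "]")
paren_c = re.compile("(" + consonant + ")")
class_matches_ch = bracket_c.fullmatch("ch") is not None
group_matches_ch = paren_c.fullmatch("ch") is not None

# Parentheses create a capturing group; a character class does not.
class_group_count = bracket_fv.groups
paren_group_count = paren_fv.groups

print(class_matches_pipe, group_matches_pipe)
print(class_matches_ch, group_matches_ch)
print(class_group_count, paren_group_count)
```

Under these assumptions, the bracketed form accepts a stray `|` and cannot match `ch`, while the parenthesised form behaves as an alternation and adds one capturing group. Whether that capture is used anywhere depends on the rule engine.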

Any guidance would be greatly appreciated.

dmort27 commented 3 years ago

These are artifacts of an earlier time. Square brackets should be replaced by parentheses.
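
As a rough sketch of that replacement, the following Python snippet rewrites bracketed macro references in rule text. The function name `brackets_to_parens` and the pattern are hypothetical helpers written for this example, not part of Epitran; the snippet only touches references of the exact form `[::name::]`.

```python
import re

# Matches a macro reference wrapped in square brackets, e.g. "[::front_vowel::]".
BRACKETED_MACRO = re.compile(r"\[(::\w+::)\]")


def brackets_to_parens(rule_text: str) -> str:
    """Rewrite every [::name::] as (::name::), leaving all other text alone."""
    return BRACKETED_MACRO.sub(r"(\1)", rule_text)


# Two rules copied from the question above.
rules = "sc -> s / _ [::front_vowel::]\nc -> s / _ [::front_vowel::]\n"
fixed = brackets_to_parens(rules)
print(fixed)
```

Ordinary character classes such as `[aeiou]` are left as they are, since they contain no `::name::` reference.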

OrderAndCh4oS commented 3 years ago

Thank you so much for clearing that up. I suspected that was the case, just wanted to make sure I hadn't overlooked something.

Superb project, by the way. It has been a great help.