bootphon / phonemizer

Simple text to phones converter for multiple languages
https://bootphon.github.io/phonemizer/
GNU General Public License v3.0

Commas and points inside numbers are considered like punctuation #87

Closed: donand closed this issue 1 year ago

donand commented 2 years ago

Describe the bug

The library does not transcribe commas and points inside numbers; it treats them as ordinary punctuation.

Phonemizer version

phonemizer-3.0
available backends: espeak-ng-1.50, segments-2.2.0
uninstalled backends: espeak-mbrola, festival

System

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal
Python 3.9.5

To reproduce

Italian

phonemize("4,16 metri", language='it', backend='espeak', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')

>> kwˈatːro ,sˈeditʃɪ mˈetrɪ

English

phonemize("4.16 meters", language='en-us', backend='espeak', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')

>> fˈoːɹ .sˈɪkstiːn mˈiːɾɚz

Expected behavior

Italian

phonemize("4,16 metri", language='it', backend='espeak', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')

>> kwˈatːro vˈirɡola sˈeditʃɪ mˈetrɪ

English

phonemize("4.16 meters", language='en-us', backend='espeak', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')

>> fˈoːɹ pɔɪnt wˈʌn sˈɪks mˈiːɾɚz

Additional context

If I set preserve_punctuation=False, the comma or point inside the number is simply dropped and not transcribed.

mmmaat commented 2 years ago

Hi, indeed this may be problematic. We need to experiment with this to detect whether a comma or point is surrounded by digits.
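
As a rough illustration of the detection described here, a regular expression with a lookbehind and a lookahead can locate commas or points that sit directly between two digits. This is only a sketch using Python's standard `re` module; the names `NUMERIC_SEPARATOR` and `find_numeric_separators` are hypothetical and not part of phonemizer.

```python
import re

# Matches a comma or point with a digit immediately before AND after it.
# Hypothetical name, for illustration only.
NUMERIC_SEPARATOR = re.compile(r'(?<=\d)[,.](?=\d)')


def find_numeric_separators(text):
    """Return (index, character) for each comma or point between two digits."""
    return [(m.start(), m.group()) for m in NUMERIC_SEPARATOR.finditer(text)]


print(find_numeric_separators("4,16 metri"))     # [(1, ',')]
print(find_numeric_separators("4.16 meters."))   # [(1, '.')]
print(find_numeric_separators("Hello, world."))  # []
```

The trailing point in "4.16 meters." is not reported because it is not followed by a digit, so it would still count as ordinary punctuation.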

jncasey commented 2 years ago

@donand With the most recently merged PR, it's now possible to achieve what you're after by defining the punctuation with regular expressions.

The default marks are defined as the string ;:,.!?¡¿—…"«»“”

If instead you set the marks to the regular expression [;:!?¡¿—…"«»“”]|[,.](?!\d) commas and periods followed by a digit won't be treated as punctuation.

import re
from phonemizer import phonemize

phonemize("4.16 meters", language='en-us', backend='espeak', preserve_punctuation=True, with_stress=True, language_switch='remove-flags', punctuation_marks=re.compile(r'[;:!?¡¿—…"«»“”]|[,.](?!\d)'))

Or, via the command line, with the new parameter --punctuation-marks-is-regex:

echo "4.16 meters" | phonemize --preserve-punctuation --with-stress --language-switch remove-flags --punctuation-marks '[;:!?¡¿—…"«»“”]|[,.](?!\d)' --punctuation-marks-is-regex 

returns

fˈoːɹ pɔɪnt wˈʌn sˈɪks mˈiːɾɚz
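
To see which characters the regular expression above classifies as punctuation, it can be tried on a few strings with Python's standard `re` module alone, without calling phonemizer. This is a minimal sketch; the helper name `punctuation_found` is hypothetical.

```python
import re

# The pattern suggested above: the usual marks, plus a comma or period
# only when it is NOT immediately followed by a digit.
MARKS = re.compile(r'[;:!?¡¿—…"«»“”]|[,.](?!\d)')


def punctuation_found(text):
    """Return the characters in text that the pattern treats as punctuation."""
    return [m.group() for m in MARKS.finditer(text)]


print(punctuation_found("4.16 meters"))         # []
print(punctuation_found("4,16 metri, circa."))  # [',', '.']
print(punctuation_found("x,5"))                 # []
```

The last case shows a limitation of this pattern: the lookahead only inspects the character after the mark, so a comma that follows a letter but precedes a digit is also excluded from punctuation. If that matters for your text, a lookbehind such as `(?<=\d)` could be combined with the lookahead, at the cost of a longer expression.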