Double punctuation breaks phonemization

cfrancesco commented 4 years ago

I do not have an exhaustive list, but many double-punctuation patterns break phonemization. One example is !' (an exclamation mark followed by an apostrophe). I am using phonemizer 2.2, installed from pip.

~/anaconda3/envs/ttsTF/lib/python3.6/site-packages/phonemizer/phonemize.py in phonemize(text, language, backend, separator, strip, preserve_punctuation, punctuation_marks, with_stress, language_switch, njobs, logger)
    172     # phonemize the input text
    173     return phonemizer.phonemize(
--> 174         text, separator=separator, strip=strip, njobs=njobs)

~/anaconda3/envs/ttsTF/lib/python3.6/site-packages/phonemizer/backend/espeak.py in phonemize(self, text, separator, strip, njobs)
    233         # finally restore the punctuation
    234         return self._phonemize_postprocess(
--> 235             text, text_type, punctuation_marks)
    236 
    237     def _command(self, fname):

~/anaconda3/envs/ttsTF/lib/python3.6/site-packages/phonemizer/backend/base.py in _phonemize_postprocess(self, text, text_type, punctuation_marks)
    138         # restore the punctuation is asked for
    139         if self.preserve_punctuation:
--> 140             text = self._punctuator.restore(text, punctuation_marks)
    141 
    142         # output the result formatted as a string or a list of strings

~/anaconda3/envs/ttsTF/lib/python3.6/site-packages/phonemizer/punctuation.py in restore(cls, text, marks)
    147 
    148         """
--> 149         return cls._restore_aux(str2list(text), marks, 0)
    150 
    151     @classmethod

~/anaconda3/envs/ttsTF/lib/python3.6/site-packages/phonemizer/punctuation.py in _restore_aux(cls, text, marks, num)
    162             if current.position == 'E':
    163                 return [text[0] + current.mark] + cls._restore_aux(
--> 164                     text[1:], marks[1:], num + 1)
    165             if current.position == 'A':
    166                 return [current.mark] + cls._restore_aux(

~/anaconda3/envs/ttsTF/lib/python3.6/site-packages/phonemizer/punctuation.py in _restore_aux(cls, text, marks, num)
    175                 restored = cls._restore_aux(
    176                     [text[0] + current.mark + text[1]] + text[2:],
--> 177                     marks[1:], num)
    178             return restored
    179         else:

~/anaconda3/envs/ttsTF/lib/python3.6/site-packages/phonemizer/punctuation.py in _restore_aux(cls, text, marks, num)
    178             return restored
    179         else:
--> 180             return [text[0]] + cls._restore_aux(text[1:], marks, num + 1)

~/anaconda3/envs/ttsTF/lib/python3.6/site-packages/phonemizer/punctuation.py in _restore_aux(cls, text, marks, num)
    162             if current.position == 'E':
    163                 return [text[0] + current.mark] + cls._restore_aux(
--> 164                     text[1:], marks[1:], num + 1)
    165             if current.position == 'A':
    166                 return [current.mark] + cls._restore_aux(

~/anaconda3/envs/ttsTF/lib/python3.6/site-packages/phonemizer/punctuation.py in _restore_aux(cls, text, marks, num)
    162             if current.position == 'E':
    163                 return [text[0] + current.mark] + cls._restore_aux(
--> 164                     text[1:], marks[1:], num + 1)
    165             if current.position == 'A':
    166                 return [current.mark] + cls._restore_aux(

~/anaconda3/envs/ttsTF/lib/python3.6/site-packages/phonemizer/punctuation.py in _restore_aux(cls, text, marks, num)
    161                     [current.mark + text[0]] + text[1:], marks[1:], num)
    162             if current.position == 'E':
--> 163                 return [text[0] + current.mark] + cls._restore_aux(
    164                     text[1:], marks[1:], num + 1)
    165             if current.position == 'A':

IndexError: list index out of range

mmmaat commented 4 years ago

Hi, could you please provide a complete example of a failing command, including the input text and options?

mmmaat commented 4 years ago

OK, I understand the bug: it occurs when trying to restore punctuation on an empty text. I'll publish a fix soon. Thanks for reporting.
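
To illustrate the failure mode described above, here is a minimal sketch. It is not the actual phonemizer code: the `Mark` record and both `restore_*` functions are hypothetical simplifications, loosely modelled on the recursive `_restore_aux` frames visible in the traceback. The point is only that indexing `text[0]` while marks remain but no text chunks are left raises the reported IndexError, and that a guard on the empty case avoids it.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's internal mark record.
# position 'E': the mark ends a text chunk; 'A': the mark stands alone.
Mark = namedtuple("Mark", ["mark", "position"])


def restore_unguarded(text, marks):
    """Reattach marks to chunks without checking whether chunks remain."""
    if not marks:
        return text
    current = marks[0]
    if current.position == "E":
        # text[0] raises IndexError when text is already empty
        return [text[0] + current.mark] + restore_unguarded(text[1:], marks[1:])
    # position 'A': emit the mark as its own chunk
    return [current.mark] + restore_unguarded(text, marks[1:])


def restore_guarded(text, marks):
    """Same logic, but handle the case where the text is exhausted first."""
    if not marks:
        return text
    if not text:
        # No chunks left: emit the remaining marks joined into one chunk.
        # This is one possible policy, chosen here for illustration only.
        return ["".join(m.mark for m in marks)]
    current = marks[0]
    if current.position == "E":
        return [text[0] + current.mark] + restore_guarded(text[1:], marks[1:])
    return [current.mark] + restore_guarded(text, marks[1:])
```

With a single chunk followed by two end-position marks, as in the !' example, the unguarded version consumes the only chunk on the first mark and then fails on the second, while the guarded version returns the leftover mark as a separate chunk. How the actual fix treats leftover marks may differ; see the commit for the real change.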

mmmaat commented 4 years ago

Fixed in https://github.com/bootphon/phonemizer/commit/ee591edd05ad013a295f324e49347dbe2576ac80.

michael-conrad commented 3 years ago

Don't know if this is related or not, but:

000004280: Hélas! . ni l'un ni l'autre ne ressemblait au sien.
Traceback (most recent call last):
  File "/home/muksihs/git/Cherokee-TTS/data/comvoi_ipa/generateTrainingData.py", line 59, in <module>
    use_sampa=False)
  File "/home/muksihs/miniconda3/envs/Cherokee-TTS/lib/python3.7/site-packages/phonemizer/phonemize.py", line 172, in phonemize
    text, separator=separator, strip=strip, njobs=njobs)
  File "/home/muksihs/miniconda3/envs/Cherokee-TTS/lib/python3.7/site-packages/phonemizer/backend/base.py", line 126, in phonemize
    text = self._punctuator.restore(text, punctuation_marks)
  File "/home/muksihs/miniconda3/envs/Cherokee-TTS/lib/python3.7/site-packages/phonemizer/punctuation.py", line 146, in restore
    return cls._restore_aux(str2list(text), marks, 0)
  File "/home/muksihs/miniconda3/envs/Cherokee-TTS/lib/python3.7/site-packages/phonemizer/punctuation.py", line 166, in _restore_aux
    [text[0] + m.mark + text[1]] + text[2:], marks[1:], n)
  File "/home/muksihs/miniconda3/envs/Cherokee-TTS/lib/python3.7/site-packages/phonemizer/punctuation.py", line 166, in _restore_aux
    [text[0] + m.mark + text[1]] + text[2:], marks[1:], n)
IndexError: list index out of range

pip show phonemizer
Name: phonemizer
Version: 2.1
Summary: Simple text to phones converter for multiple languages
Home-page: https://github.com/bootphon/phonemizer
Author: Mathieu Bernard
Author-email: mathieu.a.bernard@inria.fr
License: GPL3
Location: /home/muksihs/miniconda3/envs/Cherokee-TTS/lib/python3.7/site-packages
Requires: segments, attrs, joblib
Required-by:

mmmaat commented 3 years ago

Hi, you should indeed upgrade your phonemizer version. With a recent version, your input works:

>>> from phonemizer import phonemize
>>> utt = "Hélas! . ni l'un ni l'autre ne ressemblait au sien." 
>>> phonemize(utt, backend='espeak', language='fr-fr', preserve_punctuation=True)
'elas ! . ni lœ̃ ni lotʁ nə ʁəsɑ̃blɛt o sjɛ̃ .'

This is the version I am running:

$ phonemize --version
phonemizer-2.2.2
available backends: espeak-ng-1.50, espeak-mbrola, festival-2.5.0, segments-2.1.3
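
For scripts that need to check this before calling phonemize, a small version comparison can be sketched as follows. The helper names are hypothetical, and the 2.2.2 threshold is an assumption taken only from the output above (it is the version shown to handle the failing input, not necessarily the first release that contains the fix).

```python
# Assumption: 2.2.2 is used as the threshold only because it is the
# version shown to work in this thread, not a documented minimum.
KNOWN_GOOD = "2.2.2"


def parse_version(version):
    """Turn '2.2.2' into (2, 2, 2); assumes purely numeric, dot-separated parts."""
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(installed, known_good=KNOWN_GOOD):
    """Return True if the installed version is older than the known-good one."""
    return parse_version(installed) < parse_version(known_good)
```

The installed version string can be taken from `pip show phonemizer` or `phonemize --version`; for example, `needs_upgrade("2.1")` is true, while `needs_upgrade("2.2.2")` is false.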

bootphon / phonemizer

Double punctuation breaks phonemization #54