count_phonemized in words_mismatch should use given separator, not spaces to split words

iamanigeeit commented 1 week ago

It's quite common to use spaces to separate the phonemes for speech synthesis.

But this triggers spurious word count mismatch warnings, because count_phonemized splits on whitespace and so counts every phone as a word.

>>> from phonemizer.backend import BACKENDS
>>> from phonemizer.separator import Separator
>>> G2P = BACKENDS['espeak'](language='en-us', words_mismatch='warn')
>>> SEP = Separator(word='|', phone=' ')
>>> G2P.phonemize(['try'], separator=SEP)[0]
WARNING:phonemizer:words count mismatch on line 1 (expected 1 words but get 4)
WARNING:phonemizer:words count mismatch on 100.0% of the lines (1/1)
't ɹ aɪ |'
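
To see where the "4 words" comes from, here is a minimal sketch in plain Python (no phonemizer needed). It assumes the mismatch check counts words with a bare whitespace split, which is what the warning suggests; the variable names are only illustrative.

```python
# Output from the session above: phones separated by ' ', words by '|'.
phonemized = 't ɹ aɪ |'

# Splitting on whitespace treats every phone (and the trailing '|') as a word.
by_whitespace = phonemized.strip().split()
print(by_whitespace)       # ['t', 'ɹ', 'aɪ', '|']
print(len(by_whitespace))  # 4

# Splitting on the word separator instead recovers the single input word.
by_wordsep = [w for w in phonemized.strip().split('|') if w]
print(len(by_wordsep))     # 1
```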

It seems to be a common issue, e.g. https://github.com/bootphon/phonemizer/issues/154 and https://github.com/lifeiteng/vall-e/issues/50

I have fixed this (per below) but let me know if you need a PR for it.

Fix in words_mismatch.py:

    @classmethod
    def _count_words(cls, text, wordsep=None):
        """Return the number of words contained in each line of `text`"""
        return [
            len([w for w in line.strip().split(wordsep) if w])
            for line in text]

    def count_phonemized(self, text, wordsep=None):
        """Stores the number of words in each output line"""
        self._count_phn = self._count_words(text, wordsep)
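
For a quick check outside the library, the counting logic can be pulled out into a standalone function. This is only a sketch: `count_words` is a hypothetical free function mirroring `_count_words` above, and the second input line is hand-built by repeating the word, not actual espeak output.

```python
def count_words(lines, wordsep=None):
    """Count the non-empty words on each line, splitting on `wordsep`.

    With wordsep=None this falls back to whitespace splitting.
    """
    return [
        len([w for w in line.strip().split(wordsep) if w])
        for line in lines]


# First line is the output shown earlier; second is a hand-made two-word line.
lines = ['t ɹ aɪ |', 't ɹ aɪ |t ɹ aɪ |']

print(count_words(lines))       # [4, 7]  (whitespace split: the old behaviour)
print(count_words(lines, '|'))  # [1, 2]  (split on the word separator)
```

One caveat that may be worth guarding against: str.split raises ValueError on an empty separator, so this would fail if an empty word separator were ever passed in.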

Fix in espeak.py:

    def _phonemize_postprocess(self, phonemized, punctuation_marks, separator):
        text = phonemized[0]
        switches = phonemized[1]

        self._words_mismatch.count_phonemized(text, separator.word)
        self._lang_switch.warning(switches)

        phonemized = super()._phonemize_postprocess(text, punctuation_marks, separator)
        return self._words_mismatch.process(phonemized)

Fix in base.py:

    def phonemize(self, text, separator=default_separator,
                  strip=False, njobs=1):
        ...
        return self._phonemize_postprocess(phonemized, punctuation_marks, separator)

    def _phonemize_postprocess(self, phonemized, punctuation_marks, separator):
        ...
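
To illustrate why base.py has to change as well, here is a toy sketch of the plumbing. None of these classes are phonemizer's: `Sep`, `ToyBase`, `ToyEspeak` and `_fake_g2p` are hypothetical stand-ins, and the "phonemization" just treats each letter as a phone. The point is only that the base class must forward the separator to the postprocess hook so the subclass can count words with it.

```python
from collections import namedtuple

# Hypothetical stand-in for phonemizer's Separator (only two fields).
Sep = namedtuple('Sep', ['word', 'phone'])


class ToyBase:
    """Toy base backend: forwards the separator to the postprocess hook."""

    def phonemize(self, lines, separator):
        phonemized = [self._fake_g2p(line, separator) for line in lines]
        return self._phonemize_postprocess(phonemized, separator)

    def _fake_g2p(self, line, separator):
        # Pretend each letter is a phone; this is not real phonemization.
        return ''.join(
            separator.phone.join(word) + separator.phone + separator.word
            for word in line.split())

    def _phonemize_postprocess(self, phonemized, separator):
        return phonemized


class ToyEspeak(ToyBase):
    """Toy subclass: counts output words using the forwarded separator."""

    def _phonemize_postprocess(self, phonemized, separator):
        self.word_counts = [
            len([w for w in line.strip().split(separator.word) if w])
            for line in phonemized]
        return super()._phonemize_postprocess(phonemized, separator)


backend = ToyEspeak()
out = backend.phonemize(['try it'], Sep(word='|', phone=' '))
print(out)                  # ['t r y |i t |']
print(backend.word_counts)  # [2]
```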

Note: this still raises warnings when unexpected splits occur, such as capitals in the middle of a word ("GameStop") or non-word characters before punctuation ("he said--, no"). It should suffice for most cases, though, and the input text should be normalized properly anyway.

mmmaat commented 1 week ago

Thanks for pointing out that bug! Does the fix in the issue_169 branch solve your problem?

iamanigeeit commented 1 week ago

Thanks for adding test cases! I haven't tested it, as I simply changed my own code.
