erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav file maintenance. It can also be used with 3rd-party software via JSON calls.
GNU Affero General Public License v3.0

[!] Warning: The text length exceeds the character limit of 239 for language 'es', this might cause truncated audio. #232

Closed: Mixomo closed this issue 1 month ago

Mixomo commented 1 month ago

[!] Warning: The text length exceeds the character limit of 239 for language 'es', this might cause truncated audio.

Hello, does this warning have to be taken literally?

I have trained a couple of models, and they have not given me any problems with truncated audio. However, the warning has me intrigued... Is it due to the limits of the tokenizer for each language? Can this limit be bypassed, or are the tokenizers already programmed with this limit? I saw a reference to the limits in this portion of the tokenizer.py code (sorry if this is not the code the warning actually refers to).

Thank you


# Excerpt from Coqui's tokenizer.py; `textwrap` is imported and `get_spacy_lang`
# is defined elsewhere in that same file.
def split_sentence(text, lang, text_split_length=250):
    """Split the input text into chunks of at most `text_split_length`
    characters, preferring sentence boundaries."""
    text_splits = []
    if text_split_length is not None and len(text) >= text_split_length:
        text_splits.append("")
        nlp = get_spacy_lang(lang)
        nlp.add_pipe("sentencizer")
        doc = nlp(text)
        for sentence in doc.sents:
            if len(text_splits[-1]) + len(str(sentence)) <= text_split_length:
                # if the last sentence + the current sentence is less than the text_split_length
                # then add the current sentence to the last sentence
                text_splits[-1] += " " + str(sentence)
                text_splits[-1] = text_splits[-1].lstrip()
            elif len(str(sentence)) > text_split_length:
                # if the current sentence is greater than the text_split_length
                for line in textwrap.wrap(
                    str(sentence),
                    width=text_split_length,
                    drop_whitespace=True,
                    break_on_hyphens=False,
                    tabsize=1,
                ):
                    text_splits.append(str(line))
            else:
                text_splits.append(str(sentence))

        if len(text_splits) > 1:
            if text_splits[0] == "":
                del text_splits[0]
    else:
        text_splits = [text.lstrip()]

    return text_splits
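
For reference, a minimal sketch of how that splitter behaves on an over-length Spanish string (this assumes split_sentence is importable from Coqui's tokenizer module and that spaCy is installed; the exact chunking depends on the sentencizer):

# Minimal sketch; assumes a Coqui TTS install where this function is importable.
from TTS.tts.layers.xtts.tokenizer import split_sentence

long_text = "Esta es una frase de ejemplo bastante larga. " * 12  # ~540 characters
chunks = split_sentence(long_text, lang="es", text_split_length=239)
for chunk in chunks:
    print(len(chunk), repr(chunk[:40]))
# Each chunk comes out at (or very near) the 239-character limit for Spanish,
# so long inputs are chunked rather than fed to the model in one piece.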
erew123 commented 1 month ago

Hi @Mixomo

Those warnings/limits are set within Coqui's scripts, at their suggestion/recommendation for their models. It is fair to say that languages other than English do have tighter limits, as their character sets take up more tokens per character.
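
To illustrate where the warning itself comes from, here is a simplified sketch of the per-language length check in Coqui's tokenizer. Only the Spanish value (239) is taken from the warning you quoted; the real file carries a full per-language table:

# Simplified sketch of the check that prints the quoted warning; only the
# Spanish limit is confirmed by this issue, other languages have their own entries.
char_limits = {
    "es": 239,
    # ... per-language limits for the other supported languages ...
}

def check_input_length(txt: str, lang: str, default_limit: int = 250) -> None:
    lang = lang.split("-")[0]  # strip any region suffix, e.g. "es-ES" -> "es"
    limit = char_limits.get(lang, default_limit)
    if len(txt) > limit:
        print(
            f"[!] Warning: The text length exceeds the character limit of {limit} "
            f"for language '{lang}', this might cause truncated audio."
        )

As far as I can tell, that check only warns; it does not cut the text itself, which is why splitting long inputs (as split_sentence does) is the practical way to stay under the limit.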

I'm not sure where you're saying you are seeing the warning. Do you mean when finetuning?

It shouldn't specifically be an issue when finetuning. However, if you do wish to reduce the length of your training text, you can manually edit the CSV files: identify lines of text that exceed your language's character limit and either remove them, or split the line into two within the CSV, split the associated audio into two files as well, and reference both correctly within your CSV files. A rough way to spot those lines is sketched below.
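
If it helps, a rough sketch of how you might flag such lines before editing them by hand. This assumes the pipe-delimited metadata format used for XTTS finetuning (audio_file|text|speaker); adjust the delimiter or column index if your CSVs differ:

import csv

def find_long_lines(csv_path, limit=239, text_column=1):
    """Yield (line_number, text) for rows whose text exceeds `limit` characters."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="|")
        for lineno, row in enumerate(reader, start=1):
            if len(row) > text_column and len(row[text_column]) > limit:
                yield lineno, row[text_column]

# Example: list offending lines so they can be split or removed by hand.
# for lineno, text in find_long_lines("metadata_train.csv", limit=239):
#     print(lineno, len(text), text[:60], "...")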

If you want further information or have more specific questions about the tokenizer, I would suggest the Coqui documentation and forums:

https://docs.coqui.ai/en/latest/finetuning.html
https://docs.coqui.ai/en/latest/models/xtts.html
https://github.com/coqui-ai/TTS
https://github.com/coqui-ai/TTS/discussions

Thanks