Closed by Mixomo 1 month ago
Hi @Mixomo
Those warnings/limits are set within Coqui's scripts, at their suggestion/recommendation for their models. It is fair to say that languages other than English have lower limits, as their character sets take up more tokens per character.
I'm not sure where you are seeing this warning. Do you mean when finetuning?
It shouldn't specifically be an issue when finetuning. However, if you do wish to reduce the length of your training data text, you can do so by manually editing the CSV files: identify lines of text that exceed your language's character limit, then either remove them or split each line into two within the CSV, split the associated audio into two files as well, and reference both halves correctly in your CSV files.
If you want to find further information or ask more specific questions about the tokenizer, I would suggest the Coqui documentation and forums:
https://docs.coqui.ai/en/latest/finetuning.html
https://docs.coqui.ai/en/latest/models/xtts.html
https://github.com/coqui-ai/TTS
https://github.com/coqui-ai/TTS/discussions
Thanks
[!] Warning: The text length exceeds the character limit of 239 for language 'es', this might cause truncated audio.
Hello, does this warning have to be taken literally?
I have trained a couple of models, and they have not given me any problems with truncated audio. However, the warning has me intrigued... Is it due to the tokenizer's limits for each language? Can this limit be bypassed, or are the tokenizers already programmed with this limit? I saw a reference to the limits in this portion of the tokenizer.py code (sorry if that's not the code the warning actually refers to).
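For illustration, the check behind that message can be sketched roughly as below. This is a simplified stand-in, not Coqui's actual code; only the `'es': 239` value is taken from the warning itself (the real per-language table lives in tokenizer.py in the Coqui TTS repository), and the fallback limit of 250 is an assumption. The key point it shows is that exceeding the limit only prints a warning rather than stopping inference, which matches the behavior you observed.

```python
# Simplified sketch of a per-language length check, modeled on the
# warning text. Only 'es': 239 comes from the warning above; the
# fallback of 250 is an assumption for illustration.
char_limits = {"es": 239}

def check_input_length(text: str, lang: str) -> bool:
    """Return True if `text` fits the per-language limit; otherwise
    print a warning (the text is still processed, not rejected)."""
    limit = char_limits.get(lang, 250)
    if len(text) > limit:
        print(f"[!] Warning: The text length exceeds the character limit of "
              f"{limit} for language '{lang}', this might cause truncated audio.")
        return False
    return True
```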
Thank you