KoljaB / RealtimeTTS

Converts text to speech in realtime

What parameters that should be used for speech generation for long text #66

Closed guijuzhejiang closed 2 months ago

guijuzhejiang commented 2 months ago

Thank you for your work, I think it's cool. Speech generated from short text works great, but when I generate speech for longer text, the speech starts out fast and gets slower and slower, and sentences are occasionally repeated. What parameters should be used for speech generation for long text? Thank you.

guijuzhejiang commented 2 months ago

Also, when running the stream.play method the following warnings appear. Can you tell me what these mean? Do I need to change anything?

ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:877:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'

KoljaB commented 2 months ago

I have never experienced anything like speech starting fast and then getting slower. Very strange. Can I have some more info: which engine are you using, and with which parameters? How do you use it (play or play_async, do you use callbacks, do you use generators or stream LLM outputs)?

Also, I'm a total Linux noob and sadly have no clue about ALSA and all the problems that can arise around it.

guijuzhejiang commented 2 months ago

Thanks for your reply. Because the text is Chinese, I use tokenizer="stanza" and language="zh". The input passed to feed in the code is a long Chinese text as a plain string, not a generator that yields. Here is my specific code:

from RealtimeTTS import CoquiEngine, TextToAudioStream

def synthesize(generator, ref_wav_json):
    # Using a Chinese cloning reference gives better quality.
    engine = CoquiEngine(
        voice=ref_wav_json, language="zh", speed=1.0,
        length_penalty=0.2, repetition_penalty=10.0,
        stream_chunk_size=20, overlap_wav_len=512, use_deepspeed=True,
    )
    stream = TextToAudioStream(engine, log_characters=True, tokenizer="stanza", language="zh")
    # Despite the name, "generator" here is one long string of Chinese text.
    stream.feed(generator)
    stream.play(
        minimum_sentence_length=2,
        minimum_first_fragment_length=2,
        output_wavfile="test.wav",
        tokenizer="stanza",
        language="zh",
        context_size=2,
    )

KoljaB commented 2 months ago

I can't really see any direct cause of that problem currently. Maybe it is an issue with specially formatted text that lacks sentence delimiters, so the tokenizer has nothing to split on. Hard to say.
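One quick way to check the delimiter theory, as a minimal sketch: measure the longest stretch of text that contains no sentence-ending mark. The helper name `longest_undelimited_run` and the delimiter set are assumptions made for illustration; they are not part of RealtimeTTS, and they do not reproduce the rules the stanza tokenizer actually applies.

```python
import re

# Sentence-ending marks (Chinese and Western). This set is an assumption
# for illustration, not the rule set used by the stanza tokenizer.
SENTENCE_END = re.compile(r"[。！？!?；;.\n]")


def longest_undelimited_run(text: str) -> int:
    """Return the length of the longest stretch with no sentence-ending mark."""
    runs = SENTENCE_END.split(text)
    return max(len(run.strip()) for run in runs)


if __name__ == "__main__":
    sample = "你好。今天天气很好！"
    print(longest_undelimited_run(sample))
```

If this number is very large for the problematic input, the text probably reaches the model as one oversized sentence, which would fit the symptoms described above.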

Your length_penalty is a bit low and overlap_wav_len is quite short, but that should not result in the issue you described. I hope it isn't the coqui tts model itself, but with your description of what goes wrong this is quite hard to track down. I hope I will find time to reproduce it soon. If you could mail a Chinese demo text that causes the issue to kolja.beigel@web.de, that would be a great help and raise the chances that I can reproduce it.

If you are keen to look deeper under the hood yourself, you could also try:

import logging
logging.basicConfig(level=logging.DEBUG)
engine = CoquiEngine(level=logging.DEBUG)

Not sure if logging even helps in a case like that.

guijuzhejiang commented 2 months ago

Hi KoljaB, thank you for your attention to this issue. I am sending the Chinese text file to your email and hope it will help to reproduce the problem. I also used the original xtts project to generate the Chinese voice and had pretty much the same problem, so I think it's possible that the xtts-V2 model itself doesn't have perfect support for Chinese voices. There may not be many modifications we can make, or I should try cloning the voice by training a model, e.g. with the GPT-SoVITS project. Thanks again for your attention.

guijuzhejiang commented 2 months ago

@KoljaB Thanks for the test. It seems likely that the problem occurs with long sentences, and perhaps it can be solved by adding punctuation to them. However, some sentences are genuinely long and are not suitable for adding punctuation. The test also showed the upper limit of the XTTS model's capability: it likely does not perform as well on long Chinese texts as it does on short ones. A question of model capability is beyond what can be fixed in this project. Perhaps we can look forward to a release of xtts-V3 with better support for long Chinese text. Thanks again for taking the time to test. I will close the issue.
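For anyone who wants to try the punctuation workaround mentioned above, here is a rough sketch of pre-splitting long text before synthesis. The function `split_long_text`, the punctuation sets, and the `max_chars` default are all assumptions chosen for illustration. None of them come from RealtimeTTS or XTTS, and the hard-cut fallback may still break a sentence at an unnatural point.

```python
import re
from typing import List

# Preferred break points. Both sets are illustrative assumptions.
SENTENCE_MARKS = "。！？!?"
CLAUSE_MARKS = "，、；：,;:"


def split_long_text(text: str, max_chars: int = 60) -> List[str]:
    """Split text into pieces of at most max_chars, preferring punctuation."""
    # Split after each sentence-ending mark, keeping the mark attached.
    sentences = [
        s for s in re.split(f"(?<=[{SENTENCE_MARKS}])", text) if s.strip()
    ]
    pieces: List[str] = []
    for sentence in sentences:
        while len(sentence) > max_chars:
            window = sentence[:max_chars]
            # Cut just after the last clause mark inside the window, if any.
            cut = max(window.rfind(mark) for mark in CLAUSE_MARKS) + 1
            if cut == 0:
                cut = max_chars  # no punctuation found: fall back to a hard cut
            pieces.append(sentence[:cut])
            sentence = sentence[cut:]
        if sentence.strip():
            pieces.append(sentence)
    return pieces


if __name__ == "__main__":
    for piece in split_long_text("一二三，四五六七八九十", max_chars=5):
        print(piece)
```

Each piece could then be passed to stream.feed in place of the single long string. Whether this actually removes the slowdown and repetition is untested here.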