[Bug] The voice-cloned speaker continues with garbage after the to-be-spoken text is finished, or mid-sentence

Bardo-Konrad commented 9 months ago

Describe the bug

Sometimes the speech pauses and then the speaker continues, but what follows is neither part of the input text nor any recognizable language, although it is clearly the same speaker. Unless you want to create a horror movie with a disturbingly familiar voice, this behaviour is undesired. I think bark has the same issue.

To Reproduce

import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
was = 'tts_models/multilingual/multi-dataset/xtts_v2'
tts = TTS(model_name=was).to(device)
tts.tts_to_file(text="Some longer text", speaker_wav="some.wav", language="de", file_path="some-output.wav")

Expected behavior

Only speak what's being written.

kaveenkumar commented 8 months ago

Does anyone have a workaround for this?

I tried ending all my text with a period ".", but that does not stop the synthesizer from continuing past the end. Often there are artifacts along with the input text.

Bardo-Konrad commented 8 months ago

> Does anyone have a workaround for this?
>
> I tried ending all my text with a period ".", but that does not stop the synthesizer from continuing past the end. Often there are artifacts along with the input text.

Probably the only way around it is to generate the speech, run speech-to-text on it, compare the transcript to the input to get the timestamps of the gibberish, cut those parts out, and save the result again.

Kinda dumb, but what the heck.
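
A minimal sketch of the comparison step in that idea, using only the Python standard library. It assumes the speech-to-text tool returns word-level timestamps as (word, start_sec, end_sec) tuples; that format and the function names are hypothetical, so adapt them to whatever your recognizer actually produces. The sketch aligns the transcript with the input text and reports the time spans of words that do not line up with the input.

```python
import re
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase and strip punctuation so 'World,' matches 'world'."""
    return re.sub(r"[^\w]", "", word.lower())


def find_gibberish_spans(input_text, asr_words):
    """Return (start_sec, end_sec) spans of transcribed words not found in input_text.

    asr_words is a list of (word, start_sec, end_sec) tuples. This is an
    assumed format, not the output of any specific speech-to-text library.
    """
    expected = [w for w in (normalize(t) for t in input_text.split()) if w]
    heard = [normalize(w) for w, _, _ in asr_words]
    matcher = SequenceMatcher(a=expected, b=heard, autojunk=False)
    spans = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' cover transcribed words that do not align with the input.
        if tag in ("insert", "replace") and j2 > j1:
            spans.append((asr_words[j1][1], asr_words[j2 - 1][2]))
    return spans


# Example with made-up timestamps: two extra "words" after the real sentence.
words = [
    ("Hello", 0.0, 0.4), ("world", 0.4, 0.9), ("this", 1.0, 1.2),
    ("is", 1.2, 1.3), ("a", 1.3, 1.4), ("test", 1.4, 1.9),
    ("blorg", 2.5, 3.0), ("fnah", 3.0, 3.6),
]
print(find_gibberish_spans("Hello world, this is a test.", words))  # [(2.5, 3.6)]
```

Note that a real word the recognizer mis-transcribes would also be flagged here, so the reported spans are candidates to review rather than guaranteed gibberish.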

stale[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

Bardo-Konrad commented 4 months ago

I want to draw attention to this.

kaveenkumar commented 3 months ago

> > Does anyone have a workaround for this? I tried ending all my text with a period ".", but that does not stop the synthesizer from continuing past the end. Often there are artifacts along with the input text.
>
> Probably the only way around it is to generate the speech, run speech-to-text on it, compare the transcript to the input to get the timestamps of the gibberish, cut those parts out, and save the result again.
>
> Kinda dumb, but what the heck.

I am thinking of implementing this. However, instead of gathering timestamps for the gibberish (which we don't know in advance, so it is complex to detect), I would prefer to gather timestamps for the input text (which we do know) and crop and save only that time range.
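
A rough sketch of that crop-to-known-text variant, again standard library only. It assumes word-level timestamps in a hypothetical (word, start_sec, end_sec) format and a plain PCM WAV file that Python's wave module can read; keep_range and crop_wav are illustrative names, not part of any TTS library.

```python
import re
import wave
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase and strip punctuation before comparing words."""
    return re.sub(r"[^\w]", "", word.lower())


def keep_range(input_text, asr_words, padding=0.1):
    """Return (start_sec, end_sec) from the first to the last transcribed word
    that matches input_text, or None if nothing matches.

    asr_words is a list of (word, start_sec, end_sec) tuples (assumed format).
    """
    expected = [w for w in (normalize(t) for t in input_text.split()) if w]
    heard = [normalize(w) for w, _, _ in asr_words]
    blocks = SequenceMatcher(a=expected, b=heard, autojunk=False).get_matching_blocks()
    matched = [b for b in blocks if b.size > 0]
    if not matched:
        return None
    first, last = matched[0], matched[-1]
    start = max(0.0, asr_words[first.b][1] - padding)
    end = asr_words[last.b + last.size - 1][2] + padding
    return start, end


def crop_wav(src_path, dst_path, start_sec, end_sec):
    """Write only the frames between start_sec and end_sec to dst_path."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        first_frame = min(int(start_sec * rate), total)
        last_frame = min(int(end_sec * rate), total)
        src.setpos(first_frame)
        frames = src.readframes(max(0, last_frame - first_frame))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is corrected on close
        dst.writeframes(frames)
```

A single crop like this only trims gibberish before the first and after the last matched word; anything inserted mid-sentence would remain in the output and would still need to be cut out separately.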

coqui-ai / TTS