BrasD99 / HeyGenClone

A simple and open-source analogue of the HeyGen system

Regarding the issue of synchronizing human voice audio #5

Closed by wxbool 9 months ago

wxbool commented 9 months ago

Hello, I have a question about a difficulty I'm encountering in the process of replicating the HeyGen function. The core steps currently include text translation and Text-to-Speech (TTS). I see that the project is using the Google translation engine, but let's ignore the translation accuracy for now. My question is: since the length of the text will vary after translation due to different languages, this will also result in inconsistent lengths of human voice audio after TTS dubbing. When I need to keep the duration of the final output video consistent with the original video, how can I solve the problem of matching the dubbed human voice audio with the original video footage synchronously? Do you have any suggested methods?

BrasD99 commented 9 months ago

Hello @wxbool!

Yes, I have run into this problem. To solve it, I started using whisperX, which, in addition to the transcription, returns the time interval of each speech segment. The text is then voiced with a cloned voice. From the time intervals you can get the duration of the original speech, and with pydub, for example, you can get the duration of the audio with the cloned voice. These two durations will practically always differ. Our task is to make the duration of the dubbed audio match the duration of the original, so that it can replace it cleanly. Pydub has a speedup method that speeds audio up. But what if we need to slow it down instead? That's why I use audiostretchy. It has a ratio parameter: "The stretch ratio, where values greater than 1.0 will extend the audio and values less than 1.0 will shorten the audio. From 0.5 to 2.0, or with -d from 0.25 to 4.0. Default is 1.0 = no stretching". Audiostretchy "performs fast, high-quality time-stretching of WAV/MP3 files without changing their pitch". All that is left is to determine the ratio value: given the definition above, it is the duration of the original speech segment divided by the duration of the cloned-voice audio.
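
A minimal sketch of that ratio calculation, not the project's actual code. The function names `stretch_ratio` and `fit_cloned_audio` are hypothetical. The `stretch_audio(input, output, ratio=...)` call is taken from the audiostretchy README as I remember it, so treat that signature as an assumption and check it against the installed version. The default limits of 0.5 and 2.0 come from the parameter description quoted above.

```python
def stretch_ratio(segment_start, segment_end, cloned_duration, lo=0.5, hi=2.0):
    """Return the stretch ratio that makes the cloned audio as long as the
    original speech segment.

    segment_start / segment_end: segment boundaries in seconds (for example,
    the "start" and "end" values of a whisperX segment).
    cloned_duration: length of the cloned-voice audio in seconds.
    lo / hi: allowed ratio range; values outside it are clamped.
    """
    target_duration = segment_end - segment_start
    if target_duration <= 0 or cloned_duration <= 0:
        raise ValueError("durations must be positive")
    # A ratio above 1.0 lengthens the audio, below 1.0 shortens it,
    # so the target duration goes in the numerator.
    ratio = target_duration / cloned_duration
    return min(max(ratio, lo), hi)


def fit_cloned_audio(cloned_path, out_path, segment_start, segment_end):
    """Stretch the cloned-voice file so it fits the original segment.

    Requires pydub and audiostretchy; they are imported here so that
    stretch_ratio stays usable without them.
    """
    from pydub import AudioSegment
    from audiostretchy.stretch import stretch_audio  # assumed API, see above

    cloned_duration = AudioSegment.from_file(cloned_path).duration_seconds
    ratio = stretch_ratio(segment_start, segment_end, cloned_duration)
    stretch_audio(cloned_path, out_path, ratio=ratio)
    return ratio
```

For example, if the original segment runs from 1.0 s to 4.0 s and the cloned audio lasts 2.0 s, the ratio is 1.5 and the audio is slowed down. Because the ratio is clamped, a segment whose durations differ by more than the allowed range will still not match exactly after stretching.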

There is a Mega TTS 2 model that would allow us not to bother with synchronizing audio durations. However, it is in closed access, and we have to wait for someone to be able to make a similar model and make it open-source.

seetimee commented 8 months ago

Are there any other, newer solutions? For example, extending the video, or a model that can clone the speed of the original voice?

BrasD99 commented 8 months ago

@seetimee https://github.com/facebookresearch/seamless_communication

seetimee commented 8 months ago

> @seetimee https://github.com/facebookresearch/seamless_communication

Thanks, bro.