KoljaB / RealtimeTTS

Converts text to speech in realtime
1.81k stars 164 forks

Process with a VTT or SRT in realtime or not #140

Open ROBERT-MCDOWELL opened 2 days ago

ROBERT-MCDOWELL commented 2 days ago

It would be fantastic to use RealtimeTTS with a VTT or SRT file (or other subtitle formats) so that the engine respects the start time of each segment. That way we could have a direct audio translation, either as realtime audio or recorded to an audio file (AAC, WAV or MP3, for example). Unless it's already possible to do this?

KoljaB commented 2 days ago

https://github.com/KoljaB/TurnVoice/blob/main/turnvoice%2Fcore%2Fsynthesis.py#L272

This does something very similar. I think the idea of processing VTT and SRT is great, but it's hard to do in real time. It might be better as an add-on project.

ROBERT-MCDOWELL commented 1 day ago

Well, even if it's not realtime, it would already help a lot ;). I'm working on it now, but my biggest issue is getting a dummy audio device to work, as my computer does not have a sound card. How would you use synthesis.py in the VTT/SRT context?

KoljaB commented 1 day ago

I'd parse the file for the start and end times to get each segment's duration and pass that as the desired_duration parameter to the synthesize_duration method, so the text gets spoken in the correct time. Fill the parts where nothing is spoken with silence and you're good, I guess.
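The parsing step described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not code from TurnVoice or RealtimeTTS: the names `Cue`, `to_seconds`, `parse_cues` and `build_timeline` are hypothetical, and the parser only handles plain SRT/VTT cues (no styling tags, no NOTE blocks with arrows in them). Each "speech" duration is what would be handed on as the desired duration; the synthesis call itself is left out because its exact signature is not shown in this thread.

```python
import re
from dataclasses import dataclass

# Matches "00:00:01,000" (SRT) and "00:00:01.000" or "00:01.000" (VTT).
TIMESTAMP = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def to_seconds(stamp: str) -> float:
    """Convert a subtitle timestamp to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cues(content: str) -> list[Cue]:
    """Parse SRT or VTT text into cues; blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", content.replace("\r\n", "\n").strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        for index, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->", 1)
                end = end.split()[0]  # drop VTT cue settings after the end time
                text = " ".join(lines[index + 1:])
                cues.append(Cue(to_seconds(start), to_seconds(end), text))
                break
    return cues


def build_timeline(cues: list[Cue]) -> list[tuple[str, float, str]]:
    """Return ("silence" | "speech", seconds, text) entries in playback order.

    Overlapping cues get no silence in between; their speech durations are
    kept as-is, so a caller would have to decide how to handle the overlap.
    """
    timeline = []
    cursor = 0.0
    for cue in cues:
        gap = cue.start - cursor
        if gap > 0:
            timeline.append(("silence", gap, ""))
        timeline.append(("speech", cue.duration, cue.text))
        cursor = max(cursor, cue.end)
    return timeline
```

For each "speech" entry the text would be synthesized to fit the given number of seconds, and for each "silence" entry that many seconds of zero samples would be appended to the output.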

KoljaB commented 1 day ago

It's hard to make this realtime. Because the final duration of the synthesized audio is unknown beforehand (especially with neural TTS engines, whose output is nondeterministic), we test-synthesize here, measure the duration of the result and apply a speed correction factor afterwards. So we stretch the audio in place. But we need the full audio generated to do this, which is far from realtime.
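To illustrate the measure-then-correct idea described above, here is a rough sketch under stated assumptions. It is not the TurnVoice implementation: `speed_factor` and `fit_to_duration` are hypothetical names, and the stretch is done with naive linear resampling, which also shifts the pitch. A real pipeline would more likely use a pitch-preserving time-stretch. The point it shows is why the full clip is needed first: the factor depends on the measured length of the finished synthesis.

```python
def speed_factor(measured_seconds: float, desired_seconds: float) -> float:
    """Factor > 1 means the clip must be played faster to fit, < 1 slower."""
    if measured_seconds <= 0 or desired_seconds <= 0:
        raise ValueError("durations must be positive")
    return measured_seconds / desired_seconds


def fit_to_duration(samples: list[float], sample_rate: int, desired_seconds: float) -> list[float]:
    """Resample a fully generated clip so that it lasts desired_seconds.

    Naive linear interpolation for illustration only: it changes pitch
    along with speed.
    """
    measured_seconds = len(samples) / sample_rate
    factor = speed_factor(measured_seconds, desired_seconds)
    target_length = max(1, round(desired_seconds * sample_rate))
    last = len(samples) - 1
    stretched = []
    for i in range(target_length):
        position = i * factor              # where this output sample falls in the input
        low = min(int(position), last)
        high = min(low + 1, last)
        fraction = position - int(position)
        stretched.append(samples[low] * (1 - fraction) + samples[high] * fraction)
    return stretched
```

With a one-second test synthesis and a subtitle slot of half a second, the factor is 2.0 and the clip is compressed to half its length.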

ROBERT-MCDOWELL commented 1 day ago

Oh my! Sorry, I just realized the link you sent points to another repo. TurnVoice is already a very good start indeed! About realtime: indeed, only pre-synthesized chunks can do the trick. It won't be realtime, but more like 1 to 3 seconds of latency. Anyhow, even in an in-person meeting with a human translator there is always some latency ;).

ROBERT-MCDOWELL commented 1 day ago

@KoljaB I opened a new discussion on the TurnVoice repo about VTT/SRT import, as I think it's the better repo for this: add an option to import SRT/VTT instead of video/audio, then bypass STT and translation and keep TTS as the only step.