[Question] Is there a way to save the streamed audio to file?

KoljaB / RealtimeTTS

Converts text to speech in realtime


[Question] Is there a way to save the streamed audio to file? #12

Open salahzoubi opened 7 months ago

salahzoubi commented 7 months ago

I've managed to get the RealtimeTTS library to work, and I'm wondering if there's any way to save (or keep appending) audio chunks to a file as they come in, so that I can play them back later on. I want to listen to the output audio exactly as the stream would play it using the play_async() function.

Thanks!

KoljaB commented 7 months ago

Not possible yet, but this is a great idea. I'll put it on the roadmap.

salahzoubi commented 7 months ago

Great! I also see that you can enable a parameter that logs synthesized speech. How do you access that once it's been logged?

KoljaB commented 7 months ago

Not implemented currently. What would you think about a callback for every sentence? Do you think it should only be called after a sentence is synthesized, or would you prefer to also have a callback right before synthesis?

salahzoubi commented 7 months ago

@KoljaB sounds good! I think a callback once a sentence is synthesized is the better choice. Also, just wondering: when streaming text to streaming audio, does the play function wait for sentences to come in before it starts playing, if you select that option? Or has that not been implemented yet?

KoljaB commented 7 months ago

For every entity you feed into the stream (string or generator), after you call play() it will try to extract sentences from it until it can't anymore (the string is at its end or the generator is exhausted) and then synthesize the rest. So when used with LLMs, it will basically wait for a sentence to come in.
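To illustrate the behaviour described above, here is a minimal sketch of pulling complete sentences out of a chunked generator and flushing whatever is left once it is exhausted. It is not RealtimeTTS's actual code: `iter_sentences` and the simple punctuation-based boundary rule are assumptions made for the example, and the library's own sentence splitting is likely more sophisticated.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule for this sketch: a sentence ends at '.', '!' or '?'
# followed by whitespace. Not the library's real tokenizer.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they are available in the chunk stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Generator exhausted: hand over the rest, even without end punctuation.
    rest = buffer.strip()
    if rest:
        yield rest
```

With LLM-style input such as `iter(["Hello wor", "ld. How are", " you? I am", " fine"])`, the first sentence becomes available as soon as the second chunk arrives, which is why playback effectively waits for one full sentence rather than for the whole text.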

KoljaB commented 7 months ago

v0.3.0 now allows saving audio to a file, and the play methods have an on_sentence_synthesized callback. on_sentence_synthesized is called before the sentence has been completely played back, so it's not a "sentence was just played" callback (synthesis runs faster than playback when possible, and their timings are independent).
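The exact parameter names are not spelled out in this thread, so the following is only a standard-library sketch of the idea rather than the library's API: audio chunks are appended to a WAV file as they arrive, and a callback fires once each sentence's audio has been fully written. `save_stream_to_wav`, its signature, and the 16-bit mono PCM format are assumptions for illustration.

```python
import wave
from typing import Callable, Iterable, Optional


def save_stream_to_wav(
    sentences: Iterable[Iterable[bytes]],
    path: str,
    sample_rate: int = 16000,
    on_sentence_synthesized: Optional[Callable[[int], None]] = None,
) -> int:
    """Append 16-bit mono PCM chunks to a WAV file; return the number of frames written.

    Hypothetical helper: `sentences` yields, per sentence, an iterable of raw PCM chunks.
    """
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono (assumed)
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM (assumed)
        wav.setframerate(sample_rate)
        for index, chunks in enumerate(sentences):
            for chunk in chunks:
                wav.writeframes(chunk)  # appends and keeps the header consistent
                frames += len(chunk) // 2
            # Fires after the sentence's audio is written, independent of playback.
            if on_sentence_synthesized is not None:
                on_sentence_synthesized(index)
    return frames
```

Because writing is decoupled from playback here, the callback can run well before a listener would have heard the sentence, which mirrors the timing caveat above.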