Closed user080975 closed 3 months ago
You want to use the play or play_async method with the on_audio_chunk callback:
stream.play(on_audio_chunk=chunk_processor, muted=True)
Since engines deliver different sample rates, you then need to retrieve the sample rate of the TTS engine you use:
def chunk_processor(chunk):
    _, _, sample_rate = engine.get_stream_info()
You may need to resample the chunk to a target rate the client can play:
import librosa
import numpy as np
# Convert the 16-bit PCM bytes to float32 in the range [-1.0, 1.0)
audio_chunk = np.frombuffer(
    chunk,
    dtype=np.int16
).astype(np.float32) / 32768.0
# Resample to the desired target rate (for example 40000 Hz)
audio_chunk = librosa.resample(
    audio_chunk,
    orig_sr=sample_rate,
    target_sr=40000
)
Now you have a PCM chunk with a sample rate of your choice. You can send it straight to your JavaScript client and play it there. Or, if the JS client needs another format, you can convert the chunk before sending it.
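As a rough illustration of the conversion step, here is a minimal sketch that resamples a mono 16-bit PCM chunk and returns bytes again, ready to send. It deliberately swaps librosa.resample for plain linear interpolation so that it runs with the standard library only; that is lower quality (no anti-aliasing filter). The function name resample_pcm16 is hypothetical, and the sketch assumes the chunk is in the machine's native byte order (little-endian on typical hardware).

```python
from array import array


def resample_pcm16(chunk: bytes, orig_sr: int, target_sr: int) -> bytes:
    """Resample mono 16-bit PCM by linear interpolation.

    A crude stand-in for librosa.resample: no low-pass filtering is applied.
    """
    samples = array("h")  # signed 16-bit integers, native byte order
    samples.frombytes(chunk)
    if not samples or orig_sr == target_sr:
        return chunk

    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    step = orig_sr / target_sr  # input samples consumed per output sample
    last = len(samples) - 1

    out = array("h")
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        value = samples[left] * (1.0 - frac) + samples[right] * frac
        # Clamp to the int16 range before storing.
        out.append(max(-32768, min(32767, round(value))))
    return out.tobytes()
```

Inside the on_audio_chunk callback this could be called as resample_pcm16(chunk, sample_rate, 40000), with the result sent to the client as binary data.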
Note that the Elevenlabs engine is not supported for this right now. Unless you pay for the Creator tier or higher, it only delivers MP3 chunks, a format that is hard to convert to PCM. If you really need Elevenlabs and are willing to pay for Creator, I can guide you through the needed changes in the RealtimeTTS code to do this.
Thank you for your response! I will be using the Azure TTS api, do I need to make any specific changes in this case?
No. The Azure engine returns chunks as 16-bit mono PCM at 16 kHz, as configured in RealtimeTTS. If the client can play them, you can send the chunks straight from the callback, I think.
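If the client turns out not to accept raw PCM, one possible option (an assumption on my side, not something confirmed in this thread) is to wrap the chunks in a WAV container before sending. The sketch below uses only the standard library; the helper name pcm16_to_wav_bytes is hypothetical, and the defaults mirror the 16-bit mono 16 kHz format mentioned above.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM in a WAV container held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

Note that this produces one self-contained WAV file per call, so it suits buffered audio better than very small streaming chunks.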
Are there any examples of working Open AI chat completions streaming + real time tts speech playback? I tried the demo example but got this error:
AttributeError: 'OpenAI' object has no attribute 'ChatCompletion'
Try this example. OpenAI changed their API, which made some examples incompatible with the latest versions of their Python client.
Or, if you are already using that example, you may need to run pip install --upgrade openai. Also, don't name your file openai.py.
I tried that example with my OpenAI key but got this error. I'm sorry for the trouble, since I usually use Node JS and I'm not too familiar with Python.
File "/opt/homebrew/lib/python3.12/site-packages/RealtimeTTS/text_to_stream.py", line 171, in play
    for sentence in chunk_generator:
File "/opt/homebrew/lib/python3.12/site-packages/RealtimeTTS/text_to_stream.py", line 329, in _synthesis_chunk_generator
    for chunk in generator:
File "/opt/homebrew/lib/python3.12/site-packages/stream2sentence/stream2sentence.py", line 193, in generate_sentences
    for char in _generate_characters(generator, log_characters):
File "/opt/homebrew/lib/python3.12/site-packages/stream2sentence/stream2sentence.py", line 85, in _generate_characters
    for chunk in generator:
File "/opt/homebrew/lib/python3.12/site-packages/RealtimeTTS/threadsafe_generators.py", line 223, in __next__
    token = next(self.generator)
            ^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/RealtimeTTS/threadsafe_generators.py", line 152, in __next__
    self.iterated_text += char
TypeError: can only concatenate str (not "ChatCompletionChunk") to str
Is RealtimeTTS on latest version? pip install --upgrade RealtimeTTS
Yes I just installed RealtimeTTS today
Ok, I'll check that with the latest versions in a fresh venv, one second.
It takes another 30 minutes on my slow German connection to download the latest torch.
I'm nearly sure that RealtimeTTS is somehow not on the latest version though. Check this commit.
threadsafe_generators.py does not contain "self.iterated_text += char" on line 152 anymore, but did on versions <= 0.0.35.
Could you please check the RealtimeTTS version with pip show realtimetts? I believe an old cached version may have been installed today. It should display something like:
C:\>pip show realtimetts
Name: RealTimeTTS
Version: 0.3.42
[...]
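For context on the TypeError above: the old code path tried to concatenate each streamed item to a string, while the new OpenAI client yields ChatCompletionChunk objects. On an install that cannot be upgraded, one possible workaround is to feed plain strings instead. The sketch below is an assumption-laden illustration, not part of RealtimeTTS: text_deltas is a hypothetical helper, and it relies on the openai>=1.0 chunk layout (chunk.choices[0].delta.content).

```python
from typing import Iterable, Iterator


def text_deltas(chunks: Iterable) -> Iterator[str]:
    """Yield only the text pieces from a streamed chat completion."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices at all
        content = chunk.choices[0].delta.content
        if content:  # skip None and empty strings (e.g. role-only deltas)
            yield content
```

The result could then be passed to feed() in place of the raw stream, for example stream.feed(text_deltas(llm_stream)), where llm_stream is whatever client.chat.completions.create(..., stream=True) returned.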
Here's what I see:
Name: RealTimeTTS
Version: 0.1.3
Summary: *Stream text into audio with an easy-to-use, highly configurable library delivering voice output with minimal latency.
Home-page: https://github.com/KoljaB/RealTimeTTS
Author: Kolja Beigel
Author-email: kolja.beigel@web.de
License:
Location: /opt/homebrew/lib/python3.12/site-packages
Requires: azure-cognitiveservices-speech, elevenlabs, PyAudio, pyttsx3, requests, stream2sentence
Required-by:
But I did the installation like this: pip install RealtimeTTS
How can I install the latest version then?
Please try:
pip install RealTimeTTS --force-reinstall --upgrade --no-cache-dir --verbose
I'm unsure why it did not upgrade to the latest version automatically though. It did install 0.3.42 in my venv when I tried the same 30 minutes ago. It must be some issue with the cache or pip.
pip install RealTimeTTS==0.3.42 should also work btw
It's strange, I created a completely new virtual environment for python 3 and then ran this:
pip install RealTimeTTS --force-reinstall --upgrade --no-cache-dir --verbose
Didn't get any errors.
But when I run this:
pip show realtimetts
I still see:
Name: RealTimeTTS
Version: 0.1.3
But if I run this:
pip install RealTimeTTS==0.3.42
I get:
Collecting RealTimeTTS==0.3.42
  Using cached RealTimeTTS-0.3.42-py3-none-any.whl.metadata (18 kB)
Requirement already satisfied: requests in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (2.31.0)
Requirement already satisfied: PyAudio in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (0.2.14)
Requirement already satisfied: pyttsx3 in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (2.90)
Requirement already satisfied: stream2sentence==0.2.2 in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (0.2.2)
Requirement already satisfied: azure-cognitiveservices-speech in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (1.36.0)
Requirement already satisfied: elevenlabs in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (0.2.27)
INFO: pip is looking at multiple versions of realtimetts to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 0.0.10.2 Requires-Python >=3.6.0, <3.9; 0.0.10.3 Requires-Python >=3.6.0, <3.9; 0.0.11 Requires-Python >=3.6.0, <3.9; 0.0.12 Requires-Python >=3.6.0, <3.9; 0.0.13.1 Requires-Python >=3.6.0, <3.9; 0.0.13.2 Requires-Python >=3.6.0, <3.9; 0.0.14.1 Requires-Python >=3.6.0, <3.9; 0.0.15 Requires-Python >=3.6.0, <3.9; 0.0.15.1 Requires-Python >=3.6.0, <3.9; 0.0.9 Requires-Python >=3.6.0, <3.9; 0.0.9.1 Requires-Python >=3.6.0, <3.9; 0.0.9.2 Requires-Python >=3.6.0, <3.9; 0.0.9a10 Requires-Python >=3.6.0, <3.9; 0.0.9a9 Requires-Python >=3.6.0, <3.9; 0.1.0 Requires-Python >=3.6.0, <3.10; 0.1.1 Requires-Python >=3.6.0, <3.10; 0.1.2 Requires-Python >=3.6.0, <3.10; 0.1.3 Requires-Python >=3.6.0, <3.10; 0.10.0 Requires-Python >=3.7.0, <3.11; 0.10.1 Requires-Python >=3.7.0, <3.11; 0.10.2 Requires-Python >=3.7.0, <3.11; 0.11.0 Requires-Python >=3.7.0, <3.11; 0.11.1 Requires-Python >=3.7.0, <3.11; 0.12.0 Requires-Python >=3.7.0, <3.11; 0.13.0 Requires-Python >=3.7.0, <3.11; 0.13.1 Requires-Python >=3.7.0, <3.11; 0.13.2 Requires-Python >=3.7.0, <3.11; 0.13.3 Requires-Python >=3.7.0, <3.11; 0.14.0 Requires-Python >=3.7.0, <3.11; 0.14.2 Requires-Python >=3.7.0, <3.11; 0.14.3 Requires-Python >=3.7.0, <3.11; 0.15.0 Requires-Python >=3.9.0, <3.12; 0.15.1 Requires-Python >=3.9.0, <3.12; 0.15.2 Requires-Python >=3.9.0, <3.12; 0.15.4 Requires-Python >=3.9.0, <3.12; 0.15.5 Requires-Python >=3.9.0, <3.12; 0.15.6 Requires-Python >=3.9.0, <3.12; 0.16.0 Requires-Python >=3.9.0, <3.12; 0.16.1 Requires-Python >=3.9.0, <3.12; 0.16.3 Requires-Python >=3.9.0, <3.12; 0.16.4 Requires-Python >=3.9.0, <3.12; 0.16.5 Requires-Python >=3.9.0, <3.12; 0.16.6 Requires-Python >=3.9.0, <3.12; 0.17.0 Requires-Python >=3.9.0, <3.12; 0.17.1 Requires-Python >=3.9.0, <3.12; 0.17.2 Requires-Python >=3.9.0, <3.12; 0.17.4 Requires-Python >=3.9.0, <3.12; 0.17.5 Requires-Python >=3.9.0, <3.12; 0.17.6 Requires-Python >=3.9.0, <3.12; 0.17.7 Requires-Python >=3.9.0, <3.12; 0.17.8 Requires-Python >=3.9.0, <3.12; 0.17.9 Requires-Python >=3.9.0, <3.12; 0.18.0 Requires-Python >=3.9.0, <3.12; 0.18.1 Requires-Python >=3.9.0, <3.12; 0.18.2 Requires-Python >=3.9.0, <3.12; 0.19.0 Requires-Python >=3.9.0, <3.12; 0.19.1 Requires-Python >=3.9.0, <3.12; 0.2.0 Requires-Python >=3.6.0, <3.10; 0.2.1 Requires-Python >=3.6.0, <3.10; 0.2.2 Requires-Python >=3.6.0, <3.10; 0.20.0 Requires-Python >=3.9.0, <3.12; 0.20.1 Requires-Python >=3.9.0, <3.12; 0.20.2 Requires-Python >=3.9.0, <3.12; 0.20.3 Requires-Python >=3.9.0, <3.12; 0.20.4 Requires-Python >=3.9.0, <3.12; 0.20.5 Requires-Python >=3.9.0, <3.12; 0.20.6 Requires-Python >=3.9.0, <3.12; 0.21.0 Requires-Python >=3.9.0, <3.12; 0.21.1 Requires-Python >=3.9.0, <3.12; 0.21.2 Requires-Python >=3.9.0, <3.12; 0.21.3 Requires-Python >=3.9.0, <3.12; 0.22.0 Requires-Python >=3.9.0, <3.12; 0.3.0 Requires-Python >=3.6.0, <3.10; 0.3.1 Requires-Python >=3.6.0, <3.10; 0.4.0 Requires-Python >=3.6.0, <3.10; 0.4.1 Requires-Python >=3.6.0, <3.10; 0.4.2 Requires-Python >=3.6.0, <3.10; 0.5.0 Requires-Python >=3.6.0, <3.10; 0.6.0 Requires-Python >=3.6.0, <3.10; 0.6.1 Requires-Python >=3.6.0, <3.10; 0.6.2 Requires-Python >=3.6.0, <3.10; 0.7.0 Requires-Python >=3.7.0, <3.11; 0.7.1 Requires-Python >=3.7.0, <3.11; 0.8.0 Requires-Python >=3.7.0, <3.11; 0.9.0 Requires-Python >=3.7.0, <3.11
ERROR: Could not find a version that satisfies the requirement TTS==0.22.0 (from realtimetts) (from versions: none)
ERROR: No matching distribution found for TTS==0.22.0
Okay so apparently just manually downloading the codebase from GitHub and manually replacing the folder works 😂
I made a test run and the streaming text works great, but the talking speed seems much faster than the default Azure TTS output for some reason.
And if the chat completion output includes partly English and partly Chinese text, the speech seems to stop.
For example if you try this prompt:
Generate a long paragraph in Chinese describing the university of saint gallen
The text output proceeds fine, but the speech output stops after "St.":
圣加仑大学(University of St. Gallen)是瑞士最负盛名的一所高等教育机构之一....
Ok, it looks like TTS, one of the libraries RealtimeTTS depends on, cannot be installed with pip on Python 3.12.
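The pip log above shows why: every listed TTS release declares a Requires-Python range that ends below 3.12 (for example ">=3.9.0, <3.12"). The following sketch checks an interpreter version against such a range. It is a simplified illustration using only the standard library; the function name satisfies is hypothetical, and it handles only plain numeric versions with the six basic comparison operators, not the full version specifier grammar that pip implements.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
_OPS = [
    (">=", operator.ge), ("<=", operator.le),
    ("==", operator.eq), ("!=", operator.ne),
    (">", operator.gt), ("<", operator.lt),
]


def _parse(version: str) -> tuple:
    """Turn '3.9' or '3.9.0' into a tuple padded to three numbers."""
    parts = [int(p) for p in version.strip().split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def satisfies(python_version: str, requires_python: str) -> bool:
    """Return True if python_version meets every comma-separated clause."""
    current = _parse(python_version)
    for clause in requires_python.split(","):
        clause = clause.strip()
        for symbol, compare in _OPS:
            if clause.startswith(symbol):
                if not compare(current, _parse(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True
```

With the range quoted in the log, satisfies("3.12", ">=3.9.0, <3.12") is False while satisfies("3.11", ">=3.9.0, <3.12") is True, which matches pip ignoring those releases on Python 3.12.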
The talking speed can be changed in the constructor of AzureEngine with the rate parameter. You are right, I somehow defaulted to 20% faster (it should be 0%, of course). Mixed language won't work with Azure, since every voice is trained on one language.
I notice that even if the whole stream content is 100% Chinese, whenever the TTS encounters sentence-ending punctuation, e.g. 。 (the Chinese equivalent of .), it stops all speech until the text streaming has finished completely, and only then resumes. This effectively means that the whole streamed text can't contain any punctuation apart from commas. Would you be able to try this on your end and observe this effect?
A simple prompt: Generate a story in Chinese
I'm using this exact code by the way:
from openai import OpenAI
from RealtimeTTS import TextToAudioStream, AzureEngine

if __name__ == '__main__':
    client = OpenAI(api_key='sk-')
    stream = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": "Generate a story in Chinese"}],
        stream=True,
    )
    TextToAudioStream(AzureEngine(speech_key='', service_region='westeurope', voice='zh-CN-XiaochenMultilingualNeural', rate='0'), log_characters=True).feed(stream).play()
For Chinese it's best to use the stanza tokenizer, which handles the 。 better than the default nltk, which is specialized for English. So you'd initialize TextToAudioStream with tokenizer="stanza".
In stream.play you'd then also want to pass the parameters tokenizer="stanza", context_size=2 (because we don't need such a big lookahead of characters in Chinese, we can reduce latency by lowering the sentence-end detection context). For the Coqui engine in Chinese you would also use the language="zh" parameter, but for Azure this is not needed. You'd only need to pick a zh-CN voice for Azure.
What if the content contains multilingual content? e.g: content generated by GPT can contain both English and Chinese. In this case should the above settings still be applied?
The tokenizer would still be stanza; it would handle this. TTS of multilingual content is a problem for most engines. Elevenlabs multilingual models can handle it at a basic level (but would still stumble quite often in practice with intensely mixed languages). The Azure, System and Coqui engines can't synthesize mixed-language sentences. I'm not sure about OpenAI; maybe they can.
I tried with the stanza settings like this but the outputted speech still stops at 。
TextToAudioStream(AzureEngine(speech_key='', service_region='westeurope', voice='zh-CN-XiaochenMultilingualNeural', rate='10', tokenizer="stanza"), log_characters=True).feed(stream).play(tokenizer="stanza", context_size=2)
Alternatively, would it be possible to edit the stream content to replace all 。 with . before passing it to play(), which would solve this problem?
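For reference, the replacement idea in the question could be sketched as a small generator that rewrites full-width sentence endings while the text streams through. This is only an illustration of the idea, with a hypothetical helper name (normalize_sentence_endings); it assumes the stream yields plain strings, and whether it actually avoids the pause was not tested in this thread.

```python
from typing import Iterable, Iterator

# Map full-width sentence endings to ASCII punctuation followed by a space.
_ENDINGS = str.maketrans({"。": ". ", "！": "! ", "？": "? "})


def normalize_sentence_endings(text_stream: Iterable[str]) -> Iterator[str]:
    """Yield each text piece with Chinese sentence endings replaced."""
    for piece in text_stream:
        yield piece.translate(_ENDINGS)
```

The wrapped generator would be passed to feed() instead of the original text stream.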
Try passing tokenizer="stanza" to the constructor of TextToAudioStream; in your code it is in the AzureEngine constructor.
Okay, so I manually edited the TextToAudioStream code directly and changed every tokenizer from nltk to stanza, and this now fixes the Chinese speech problem. But when outputting English, the exact same issue happens in reverse: all speech is paused on any comma or dot. So currently it's not possible to output any mixed / multilingual content as speech?
I don't think the issue is with the Azure TTS API, since the voice zh-CN-XiaochenMultilingualNeural is able to handle multiple languages in the Azure speech studio.
There should be no need to change the TextToAudioStream code, you could just create TextToAudioStream with parameter tokenizer="stanza".
Try setting the language of stanza to "multilingual" with the language parameter in the TextToAudioStream constructor. You may need to set it to "en" once before that too, just to download the English model.
So something like:
TextToAudioStream(AzureEngine(speech_key='', service_region='westeurope', voice='zh-CN-XiaochenMultilingualNeural', rate='10'), log_characters=True, tokenizer="stanza", language="multilingual").feed(stream).play(tokenizer="stanza", context_size=2)
Okay, apparently changing to context_size=30 helped solve the issue. But after switching to "stanza", whenever the code is called, there's a 5+ second delay just to re-load the required resources each time, even though they've already been downloaded. Is there any way to "cache" them or keep them in memory without this loading step each time? I need to combine this with STT later for two-way communication, and this sort of breaks it.
I do see the option download_method=DownloadMethod.REUSE_RESOURCES for this, but I'm not sure where I should add this parameter.
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 4.19MB/s]
INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /Users/mac/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /Users/mac/stanza_resources.
INFO:stanza:Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 9.91MB/s]
INFO:stanza:Loading these models for language: en (English):
======================================
| Processor | Package |
--------------------------------------
| tokenize | combined |
| pos | combined_charlm |
| lemma | combined_nocharlm |
| constituency | ptb3-revised_charlm |
| depparse | combined_charlm |
| sentiment | sstplus |
| ner | ontonotes_charlm |
======================================
INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!
No idea about this. This is quite stanza-specific, and I have not worked with this tokenizer in a long time. I can't remember any major issues. Maybe checking their GitHub helps?
How do you feed, by the way? You should reuse the TextToAudioStream object:
stream = TextToAudioStream(AzureEngine(speech_key='', service_region='westeurope', voice='zh-CN-XiaochenMultilingualNeural', rate='10'), log_characters=True, tokenizer="stanza", language="multilingual")
stream.feed(llm_stream)
stream.play(tokenizer="stanza", context_size=2)
So in practice the delay should only happen once at startup, not every time you use the stream.
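The reuse advice boils down to constructing the expensive object once and handing out the same instance afterwards. A generic way to sketch that pattern is a cached factory. In the example below, FakeStream is a hypothetical stand-in for an object with a slow constructor (it is not the RealtimeTTS class), and get_stream is likewise an illustrative name.

```python
from functools import lru_cache


class FakeStream:
    """Hypothetical stand-in for an object that is expensive to construct."""

    instances = 0  # counts how many times the constructor ran

    def __init__(self, tokenizer: str) -> None:
        FakeStream.instances += 1
        self.tokenizer = tokenizer
        self.fed = []

    def feed(self, text: str) -> "FakeStream":
        self.fed.append(text)
        return self


@lru_cache(maxsize=None)
def get_stream(tokenizer: str = "stanza") -> FakeStream:
    """Build the stream on first use, then return the cached instance."""
    return FakeStream(tokenizer)
```

Each request handler would call get_stream() and then feed and play on the returned object, so the loading cost is paid only on the first call.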
Awesome~
Hi! I'm not sure if this is the correct place to ask for this since it's not an issue, but I would like to learn more about how to accomplish streaming the audio to a site or browser plugin with javascript. I would prefer to use coqui as the engine.
I don't really know any JavaScript, but using some code from ChatGPT I have an HTML page with JavaScript that starts audio generation through a websocket. I still haven't quite understood what I need to do to stream the audio output back to the browser. Would it be possible for you to help me get started with it? @KoljaB
Here's what I'm currently working with:
import asyncio
import websockets
from RealtimeTTS import TextToAudioStream, CoquiEngine
import logging

async def audio_stream(websocket, path):
    def read_file():
        file = open('text_to_read.txt','r')
        content = file.read()
        yield content

    def chunk_processor(chunk):
        _, _, sample_rate = engine.get_stream_info()

    logging.basicConfig(level=logging.INFO)
    engine = CoquiEngine(level=logging.INFO)
    stream = TextToAudioStream(engine)
    async for audio_chunk in stream.feed(read_file()).play(log_synthesized_text=True, on_audio_chunk=chunk_processor, muted=True):
        await websocket.send(audio_chunk)

async def start_server():
    async with websockets.serve(audio_stream, "localhost", 8765):
        print("WebSocket server started on ws://localhost:8765")
        await asyncio.Future() # run forever

if __name__ == "__main__":
    asyncio.run(start_server())
Thank you for the great work on this already! The speed of this version compared to standard Coqui is absolutely a game changer!
Ok, I'll try to come up with a code example of how to do this in the next few days.
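Until such an example exists, here is a hedged sketch of the general pattern, not the maintainer's solution. It assumes the play call blocks and delivers audio through a callback on a worker thread, so the chunks have to be handed over to the asyncio event loop before they can be sent on a websocket. fake_play is a hypothetical stand-in for something like stream.play(on_audio_chunk=..., muted=True), and send stands in for websocket.send; only the standard library is used.

```python
import asyncio
import threading
from typing import Awaitable, Callable


def fake_play(on_audio_chunk: Callable[[bytes], None]) -> None:
    """Stand-in for a blocking play call that reports audio via a callback."""
    for i in range(3):
        on_audio_chunk(bytes([i]) * 4)


async def stream_chunks(
    play: Callable[[Callable[[bytes], None]], None],
    send: Callable[[bytes], Awaitable[None]],
) -> None:
    """Run play() in a thread and forward every chunk to the async send()."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def on_audio_chunk(chunk: bytes) -> None:
        # Called from the worker thread: hand the chunk to the event loop.
        loop.call_soon_threadsafe(queue.put_nowait, chunk)

    def worker() -> None:
        try:
            play(on_audio_chunk)
        finally:
            # None marks the end of the audio, even if play() raised.
            loop.call_soon_threadsafe(queue.put_nowait, None)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        await send(chunk)
```

Inside a websocket handler this would be awaited as stream_chunks(play_fn, websocket.send), where play_fn wraps the real play call. The browser side still has to decode and schedule the PCM data, which this sketch does not cover.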
Hi,
Is it possible to call this via frontend JS and then stream audio directly to browser for playback? If so, what should the approach be?
Thank You!