KoljaB / RealtimeTTS

Converts text to speech in realtime

Stream audio to frontend JS in browser? #51

Closed user080975 closed 3 months ago

user080975 commented 3 months ago

Hi,

Is it possible to call this via frontend JS and then stream audio directly to browser for playback? If so, what should the approach be?

Thank You!

KoljaB commented 3 months ago

You want to use the play or play_async method with the on_audio_chunk callback:

stream.play(on_audio_chunk=chunk_processor, muted=True)

Since every engine delivers different sample rates, you then need to retrieve the sample rate for the tts engine you use:

def chunk_processor(chunk):
    _, _, sample_rate = engine.get_stream_info()

Maybe you need to resample the chunk to the target rate the client can play:

import librosa
import numpy as np

# Convert the 16-bit PCM bytes to float32 in the range [-1.0, 1.0)
audio_chunk = np.frombuffer(
    chunk,
    dtype=np.int16
).astype(np.float32) / 32768.0

# Resample to the desired target rate (for example 40000 Hz)
audio_chunk = librosa.resample(
    audio_chunk,
    orig_sr=sample_rate,
    target_sr=40000
)

Now you have a PCM chunk with a sample rate of your choice. You can send it straight to your JavaScript client and play it there. If the JS client needs another format, you can convert the chunk before sending it.
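Editor's sketch: if pulling in librosa is not an option, the same resampling step can be approximated with the standard library alone. The helper below is hypothetical (it is not part of RealtimeTTS). It swaps librosa's resampler for plain linear interpolation, which is lower quality, and assumes mono 16-bit PCM in the machine's native byte order. It returns 16-bit PCM bytes again, so the result can be sent to the client as-is.

```python
from array import array


def resample_pcm16(chunk: bytes, orig_sr: int, target_sr: int) -> bytes:
    """Resample mono 16-bit PCM by linear interpolation (illustrative only)."""
    samples = array("h")          # "h" = signed 16-bit, native byte order
    samples.frombytes(chunk)
    if orig_sr == target_sr or len(samples) < 2:
        return chunk

    out_len = max(1, round(len(samples) * target_sr / orig_sr))
    step = orig_sr / target_sr    # input samples advanced per output sample
    last = len(samples) - 1
    out = array("h")
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Weighted average of the two neighbouring input samples
        value = samples[left] * (1.0 - frac) + samples[right] * frac
        out.append(int(round(value)))
    return out.tobytes()
```

Inside chunk_processor this would be called as resample_pcm16(chunk, sample_rate, target_rate), with target_rate chosen to match what the browser plays.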

Note that the Elevenlabs engine is not supported for this right now. Unless you pay for the Creator tier or higher, it only delivers MP3 chunks, a format that is hard to convert to PCM. If you really need Elevenlabs and are willing to pay for Creator, I can guide you through the needed changes in the RealtimeTTS code.

user080975 commented 3 months ago

Thank you for your response! I will be using the Azure TTS api, do I need to make any specific changes in this case?

KoljaB commented 3 months ago

No. Azure engine will return chunks in pcm 16 bit mono 16 kHz as configured in RealtimeTTS. If the client can play them you can send the chunks straight away from the callback, I think.

user080975 commented 3 months ago

Are there any examples of working Open AI chat completions streaming + real time tts speech playback? I tried the demo example but got this error: AttributeError: 'OpenAI' object has no attribute 'ChatCompletion'

KoljaB commented 3 months ago

Try this example. OpenAI changed their API, which made some examples incompatible with the latest versions of their Python client.

KoljaB commented 3 months ago

Or maybe if you are already using that you need to pip install --upgrade openai. Also don't name your file openai.py.

user080975 commented 3 months ago

I tried that example with my OpenAI key but got this error. I'm sorry for the trouble, since I usually use Node JS and I'm not too familiar with Python.

  File "/opt/homebrew/lib/python3.12/site-packages/RealtimeTTS/text_to_stream.py", line 171, in play
    for sentence in chunk_generator:
  File "/opt/homebrew/lib/python3.12/site-packages/RealtimeTTS/text_to_stream.py", line 329, in _synthesis_chunk_generator
    for chunk in generator:
  File "/opt/homebrew/lib/python3.12/site-packages/stream2sentence/stream2sentence.py", line 193, in generate_sentences
    for char in _generate_characters(generator, log_characters):
  File "/opt/homebrew/lib/python3.12/site-packages/stream2sentence/stream2sentence.py", line 85, in _generate_characters
    for chunk in generator:
  File "/opt/homebrew/lib/python3.12/site-packages/RealtimeTTS/threadsafe_generators.py", line 223, in __next__
    token = next(self.generator)
            ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/RealtimeTTS/threadsafe_generators.py", line 152, in __next__
    self.iterated_text += char
TypeError: can only concatenate str (not "ChatCompletionChunk") to str
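Editor's sketch: the TypeError says that raw ChatCompletionChunk objects reached code that expects plain strings. One possible workaround, independent of the installed RealtimeTTS version, is to feed a generator that yields only the text of each chunk. The helper name text_from_chat_stream is hypothetical. The SimpleNamespace objects are stand-ins that mimic the choices[0].delta.content shape of the OpenAI v1 client's streaming chunks so the snippet runs without an API key.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def text_from_chat_stream(chunks: Iterable) -> Iterator[str]:
    """Yield only the text pieces from a streamed chat completion."""
    for chunk in chunks:
        if not chunk.choices:
            continue                      # some chunks carry no choices
        content = chunk.choices[0].delta.content
        if content:                       # content may be None or empty
            yield content


def _fake_chunk(text):
    """Stand-in for a ChatCompletionChunk (assumption, for illustration)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


pieces = list(text_from_chat_stream(
    [_fake_chunk("Hi"), _fake_chunk(None), _fake_chunk(" there")]
))
```

With a real client, the result of client.chat.completions.create(..., stream=True) would be passed in place of the fake chunks, and the generator handed to feed().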

KoljaB commented 3 months ago

Is RealtimeTTS on latest version? pip install --upgrade RealtimeTTS

user080975 commented 3 months ago

Yes I just installed RealtimeTTS today

KoljaB commented 3 months ago

Ok, I'll check that with the latest versions in a fresh venv, one second.

KoljaB commented 3 months ago

Takes another 30 minutes on my slow german connection to download latest torch.

Nearly sure that RealtimeTTS is somehow not on latest version though. Check this commit.

threadsafe_generators.py does not contain "self.iterated_text += char" on line 152 anymore, but did on versions <= 0.0.35.

Could you please check the RealtimeTTS version with pip show realtimetts? I believe an old cached version may have been installed today. It should display something like:

C:\>pip show realtimetts
Name: RealTimeTTS
Version: 0.3.42
[...]

user080975 commented 3 months ago

Here's what I see:

Name: RealTimeTTS
Version: 0.1.3
Summary: *Stream text into audio with an easy-to-use, highly configurable library delivering voice output with minimal latency.
Home-page: https://github.com/KoljaB/RealTimeTTS
Author: Kolja Beigel
Author-email: kolja.beigel@web.de
License:
Location: /opt/homebrew/lib/python3.12/site-packages
Requires: azure-cognitiveservices-speech, elevenlabs, PyAudio, pyttsx3, requests, stream2sentence
Required-by:

But I did the installation like this: pip install RealtimeTTS. How can I install the latest version then?

KoljaB commented 3 months ago

Please try:

pip install RealTimeTTS --force-reinstall --upgrade --no-cache-dir --verbose

KoljaB commented 3 months ago

Unsure why it did not upgrade to the latest version automatically though. It did install 0.3.42 on my venv when I tried the same 30 minutes ago. Must be some issues with cache or pip.

KoljaB commented 3 months ago

pip install RealTimeTTS==0.3.42 should also work btw

user080975 commented 3 months ago

It's strange, I created a completely new virtual environment for python 3 and then ran this: pip install RealTimeTTS --force-reinstall --upgrade --no-cache-dir --verbose

Didn't get any errors.

But when I run this: pip show realtimetts

I still see: Name: RealTimeTTS Version: 0.1.3

user080975 commented 3 months ago

But if I run this: pip install RealTimeTTS==0.3.42

I get:

Collecting RealTimeTTS==0.3.42
  Using cached RealTimeTTS-0.3.42-py3-none-any.whl.metadata (18 kB)
Requirement already satisfied: requests in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (2.31.0)
Requirement already satisfied: PyAudio in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (0.2.14)
Requirement already satisfied: pyttsx3 in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (2.90)
Requirement already satisfied: stream2sentence==0.2.2 in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (0.2.2)
Requirement already satisfied: azure-cognitiveservices-speech in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (1.36.0)
Requirement already satisfied: elevenlabs in ./myenv/lib/python3.12/site-packages (from RealTimeTTS==0.3.42) (0.2.27)
INFO: pip is looking at multiple versions of realtimetts to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 0.0.10.2 Requires-Python >=3.6.0, <3.9; 0.0.10.3 Requires-Python >=3.6.0, <3.9; 0.0.11 Requires-Python >=3.6.0, <3.9; 0.0.12 Requires-Python >=3.6.0, <3.9; 0.0.13.1 Requires-Python >=3.6.0, <3.9; 0.0.13.2 Requires-Python >=3.6.0, <3.9; 0.0.14.1 Requires-Python >=3.6.0, <3.9; 0.0.15 Requires-Python >=3.6.0, <3.9; 0.0.15.1 Requires-Python >=3.6.0, <3.9; 0.0.9 Requires-Python >=3.6.0, <3.9; 0.0.9.1 Requires-Python >=3.6.0, <3.9; 0.0.9.2 Requires-Python >=3.6.0, <3.9; 0.0.9a10 Requires-Python >=3.6.0, <3.9; 0.0.9a9 Requires-Python >=3.6.0, <3.9; 0.1.0 Requires-Python >=3.6.0, <3.10; 0.1.1 Requires-Python >=3.6.0, <3.10; 0.1.2 Requires-Python >=3.6.0, <3.10; 0.1.3 Requires-Python >=3.6.0, <3.10; 0.10.0 Requires-Python >=3.7.0, <3.11; 0.10.1 Requires-Python >=3.7.0, <3.11; 0.10.2 Requires-Python >=3.7.0, <3.11; 0.11.0 Requires-Python >=3.7.0, <3.11; 0.11.1 Requires-Python >=3.7.0, <3.11; 0.12.0 Requires-Python >=3.7.0, <3.11; 0.13.0 Requires-Python >=3.7.0, <3.11; 0.13.1 Requires-Python >=3.7.0, <3.11; 0.13.2 Requires-Python >=3.7.0, <3.11; 0.13.3 Requires-Python >=3.7.0, <3.11; 0.14.0 Requires-Python >=3.7.0, <3.11; 0.14.2 Requires-Python >=3.7.0, <3.11; 0.14.3 Requires-Python >=3.7.0, <3.11; 0.15.0 Requires-Python >=3.9.0, <3.12; 0.15.1 Requires-Python >=3.9.0, <3.12; 0.15.2 Requires-Python >=3.9.0, <3.12; 0.15.4 Requires-Python >=3.9.0, <3.12; 0.15.5 Requires-Python >=3.9.0, <3.12; 0.15.6 Requires-Python >=3.9.0, <3.12; 0.16.0 Requires-Python >=3.9.0, <3.12; 0.16.1 Requires-Python >=3.9.0, <3.12; 0.16.3 Requires-Python >=3.9.0, <3.12; 0.16.4 Requires-Python >=3.9.0, <3.12; 0.16.5 Requires-Python >=3.9.0, <3.12; 0.16.6 Requires-Python >=3.9.0, <3.12; 0.17.0 Requires-Python >=3.9.0, <3.12; 0.17.1 Requires-Python >=3.9.0, <3.12; 0.17.2 Requires-Python >=3.9.0, <3.12; 0.17.4 Requires-Python >=3.9.0, <3.12; 0.17.5 Requires-Python >=3.9.0, <3.12; 0.17.6 Requires-Python >=3.9.0, <3.12; 0.17.7 Requires-Python >=3.9.0, <3.12; 0.17.8 Requires-Python >=3.9.0, <3.12; 0.17.9 Requires-Python >=3.9.0, <3.12; 0.18.0 Requires-Python >=3.9.0, <3.12; 0.18.1 Requires-Python >=3.9.0, <3.12; 0.18.2 Requires-Python >=3.9.0, <3.12; 0.19.0 Requires-Python >=3.9.0, <3.12; 0.19.1 Requires-Python >=3.9.0, <3.12; 0.2.0 Requires-Python >=3.6.0, <3.10; 0.2.1 Requires-Python >=3.6.0, <3.10; 0.2.2 Requires-Python >=3.6.0, <3.10; 0.20.0 Requires-Python >=3.9.0, <3.12; 0.20.1 Requires-Python >=3.9.0, <3.12; 0.20.2 Requires-Python >=3.9.0, <3.12; 0.20.3 Requires-Python >=3.9.0, <3.12; 0.20.4 Requires-Python >=3.9.0, <3.12; 0.20.5 Requires-Python >=3.9.0, <3.12; 0.20.6 Requires-Python >=3.9.0, <3.12; 0.21.0 Requires-Python >=3.9.0, <3.12; 0.21.1 Requires-Python >=3.9.0, <3.12; 0.21.2 Requires-Python >=3.9.0, <3.12; 0.21.3 Requires-Python >=3.9.0, <3.12; 0.22.0 Requires-Python >=3.9.0, <3.12; 0.3.0 Requires-Python >=3.6.0, <3.10; 0.3.1 Requires-Python >=3.6.0, <3.10; 0.4.0 Requires-Python >=3.6.0, <3.10; 0.4.1 Requires-Python >=3.6.0, <3.10; 0.4.2 Requires-Python >=3.6.0, <3.10; 0.5.0 Requires-Python >=3.6.0, <3.10; 0.6.0 Requires-Python >=3.6.0, <3.10; 0.6.1 Requires-Python >=3.6.0, <3.10; 0.6.2 Requires-Python >=3.6.0, <3.10; 0.7.0 Requires-Python >=3.7.0, <3.11; 0.7.1 Requires-Python >=3.7.0, <3.11; 0.8.0 Requires-Python >=3.7.0, <3.11; 0.9.0 Requires-Python >=3.7.0, <3.11
ERROR: Could not find a version that satisfies the requirement TTS==0.22.0 (from realtimetts) (from versions: none)
ERROR: No matching distribution found for TTS==0.22.0

user080975 commented 3 months ago

Okay so apparently just manually downloading the codebase from GitHub and manually replacing the folder works 😂

I made a test run and the streaming text works great, but the talking speed seems much faster than default Azure TTS output for some reason.

And if the chat completion output includes partly English and partly Chinese text, speech seems to stop.

For example, if you try this prompt: Generate a long paragraph in Chinese describing the university of saint gallen

The text output proceeds fine, but speech output stops after "St."

圣加仑大学(University of St. Gallen)是瑞士最负盛名的一所高等教育机构之一....

KoljaB commented 3 months ago

Ok, it looks like pip cannot install TTS, one of the dependencies of RealtimeTTS, on Python 3.12.
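Editor's sketch: the pip log above lists TTS 0.22.0 with Requires-Python >=3.9.0, <3.12, which is why the install fails on 3.12. A quick way to check an interpreter against that range before installing is shown below. The function name is hypothetical, and the bounds are copied from the log, so they only apply to that TTS release.

```python
import sys


def tts_0_22_supports(version=sys.version_info) -> bool:
    """True if the interpreter falls in the >=3.9, <3.12 range from the pip log."""
    major_minor = tuple(version[:2])
    return (3, 9) <= major_minor < (3, 12)


if not tts_0_22_supports():
    print("This Python version is outside the range TTS 0.22.0 declares.")
```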

Talking speed can be changed in constructor of AzureEngine with the rate parameter. You are right, I somehow defaulted to 20% faster (should be 0% of course). Mixed language won't work with Azure, since every voice is trained on one language.

user080975 commented 3 months ago

I notice that even if the whole stream content is 100% Chinese, whenever the TTS encounters a sentence-ending punctuation mark, e.g. 。, which is the Chinese equivalent of ., it stops all speech until the text streaming has finished completely, and only then resumes. This effectively means that the streamed text can't contain any punctuation apart from commas. Would you be able to try this on your end and observe this effect?

A simple prompt: Generate a story in Chinese

user080975 commented 3 months ago

I'm using this exact code by the way:

    if __name__ == '__main__':
        client = OpenAI(api_key='sk-')

        stream = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "user", "content": "Generate a story in Chinese"}],
            stream=True,
        )

        TextToAudioStream(AzureEngine(speech_key='', service_region='westeurope', voice='zh-CN-XiaochenMultilingualNeural', rate='0'), log_characters=True).feed(stream).play()

KoljaB commented 3 months ago

For Chinese it's best to use the stanza tokenizer, which handles the 。 better than the default nltk, which is specialized for English. So you'd initialize TextToAudioStream with tokenizer="stanza".

In stream.play you'd then also want to use the parameters tokenizer="stanza" and context_size=2. Because Chinese doesn't need such a big character lookahead, lowering the sentence-end detection context reduces latency. For the Coqui engine in Chinese you would also use the language="zh" parameter, but for Azure this is not needed. You'd only need to pick a zh-CN voice for Azure.

user080975 commented 3 months ago

What if the content contains multilingual content? e.g: content generated by GPT can contain both English and Chinese. In this case should the above settings still be applied?

KoljaB commented 3 months ago

The tokenizer would still be stanza; it would handle this. TTS of multilingual content is a problem for most engines. Elevenlabs multilingual models can handle it at a basic level (but would still stumble quite often in practice with intensely mixed languages). The Azure, System and Coqui engines can't synthesize mixed-language sentences. Not sure about OpenAI, maybe they can.

user080975 commented 3 months ago

I tried with the stanza settings like this, but the output speech still stops at 。:

TextToAudioStream(AzureEngine(speech_key='', service_region='westeurope', voice='zh-CN-XiaochenMultilingualNeural', rate='10', tokenizer="stanza"), log_characters=True).feed(stream).play(tokenizer="stanza", context_size=2)

Alternatively, would it be possible to edit the stream content to replace all 。 with . before passing it to play(), which would solve this problem?
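Editor's sketch: the replacement idea in the question can be expressed as a small generator that rewrites full-width sentence punctuation before the text reaches feed(). This is an untested workaround, not something the maintainer recommends in this thread. The function name and the choice of characters to map are assumptions, and whether it helps depends on the tokenizer in use.

```python
from typing import Iterable, Iterator

# Full-width sentence punctuation mapped to ASCII equivalents plus a space,
# so an English-oriented sentence splitter can see the boundary.
_PUNCTUATION = str.maketrans({"。": ". ", "！": "! ", "？": "? "})


def normalize_cjk_punctuation(text_stream: Iterable[str]) -> Iterator[str]:
    """Rewrite each streamed text piece, leaving all other characters alone."""
    for piece in text_stream:
        yield piece.translate(_PUNCTUATION)


normalized = "".join(normalize_cjk_punctuation(["你好。", "再见！"]))
```

The generator takes plain strings, so an LLM stream would first have to be reduced to its text pieces.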

KoljaB commented 3 months ago

Try to put tokenizer="stanza" to constructor of TextToAudioStream, in your code you have it in AzureEngine constructor.

user080975 commented 3 months ago

Okay so I manually edited the TextToAudioStream code directly and changed all tokenizer from nltk to stanza and this now fixes the Chinese speech problem. But when outputting English, the exact same issue happens in reverse, all speech is paused on any comma or dot. So currently it's not possible to output in speech any mixed / multilingual content?

I don't think the issue is with the Azure TTS API, since the voice zh-CN-XiaochenMultilingualNeural is able to handle multiple types of languages in the Azure speech studio.

KoljaB commented 3 months ago

There should be no need to change the TextToAudioStream code, you could just create TextToAudioStream with parameter tokenizer="stanza".

Try to set the language of stanza to "multilingual" with the language parameter in TextToAudioStream constructor. Maybe you need to set it to "en" one time before that too, just to download the english model.

So something like:

TextToAudioStream(AzureEngine(speech_key='', service_region='westeurope', voice='zh-CN-XiaochenMultilingualNeural', rate='10'), log_characters=True, tokenizer="stanza", language="multilingual").feed(stream).play(tokenizer="stanza", context_size=2)

user080975 commented 3 months ago

Okay, apparently changing to context_size=30 helped solve the issue. But after switching to "stanza", whenever the code is called there's a 5+ second delay just to reload the required resources, even though they've already been downloaded. Is there any way to cache them or keep them in memory without this loading step each time? I need to combine this with STT later for two-way communication, and this delay sort of breaks it.

I do see the option for this download_method=DownloadMethod.REUSE_RESOURCES but I'm not sure where should I add this parameter.

    Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 4.19MB/s]                                                                                                  
    INFO:stanza:Downloading default packages for language: en (English) ...
    INFO:stanza:File exists: /Users/mac/stanza_resources/en/default.zip
    INFO:stanza:Finished downloading models and saved to /Users/mac/stanza_resources.
    INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
    Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 9.91MB/s]                                                                                                  
    INFO:stanza:Loading these models for language: en (English):
    ======================================
    | Processor    | Package             |
    --------------------------------------
    | tokenize     | combined            |
    | pos          | combined_charlm     |
    | lemma        | combined_nocharlm   |
    | constituency | ptb3-revised_charlm |
    | depparse     | combined_charlm     |
    | sentiment    | sstplus             |
    | ner          | ontonotes_charlm    |
    ======================================

    INFO:stanza:Using device: cpu
    INFO:stanza:Loading: tokenize
    INFO:stanza:Loading: pos
    INFO:stanza:Loading: lemma
    INFO:stanza:Loading: constituency
    INFO:stanza:Loading: depparse
    INFO:stanza:Loading: sentiment
    INFO:stanza:Loading: ner
    INFO:stanza:Done loading processors!

KoljaB commented 3 months ago

No idea about this. This is quite stanza related stuff, I did not work with this tokenizer for a long time. Can't remember major issues. Maybe checking their github helps?

KoljaB commented 3 months ago

How do you feed btw? You should reuse the TextToAudioStream object:

stream = TextToAudioStream(AzureEngine(speech_key='', service_region='westeurope', voice='zh-CN-XiaochenMultilingualNeural', rate='10'), log_characters=True, tokenizer="stanza", language="multilingual")
stream.feed(llm_stream)
stream.play(tokenizer="stanza", context_size=2)

So in practice the delay should only happen once at startup, not every time you use the stream.
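Editor's sketch: one generic way to make sure the expensive object is built only once per process is to hide its construction behind a cached accessor. The names build_stream and get_stream are hypothetical. The stand-in constructor below only counts calls. In real code it would create the TextToAudioStream shown above.

```python
from functools import lru_cache

load_count = 0  # how many times the expensive construction actually ran


def build_stream():
    """Stand-in for the slow TextToAudioStream(...) construction (assumption)."""
    global load_count
    load_count += 1
    return object()


@lru_cache(maxsize=1)
def get_stream():
    """Return the shared stream object, building it on the first call only."""
    return build_stream()


first = get_stream()
second = get_stream()  # served from the cache, no second load
```

Every request handler would then call get_stream() instead of constructing a new stream, so the tokenizer models load on the first call only.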

eraclioone commented 3 months ago

Awesome~

Kaljami commented 3 months ago

Hi! I'm not sure if this is the correct place to ask for this since it's not an issue, but I would like to learn more about how to accomplish streaming the audio to a site or browser plugin with javascript. I would prefer to use coqui as the engine.

I don't really know any JavaScript, but using some code from ChatGPT I have an HTML page with JavaScript that starts audio generation through a WebSocket. I still haven't quite understood what I need to do to stream the audio output back to the browser. Would it be possible for you to help me get started with it? @KoljaB

Here's what I'm currently working with:

import asyncio
import websockets
from RealtimeTTS import TextToAudioStream, CoquiEngine
import logging

async def audio_stream(websocket, path):
    def read_file():
        file = open('text_to_read.txt','r')
        content = file.read()
        yield content

    def chunk_processor(chunk):
        _, _, sample_rate = engine.get_stream_info()

    logging.basicConfig(level=logging.INFO)    
    engine = CoquiEngine(level=logging.INFO)
    stream = TextToAudioStream(engine)

    async for audio_chunk in stream.feed(read_file()).play(log_synthesized_text=True, on_audio_chunk=chunk_processor, muted=True):
        await websocket.send(audio_chunk)

async def start_server():
    async with websockets.serve(audio_stream, "localhost", 8765):
        print("WebSocket server started on ws://localhost:8765")
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(start_server())

Thank you for the great work on this already! The speed of this version compared to standard Coqui is absolutely a game changer!

KoljaB commented 3 months ago

Ok, I'll try to come up with a code example how to do this in the next days.
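Editor's sketch, not the maintainer's promised example: the snippet in the question iterates over the return value of play(), while the approach described at the top of the thread delivers audio through the on_audio_chunk callback. One possible bridge is to run the blocking play call in a worker thread and hand each chunk to the asyncio event loop through a queue. Here relay_chunks and fake_play are hypothetical names. fake_play stands in for a call such as stream.play(on_audio_chunk=..., muted=True) so the snippet runs without a TTS engine, and send stands in for websocket.send.

```python
import asyncio
import threading


def fake_play(on_audio_chunk):
    """Stand-in for the blocking play call; emits three dummy chunks."""
    for i in range(3):
        on_audio_chunk(bytes([i]) * 4)


async def relay_chunks(play, send):
    """Run play() in a thread and forward every chunk to the async send()."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def on_audio_chunk(chunk: bytes) -> None:
        # Called from the worker thread: hand over to the event loop safely.
        loop.call_soon_threadsafe(queue.put_nowait, chunk)

    def worker() -> None:
        try:
            play(on_audio_chunk)
        finally:
            # None marks the end of the stream for the consumer below.
            loop.call_soon_threadsafe(queue.put_nowait, None)

    threading.Thread(target=worker, daemon=True).start()
    while (chunk := await queue.get()) is not None:
        await send(chunk)
```

In a WebSocket handler this would be awaited as relay_chunks(lambda cb: stream.play(on_audio_chunk=cb, muted=True), websocket.send), assuming stream has already been fed. The browser side still has to know the sample rate and format to play the raw PCM.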