KoljaB / RealtimeSTT

A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription.
MIT License

RealtimeSTT

Easy-to-use, low-latency speech-to-text library for realtime applications

New

Custom wake words with OpenWakeWord. Thanks to its developers!

About the Project

RealtimeSTT listens to the microphone and transcribes voice into text.

Hint: Check out Linguflex, the original project from which RealtimeSTT was spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.

It's ideal for:

https://github.com/KoljaB/RealtimeSTT/assets/7604638/207cb9a2-4482-48e7-9d2b-0722c3ee6d14

Updates

Latest Version: v0.2.1

See release history.

Hint: Since we now use the multiprocessing module, be sure to include the if __name__ == '__main__': protection in your code to prevent unexpected behavior, especially on platforms like Windows. For a detailed explanation of why this is important, see the official Python documentation on multiprocessing.

Features

Hint: Check out RealtimeTTS, the output counterpart of this library, for text-to-voice capabilities. Together, they form a powerful realtime audio wrapper around large language models.

Tech Stack

This library uses:

These components represent the "industry standard" for cutting-edge applications, providing the most modern and effective foundation for building high-end solutions.

Installation

pip install RealtimeSTT

This will install all the necessary dependencies, including a CPU-only version of PyTorch.

Although it is possible to run RealtimeSTT with a CPU-only installation (use a small model like "tiny" or "base" in this case), you will get a much better experience using:

GPU Support with CUDA (recommended)

Updating PyTorch for CUDA Support

To upgrade your PyTorch installation to enable GPU support with CUDA, follow these instructions based on your specific CUDA version. This is useful if you wish to enhance the performance of RealtimeSTT with CUDA capabilities.

For CUDA 11.8:

To update PyTorch and Torchaudio to support CUDA 11.8, use the following commands:

pip install torch==2.3.1+cu118 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118

For CUDA 12.X:

To update PyTorch and Torchaudio to support CUDA 12.X, execute the following:

pip install torch==2.3.1+cu121 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121

Replace 2.3.1 with the version of PyTorch that matches your system and requirements.
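
After installing, a quick check can confirm whether PyTorch actually sees the GPU. The sketch below is not part of RealtimeSTT: cuda_status is a hypothetical helper name, and it relies only on standard PyTorch calls (torch.cuda.is_available, torch.version.cuda, torch.cuda.get_device_name).

```python
import importlib.util


def cuda_status():
    """Return a short description of the PyTorch/CUDA setup (hypothetical helper)."""
    # Avoid an ImportError if PyTorch is missing from this environment.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"

    import torch

    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version this PyTorch build targets.
        name = torch.cuda.get_device_name(0)
        return f"CUDA {torch.version.cuda} available on {name}"
    return "PyTorch is installed, but CUDA is not available (CPU only)"


if __name__ == "__main__":
    print(cuda_status())
```

If this reports CPU only after a CUDA install, the installed wheel probably does not match the CUDA version on the system.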

Steps That Might Be Necessary Before

Note: To check if your NVIDIA GPU supports CUDA, visit the official CUDA GPUs list.

If you have not used CUDA models before, some additional one-time steps might be needed before installation. They prepare the system for CUDA support and for the GPU-optimized installation, which is recommended if you need better performance and have a compatible NVIDIA GPU. To use RealtimeSTT with GPU support via CUDA, follow these steps:

  1. Install NVIDIA CUDA Toolkit:

  2. Install NVIDIA cuDNN:

    • select between CUDA 11.8 or CUDA 12.X Toolkit
      • for 12.X visit cuDNN Downloads.
        • Select operating system and version.
        • Download and install the software.
      • for 11.8 visit NVIDIA cuDNN Archive.
        • Click on "Download cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x".
        • Download and install the software.
  3. Install ffmpeg:

    Note: Installing ffmpeg might not actually be needed to operate RealtimeSTT (thanks to jgilbert2017 for pointing this out).

    You can download an installer for your OS from the ffmpeg Website.

    Or use a package manager:

Quick Start

Basic usage:

Manual Recording

Start and stop of recording are manually triggered.

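from RealtimeSTT import AudioToTextRecorder
recorder = AudioToTextRecorder()
recorder.start()
input("Press Enter to stop recording...")
recorder.stop()
print(recorder.text())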

Automatic Recording

Recording based on voice activity detection.

from RealtimeSTT import AudioToTextRecorder

with AudioToTextRecorder() as recorder:
    print(recorder.text())

When running recorder.text in a loop it is recommended to use a callback, allowing the transcription to be run asynchronously:

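from RealtimeSTT import AudioToTextRecorder

def process_text(text):
    print(text)

if __name__ == '__main__':
    recorder = AudioToTextRecorder()
    while True:
        recorder.text(process_text)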

Wakewords

Keyword activation before detecting voice. Write the comma-separated list of your desired activation keywords into the wake_words parameter. You can choose wake words from this list: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator.

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(wake_words="jarvis")

print('Say "Jarvis" then speak.')
print(recorder.text())
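
Since wake_words takes a single comma-separated string, a small helper can catch typos before the recorder starts. This is only a sketch: build_wake_words and SUPPORTED_WAKE_WORDS are hypothetical names, the set simply copies the list given above, and lowercasing the input is an assumption made to match the listed spelling.

```python
# Wake words copied from the list in this README.
SUPPORTED_WAKE_WORDS = frozenset({
    "alexa", "americano", "blueberry", "bumblebee", "computer",
    "grapefruits", "grasshopper", "hey google", "hey siri", "jarvis",
    "ok google", "picovoice", "porcupine", "terminator",
})


def build_wake_words(words):
    """Validate wake words and join them into a comma-separated string."""
    # Normalize spacing and case; skip empty entries.
    cleaned = [w.strip().lower() for w in words if w.strip()]
    if not cleaned:
        raise ValueError("At least one wake word is required")
    unknown = sorted(set(cleaned) - SUPPORTED_WAKE_WORDS)
    if unknown:
        raise ValueError(f"Unsupported wake word(s): {', '.join(unknown)}")
    return ",".join(cleaned)
```

The result can then be passed along as AudioToTextRecorder(wake_words=build_wake_words(["jarvis", "computer"])).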

Callbacks

You can set callback functions to be executed on different events (see Configuration):

from RealtimeSTT import AudioToTextRecorder

def my_start_callback():
    print("Recording started!")

def my_stop_callback():
    print("Recording stopped!")

recorder = AudioToTextRecorder(on_recording_start=my_start_callback,
                               on_recording_stop=my_stop_callback)
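
Callbacks do not have to be plain functions; bound methods work as well, which makes it easy to keep state between events. The sketch below measures how long each recording lasts. RecordingTimer is a hypothetical helper, and it assumes the callbacks are invoked without arguments, as in the example above.

```python
import time


class RecordingTimer:
    """Measure how long each recording lasts (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so the class can be tested
        self._started_at = None
        self.durations = []          # seconds, one entry per finished recording

    def on_start(self):
        self._started_at = self._clock()

    def on_stop(self):
        # Ignore a stop event that arrives without a matching start.
        if self._started_at is None:
            return
        self.durations.append(self._clock() - self._started_at)
        self._started_at = None
```

Usage would look like timer = RecordingTimer() followed by AudioToTextRecorder(on_recording_start=timer.on_start, on_recording_stop=timer.on_stop).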

Feed chunks

If you don't want to use the local microphone, set the use_microphone parameter to False and provide raw PCM audio chunks in 16-bit mono (sample rate 16000 Hz) with this method:

recorder.feed_audio(audio_chunk)
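
As an illustration, the chunks could come from a WAV file that is already in the expected format. The sketch below uses only the standard wave module. iter_pcm_chunks and feed_wav are hypothetical helper names, the chunk size of 1024 frames is an arbitrary choice, and no resampling is attempted: files in any other format are rejected.

```python
import wave


def iter_pcm_chunks(wav_path, frames_per_chunk=1024):
    """Yield raw PCM chunks from a 16-bit mono 16 kHz WAV file."""
    with wave.open(wav_path, "rb") as wav:
        fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        # 1 channel, 2 bytes per sample, 16000 samples per second.
        if fmt != (1, 2, 16000):
            raise ValueError("Expected 16-bit mono audio at 16000 Hz")
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk


def feed_wav(recorder, wav_path, frames_per_chunk=1024):
    """Pass every chunk to recorder.feed_audio and return the chunk count."""
    count = 0
    for chunk in iter_pcm_chunks(wav_path, frames_per_chunk):
        recorder.feed_audio(chunk)
        count += 1
    return count
```

Here recorder is expected to be an AudioToTextRecorder created with use_microphone=False; any object with a feed_audio method works for testing.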

Shutdown

You can shut down the recorder safely by using the context manager protocol:

with AudioToTextRecorder() as recorder:
    [...]

Or you can call the shutdown method manually (if using "with" is not feasible):

recorder.shutdown()
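
When shutdown is called manually, wrapping the work in try/finally helps make sure it still runs if something fails in between. A minimal sketch, where transcribe_then_shutdown is a hypothetical helper and recorder is any object offering the text and shutdown methods shown above:

```python
def transcribe_then_shutdown(recorder):
    """Return one transcription and always shut the recorder down afterwards."""
    try:
        return recorder.text()
    finally:
        # Runs whether text() returned normally or raised an exception.
        recorder.shutdown()
```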

Testing the Library

The test subdirectory contains a set of scripts to help you evaluate and understand the capabilities of the RealtimeSTT library.

Test scripts that depend on the RealtimeTTS library may require you to enter your Azure service region within the script. When using OpenAI-, Azure-, or ElevenLabs-related demo scripts, provide the API keys in the environment variables OPENAI_API_KEY, AZURE_SPEECH_KEY and ELEVENLABS_API_KEY (see RealtimeTTS).
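
Before launching one of these demo scripts, it can help to check that the relevant variables are actually set. The sketch below is not part of the library: missing_api_keys is a hypothetical helper, and the variable names are the three listed above. Not every script needs all three, so pass only the names that apply.

```python
import os

API_KEY_NAMES = ("OPENAI_API_KEY", "AZURE_SPEECH_KEY", "ELEVENLABS_API_KEY")


def missing_api_keys(names=API_KEY_NAMES, environ=None):
    """Return the names of variables that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All API keys are set.")
```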

The example_app subdirectory contains a polished user interface application for the OpenAI API based on PyQt5.

Configuration

Initialization Parameters for AudioToTextRecorder

When you initialize the AudioToTextRecorder class, you have various options to customize its behavior.

General Parameters

Real-time Transcription Parameters

Note: When enabling realtime transcription, a GPU installation is strongly advised. Using realtime transcription may create high GPU loads.

Voice Activation Parameters

Wake Word Parameters

OpenWakeWord

Training models

Look here for information about how to train your own OpenWakeWord models. You can use a simple Google Colab notebook for a start or use a more detailed notebook that enables more customization (can produce high quality models, but requires more development experience).

Convert model to ONNX format

You might need to use tf2onnx to convert TensorFlow Lite (tflite) models to ONNX format:

pip install -U tf2onnx
python -m tf2onnx.convert --tflite my_model_filename.tflite --output my_model_filename.onnx
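
With several models to convert, the same command can be generated per file. The sketch below only builds the argument lists and leaves execution to the caller. tf2onnx_commands is a hypothetical helper, and writing each .onnx file next to its .tflite source is an assumption.

```python
import sys
from pathlib import Path


def tf2onnx_commands(model_dir):
    """Build one tf2onnx command per .tflite file found in model_dir."""
    commands = []
    for tflite in sorted(Path(model_dir).glob("*.tflite")):
        onnx = tflite.with_suffix(".onnx")  # same name, .onnx extension
        commands.append([
            sys.executable, "-m", "tf2onnx.convert",
            "--tflite", str(tflite),
            "--output", str(onnx),
        ])
    return commands
```

Each command can then be run with subprocess.run(cmd, check=True), provided tf2onnx is installed in the same environment.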

Configure RealtimeSTT

Suggested starting parameters for OpenWakeWord usage:

    from RealtimeSTT import AudioToTextRecorder

    with AudioToTextRecorder(
        wakeword_backend="oww",
        wake_words_sensitivity=0.35,
        openwakeword_model_paths="word1.onnx,word2.onnx",
        wake_word_buffer_duration=1,
    ) as recorder:
        print(recorder.text())
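
Because openwakeword_model_paths is a single comma-separated string, it may be convenient to build the arguments from a list of files. The following is a sketch under stated assumptions: oww_recorder_kwargs is a hypothetical helper, the defaults repeat the suggested starting values above, and the 0 to 1 range check on the sensitivity is an assumption rather than something this section confirms.

```python
def oww_recorder_kwargs(model_paths, sensitivity=0.35, buffer_duration=1):
    """Build keyword arguments for OpenWakeWord usage from a list of model paths."""
    paths = [str(p).strip() for p in model_paths]
    if not paths or any(not p for p in paths):
        raise ValueError("At least one non-empty model path is required")
    # A comma inside a path would be misread as a separator.
    if any("," in p for p in paths):
        raise ValueError("Model paths must not contain commas")
    # Assumption: sensitivity is a value between 0 and 1.
    if not 0 <= sensitivity <= 1:
        raise ValueError("sensitivity is assumed to lie between 0 and 1")
    return {
        "wakeword_backend": "oww",
        "wake_words_sensitivity": sensitivity,
        "openwakeword_model_paths": ",".join(paths),
        "wake_word_buffer_duration": buffer_duration,
    }
```

The dictionary can be unpacked into the constructor, for example AudioToTextRecorder(**oww_recorder_kwargs(["word1.onnx", "word2.onnx"])).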

Contribution

Contributions are always welcome!

Shoutout to Steven Linn for providing docker support.

License

MIT

Author

Kolja Beigel
Email: kolja.beigel@web.de
GitHub