ILikeAI / AlwaysReddy

AlwaysReddy is an LLM voice assistant that is always just a hotkey away.
MIT License

Add AllTalk TTS and whisperX #12

Open kaminoer opened 4 months ago

kaminoer commented 4 months ago

Hey, I've introduced the following two modifications for my own use and figured you might want to take a look and see if it's something you'd like to implement. This is pretty crude and needs some refinement for sure, but it works. The following code is a drop-in replacement (you will probably want to add relevant config.py settings). The first snippet is for whisperX, the second one adds AllTalk TTS support. AllTalk TTS is a little more demanding than Piper but offers much better voice quality. WhisperX lets you run this app 100% offline. With 12 GB of VRAM I'm running the tiny whisper model, a 7B/8B LLM (currently testing wizardlm2 and llama3 via Ollama), and my custom AllTalk model.

import whisperx as wx
from pydub import AudioSegment
import os
from dotenv import load_dotenv
from config import AUDIO_FILE_DIR
import gc

# Load .env file if present
load_dotenv()
device = "cuda"
batch_size = 16
compute_type = "int8"
model_dir = "C:\\test"
language = "en"
model = wx.load_model("tiny", device, language=language, compute_type=compute_type, download_root=model_dir)

def transcribe_audio(file_path):
    try:
        audio = AudioSegment.from_file(f"{AUDIO_FILE_DIR}/{file_path}")
        chunk_size = 10 * 60 * 1000  # split into 10-minute chunks (pydub works in milliseconds)
        num_chunks = len(audio) // chunk_size + (1 if len(audio) % chunk_size else 0)
        transcript = ""
        file_size = os.path.getsize(f"{AUDIO_FILE_DIR}/{file_path}")

        if file_size <= 24 * 1024 * 1024:  # files up to 24 MB are transcribed in one pass
            result = model.transcribe(f"{AUDIO_FILE_DIR}/{file_path}", batch_size=batch_size)
            for segment in result['segments']:
                transcript += segment['text'] + " "
        else:  # larger files are split into chunks and transcribed one at a time
            for i in range(num_chunks):
                temp_chunk_path = f"{AUDIO_FILE_DIR}/temp_chunk.mp3"
                chunk = audio[i*chunk_size:(i+1)*chunk_size]
                with open(temp_chunk_path, 'wb') as f:
                    chunk.export(f, format="mp3")
                try:
                    result = model.transcribe(temp_chunk_path, batch_size=batch_size)
                    for segment in result['segments']:
                        transcript += segment['text'] + " "
                finally:
                    os.remove(temp_chunk_path)

        os.remove(f"{AUDIO_FILE_DIR}/{file_path}")

        return transcript

    except FileNotFoundError as e:
        raise FileNotFoundError(f"The audio file {file_path} was not found.") from e
    except Exception as e:
        raise Exception(f"An error occurred during the transcription process: {e}") from e

def cleanup_model():
    """Unload the whisper model so it no longer occupies VRAM (call on shutdown)."""
    global model
    del model
    gc.collect()

import os
import soundfile as sf
import sounddevice as sd
from openai import OpenAI
from dotenv import load_dotenv
import subprocess
import threading
import queue
import config
import tempfile
import utils
import requests
import shutil

...
...

def TTS_Alltalk(self, text_to_speak, output_file):
    # Sanitize the input text by removing unsuitable characters
    text_to_speak = utils.sanitize_text(text_to_speak)

    # If there is no text left after sanitization, return "failed"
    if not text_to_speak.strip():
        return "failed"
    try:
        # Define the AllTalk API endpoint
        api_url = "http://127.0.0.1:7851/api/tts-generate"

        # Prepare the data payload for the POST request
        data = {
            "text_input": text_to_speak,
            "text_filtering": "none",
            "character_voice_gen": "female_03.wav",
            "narrator_enabled": "false",
            "narrator_voice_gen": "arnold.wav",
            "text_not_inside": "character",
            "language": "en",
            "output_file_name": "output",
            "output_file_timestamp": "true",
            "autoplay": "false",
            "autoplay_volume": "0.8"
        }
        response = requests.post(api_url, data=data)
        print(response.content)
        response.raise_for_status()
        response_data = response.json()

        # Only report success if AllTalk actually generated the audio
        if response_data["status"] != "generate-success":
            return "failed"

        # Copy the generated wav from the AllTalk output path to the desired output file
        local_audio_path = response_data["output_file_path"]
        shutil.copyfile(local_audio_path, output_file)
        return "success"
    except requests.RequestException as e:
        print(f"Error calling TTS API: {e}")
        return "failed"

The latter snippet is not really an efficient solution, as there is no need to copy the AllTalk-generated wavs over to the AlwaysReddy audio_files directory. It would make more sense to change AUDIO_FILE_DIR in config.py to point to the AllTalk output folder, or to change the output directory in AllTalk to point to AUDIO_FILE_DIR. If you think this may come in handy in any way, please feel free to use this code as you see fit.
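
For example, a minimal sketch of that config.py tweak could look like the line below (the AllTalk output path is an illustrative placeholder, not a real default):

# config.py: point AlwaysReddy's audio directory at AllTalk's output folder
# so the generated wavs can be played directly without copying
AUDIO_FILE_DIR = "C:/alltalk_tts/outputs"  # adjust to wherever your AllTalk install writes its wavs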

ILikeAI commented 4 months ago

This is absolutely awesome! Thanks so much, I'll get to work implementing this ASAP 🙇‍♂ļø

kaminoer commented 4 months ago

Couple more thoughts:

  1. I'm unsure if local whisper has any limitations on the size of the input audio file (most likely it doesn't), so chunking files larger than 24 MB may no longer be necessary here.
  2. The cleanup_model function I added is useful to purge the whisper model from VRAM. I call it from main.py upon KeyboardInterrupt to make sure the model is unloaded and doesn't occupy VRAM unnecessarily after exiting AlwaysReddy (see the sketch after this list).
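
A minimal sketch of that wiring, assuming the whisperX snippet above lives in a module named transcriber (the module and entry-point names here are illustrative, not AlwaysReddy's actual ones):

import transcriber  # module containing the whisperX code and cleanup_model() above

try:
    run_always_reddy()  # placeholder for the app's main loop
except KeyboardInterrupt:
    transcriber.cleanup_model()  # purge the whisper model from VRAM before exiting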

ILikeAI commented 4 months ago

I just pushed an updated version with local whisper, thanks again for the help! No AllTalk integration yet though.

To your points:

  1. I thought the same thing, so I removed the chunking logic from the local whisper code.
  2. I didn't think about this; the version I pushed has the purge code removed. It's a tricky trade-off: I don't want to load the model on each request if it adds latency, so I might need to test more to work out what's best. Possibly I could keep the model in memory for N minutes after the last usage before clearing it?

Thanks again, super appreciate your help!

nullnuller commented 4 months ago

I just pushed an updated version with local whisper, thanks again for the help! No AllTalk integration yet though.

Hi, I just pulled but can't find the config option for the local whisper.

### VOICE SETTINGS ###
PIPER_VOICE_JSON="en_en_US_amy_medium_en_US-amy-medium.onnx.json" #These are located in the piper_voices folder
PIPER_VOICE_ONNX="en_US-amy-medium.onnx"
TTS_ENGINE="piper" # 'piper' or 'openai' piper is local and fast but openai is better sounding
OPENAI_VOICE = "nova"

Also, it's saying:

Failed to start the recorder: module 'config' has no attribute 'TRANSCRIPTION_API'

kaminoer commented 4 months ago

I just pushed an updated version with local whisper, thanks again for the help! No AllTalk integration yet though.

Hi, I just pulled but can't find the config option for the local whisper.

### VOICE SETTINGS ###
PIPER_VOICE_JSON="en_en_US_amy_medium_en_US-amy-medium.onnx.json" #These are located in the piper_voices folder
PIPER_VOICE_ONNX="en_US-amy-medium.onnx"
TTS_ENGINE="piper" # 'piper' or 'openai' piper is local and fast but openai is better sounding
OPENAI_VOICE = "nova"

Also, it's saying:

Failed to start the recorder: module 'config' has no attribute 'TRANSCRIPTION_API'

Edit your config file and uncomment the following (and remove/comment out the TRANSCRIPTION_API = "openai" line so as not to duplicate):

### Transcription API Settings ###

## Whisper X local transcription API EXAMPLE ##
TRANSCRIPTION_API = "whisperx" #local transcription!
WHISPER_MODEL = "tiny" # (tiny, base, small, medium, large) Turn this up to "base" if the transcription is too bad
TRANSCRIPTION_LANGUAGE = "en" 
WHISPER_BATCH_SIZE = 16
WHISPER_MODEL_PATH = None # you can point this to an existing model or leave it set to None

edit: Ah, I think I know what happened. You are probably still using your old config file. Go to the example config file, copy over the content to your config file, and apply the above edits.

kaminoer commented 4 months ago

I just pushed an updated version with local whisper, thanks again for the help! No AllTalk integration yet though.

To your points:

1. I thought the same thing, so I removed the chunking logic from the local whisper code.

2. I didn't think about this; the version I pushed has the purge code removed. It's a tricky trade-off: I don't want to load the model on each request if it adds latency, so I might need to test more to work out what's best. Possibly I could keep the model in memory for N minutes after the last usage before clearing it?

Thanks again, super appreciate your help!

Makes sense to set up some sort of TTL for the whisper model! Maybe even make it configurable in config.py?
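
A rough sketch of what a configurable TTL could look like, assuming a hypothetical WHISPER_MODEL_TTL setting in config.py and reusing the cleanup_model() function from the snippet above (none of this is in AlwaysReddy yet):

import threading
import config  # assumes a WHISPER_MODEL_TTL value in seconds has been added to config.py

_unload_timer = None

def schedule_model_unload():
    # Restart the countdown after every transcription; when it fires, the model is purged
    # from VRAM via cleanup_model() and can be reloaded lazily on the next request.
    global _unload_timer
    if _unload_timer is not None:
        _unload_timer.cancel()
    _unload_timer = threading.Timer(config.WHISPER_MODEL_TTL, cleanup_model)
    _unload_timer.daemon = True
    _unload_timer.start()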

ILikeAI commented 4 months ago

Yeah, it's kind of a pain that you need to re-copy the config file whenever I add something new to the config example. Ideally I could push new features to the config without you having to copy over the new config entries each time; not sure how best to do that though.

Yeah, I'll add TTL to the to-do list. At the end of the day it could be optional, but it would be good to be able to do.

zheroz00 commented 3 months ago

Anyone able to ELI5 this setup for me? I have the OG package set up, but I'm interested in trying WhisperX. Thanks.

ILikeAI commented 3 months ago

I have a bunch of changes in the works; hopefully by this time tomorrow they will all be merged. There will be a simplified setup process too.

In the next couple of days I'll also make a new video on how to set it up on Windows and Linux.

POD319 commented 3 months ago

Is the local whisper option only able to use the embedded whisperx model, or is there any way we can point to our own whisper integration? I have a whisper server that I'd love to use with this.

ILikeAI commented 3 months ago

I've actually recently swapped out whisperX in favor of faster-whisper, which is much more lightweight and seems to have fewer dependency issues.

So is your whisper server callable through a network API? Could you give an example of how you would call it via code? I refactored the transcription system, so it shouldn't be much work to set it up with a new transcription system.

POD319 commented 3 months ago

I've actually recently swapped out whisperX in favor of faster-whisper, which is much more lightweight and seems to have fewer dependency issues.

So is your whisper server callable through a network API? Could you give an example of how you would call it via code? I refactored the transcription system, so it shouldn't be much work to set it up with a new transcription system.

Ah nice, yes, it is very similar to faster-whisper. In fact I believe that was built off of this one, which is just the original (https://github.com/openai/whisper). It supports calls via web API, and also passing a .wav to the .exe, which I think is what faster-whisper does.
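
As a purely illustrative sketch (the URL, port, endpoint, and response format below are assumptions about a generic local whisper server, not something AlwaysReddy or faster-whisper ships with), such a call could look like:

import requests

def transcribe_via_server(wav_path, url="http://127.0.0.1:9000/inference"):
    # POST the wav to the local whisper server and pull the transcript out of the JSON reply
    with open(wav_path, "rb") as f:
        response = requests.post(url, files={"file": f})
    response.raise_for_status()
    return response.json().get("text", "")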

Honestly, the only reason I wanted to use the local whisper is so I can point to my own model that I already have. Not that the whisper models are all that big, but I'm just hitting the point where I have soooo many AI models, and any application that can point to an existing model is a wonderful thing.

ILikeAI commented 3 months ago

Good point. I'm not as familiar with faster-whisper, but I'm sure there would be a way to optionally point to an existing model; I'll add it to the to-do list and see if I can find a way to set that up.
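
For reference, a minimal sketch of pointing faster-whisper at an existing model (the path below is an example; faster-whisper accepts either a model size name or a local directory containing a CTranslate2-converted model):

from faster_whisper import WhisperModel

# Load from a local model directory instead of downloading by size name
model = WhisperModel("C:/models/faster-whisper-tiny", device="cuda", compute_type="int8")
segments, info = model.transcribe("audio.wav")
text = " ".join(segment.text for segment in segments)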

Also, side note: whisperX seems to have some dependencies that give faster-whisper buggy results, so when you swap to faster-whisper you will want to delete your venv and run through the setup again.

Jake36921 commented 3 months ago

How do you edit the config to use the AllTalk TTS? I'm a bit lost on what 'relevant' settings I need to add.

kaminoer commented 3 months ago

I don't think the AllTalk TTS code got added, so it won't be enough to edit the config. Either wait for support to be added or drop in my code as a replacement for one of the existing TTS APIs.
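
If you do drop in the code, the 'relevant' config.py settings would be whatever the pasted function reads. As a purely hypothetical example (none of these names exist in AlwaysReddy; the values just mirror the defaults hardcoded in the snippet above):

# Hypothetical AllTalk settings for config.py; illustrative names only
TTS_ENGINE = "alltalk"
ALLTALK_API_URL = "http://127.0.0.1:7851/api/tts-generate"
ALLTALK_CHARACTER_VOICE = "female_03.wav"
ALLTALK_LANGUAGE = "en"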