CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

How to implement the Real-Time-Voice-Cloning in other scripts? #890

Closed Andredenise closed 2 years ago

Andredenise commented 3 years ago

This is a lovely program, but I'm searching for a way to use the voice cloning in my other script. It should run automatically without using the toolbox, so that it can be used as a speech assistant that sounds the way I want. Is there a way to do this? Thanks!

ghost commented 3 years ago

This is what you are looking for: demo_cli.py

Andredenise commented 3 years ago

Will running this in another script automatically start the real-time cloning, without any inputs to give?

ghost commented 3 years ago

No, it will require modifications. Read the comments and adapt it to suit your needs.

Andredenise commented 3 years ago

okay thanks!

Andredenise commented 3 years ago

Is there someone who is able to do these modifications? I think this is out of my league.

ghost commented 3 years ago

Can you describe how you would want it to work? We can try to improve the interface.

For example:

import sounddevice as sd
from real_time_voice_cloning import RTVC

voice = RTVC() #performs initialization using config file
voice.clone("target_voice.wav")
audio = voice.text_to_speech(["Hello, this is your voice assistant.", "What can I do for you today?"])
sd.play(audio)

Andredenise commented 3 years ago

My project is actually a kind of speech assistant based on the OpenAI GPT-2 model, with which you can have a conversation about a specific topic. The problem for me is that the OpenAI script for the model is not text-to-speech software, so it responds with text. My idea was to combine this script with the model so that you can actually have a conversation "with yourself" thanks to the cloning tool. So the Real-Time-Voice-Cloning tool would be my way of outputting the results of the OpenAI model. I hope this is somewhat clear?

ghost commented 3 years ago

I think what you want is:

  1. Record the user with the computer's microphone
  2. Use automatic speech recognition to determine what the user said
  3. Input to GPT-2 and get a response (text)
  4. Provide text and voice recording to Real-Time-Voice-Cloning to generate audio in the user's voice and play it on speakers or headphones

My questions are:

Andredenise commented 3 years ago

First of all, my apologies for the lack of clarity about my project. I am a master's student in graphic design at LUCA School of Arts in Ghent, Belgium. This project is actually my thesis and therefore not a paying project, but it is very important to me. The 4 steps are exactly what I want to achieve. At the moment this is a private project, but if it works, it can certainly become open source. Concerning a GUI or web interface, that is definitely not a priority for me; if the project works, I am fine with running it in the prompt, as the focus is on speaking and conversation. Thank you very much!

ghost commented 3 years ago

Here is a version of demo_cli.py that doesn’t require any user interaction. https://github.com/blue-fish/Real-Time-Voice-Cloning/blob/890_non_interactive_generation/demo_cli.py

You would run it like this. gpt_output.txt can have multiple lines. It is recommended to add a line break after each sentence, because the synthesizer struggles when the inputs become too long.

python demo_cli.py --skip_tests --no_save original_voice.wav gpt_output.txt

If the rest of your project uses Python, it is possible to use parts of demo_cli.py to do the generation without having to use Real-Time-Voice-Cloning separately like this. However, that is something you will have to figure out on your own. Good luck with your project!
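For orientation, the generation parts of demo_cli.py reduce to a handful of calls into the encoder, synthesizer and vocoder packages. A minimal sketch of using them from your own script could look like this; the model paths are assumptions, so point them at wherever your pretrained models are stored:

from pathlib import Path

import librosa
import sounddevice as sd

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load the three pretrained models once at startup (paths are assumptions)
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained/pretrained.pt"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

# Make a speaker embedding from the reference recording
original_wav, sampling_rate = librosa.load("original_voice.wav")
preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
embed = encoder.embed_utterance(preprocessed_wav)

# Synthesize each line of the GPT output in the cloned voice and play it back
texts = [line for line in open("gpt_output.txt").read().splitlines() if line.strip()]
specs = synthesizer.synthesize_spectrograms(texts, [embed] * len(texts))
for spec in specs:
    generated_wav = vocoder.infer_waveform(spec)
    sd.play(generated_wav, synthesizer.sample_rate, blocking=True)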

Andredenise commented 3 years ago

It's still a bit unclear to me. Below you will find the script for the interactive version of the GPT-2 model that I use. How do I integrate the version you sent into it?

import fire
import json
import os
import numpy as np
import tensorflow as tf

import model, sample, encoder

def interact_model(
    model_name='124M',
    seed=None,
    nsamples=1,
    batch_size=1,
    length=None,
    temperature=1,
    top_k=0,
    top_p=1,
    models_dir='models',
):
    """
    Interactively run the model
    :model_name=124M : String, which model to use
    :seed=None : Integer seed for random number generators, fix seed to reproduce results
    :nsamples=1 : Number of samples to return total
    :batch_size=1 : Number of batches (only affects speed/memory). Must divide nsamples.
    :length=None : Number of tokens in generated text, if None (default), is determined by model hyperparameters
    :temperature=1 : Float value controlling randomness in boltzmann distribution. Lower temperature results in less random completions. As the temperature approaches zero, the model will become deterministic and repetitive. Higher temperature results in more random completions.
    :top_k=0 : Integer value controlling diversity. 1 means only 1 word is considered for each step (token), resulting in deterministic completions, while 40 means 40 words are considered at each step. 0 (default) is a special setting meaning no restrictions. 40 generally is a good value.
    :models_dir : path to parent folder containing model subfolders (i.e. contains the <model_name> folder)
    """
    models_dir = os.path.expanduser(os.path.expandvars(models_dir))
    if batch_size is None:
        batch_size = 1
    assert nsamples % batch_size == 0

    enc = encoder.get_encoder(model_name, models_dir)
    hparams = model.default_hparams()
    with open(os.path.join(models_dir, model_name, 'hparams.json')) as f:
        hparams.override_from_dict(json.load(f))

    if length is None:
        length = hparams.n_ctx // 2
    elif length > hparams.n_ctx:
        raise ValueError("Can't get samples longer than window size: %s" % hparams.n_ctx)

    with tf.Session(graph=tf.Graph()) as sess:
        context = tf.placeholder(tf.int32, [batch_size, None])
        np.random.seed(seed)
        tf.set_random_seed(seed)
        output = sample.sample_sequence(
            hparams=hparams, length=length,
            context=context,
            batch_size=batch_size,
            temperature=temperature, top_k=top_k, top_p=top_p
        )

        saver = tf.train.Saver()
        ckpt = tf.train.latest_checkpoint(os.path.join(models_dir, model_name))
        saver.restore(sess, ckpt)

        while True:
            raw_text = input("Model prompt >>> ")
            while not raw_text:
                print('Prompt should not be empty!')
                raw_text = input("Model prompt >>> ")
            context_tokens = enc.encode(raw_text)
            generated = 0
            for _ in range(nsamples // batch_size):
                out = sess.run(output, feed_dict={
                    context: [context_tokens for _ in range(batch_size)]
                })[:, len(context_tokens):]
                for i in range(batch_size):
                    generated += 1
                    text = enc.decode(out[i])
                    print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
                    print(text)
            print("=" * 80)

if __name__ == '__main__':
    fire.Fire(interact_model)

ghost commented 3 years ago

Integrate this code with your script instead. It assumes you have the Real-Time-Voice-Cloning repository at the same level as your script.

https://gist.github.com/blue-fish/ecabbca4f1a69701d32852f9f446c077
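Since your own script lives next to the repository rather than inside it, Python also needs to be told where to find the Real-Time-Voice-Cloning packages before the imports will work. One way to do that, as a sketch (the directory name is an assumption; match it to your checkout):

import sys
from pathlib import Path

# Make the Real-Time-Voice-Cloning repo importable from a sibling script
# (directory name is an assumption; adjust it to match your checkout)
sys.path.insert(0, str(Path(__file__).parent / "Real-Time-Voice-Cloning"))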

Andredenise commented 3 years ago

I've only been working with Python for a few months, so I'm not very good at programming yet. You had previously given a good summary of what the program should look like as a whole, namely:

  1. Record the user with the computer's microphone
  2. Use automatic speech recognition to determine what the user said
  3. Input to GPT-2 and get a response (text)
  4. Provide text and voice recording to Real-Time-Voice-Cloning to generate audio in the user's voice and play it on speakers or headphones

Is it possible to give me an overview of how demo_cli.py (the new version you sent) and demo_rtvc.py should now work with the GPT-2 model? What should the structure look like as a whole? My apologies for the many questions.

ghost commented 3 years ago

Is it possible to give me an overview of how demo_cli.py (the new version you sent) and demo_rtvc.py should now work with the GPT-2 model?

The code in demo_rtvc.py is structured like this:

## Section 1: Setup
# Imports
# Load models
# Use encoder to make speaker embedding

## Section 2: Generation
# Use synthesizer to make mel spectrograms from text
# Use vocoder to make waveform audio from mels
# Play waveform audio

You put Section 1 into the beginning of your script.

You have a while loop which uses GPT to make a variable called text. This needs to be split into a list of sentences called texts. After that, you can insert the code from Section 2 to generate and play back the audio.
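Put together, the generation loop could look roughly like the sketch below. It assumes synthesizer, vocoder and embed already exist from Section 1, and generate_gpt_response is a placeholder for your existing GPT-2 code:

import re
import sounddevice as sd

while True:
    # Your GPT-2 code produces a response in the variable `text`
    text = generate_gpt_response()  # placeholder for your existing GPT-2 code

    # Split the response into a list of sentences called `texts`
    # (a crude split on sentence-ending punctuation; refine as needed)
    texts = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    ## Section 2: Generation
    # Use synthesizer to make mel spectrograms from text
    specs = synthesizer.synthesize_spectrograms(texts, [embed] * len(texts))
    for spec in specs:
        # Use vocoder to make waveform audio from mels
        generated_wav = vocoder.infer_waveform(spec)
        # Play waveform audio
        sd.play(generated_wav, synthesizer.sample_rate, blocking=True)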

What should the structure look like as a whole?

To build the program I outlined, you have to figure out what each of the sections is doing. This will help you put the parts together. (A rough skeleton follows the outline below.)

## Setup
    # Put all the imports here.

## 1. Record the user with the computer's microphone
    # Inputs: None
    # Outputs: An audio waveform (as a numpy array)
    #
    # The code needs to initialize the input audio device and capture the audio.
    # This is how it is done in the toolbox:
    # https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/toolbox/ui.py#L219

## 2. Use automatic speech recognition to determine what the user said
    # Inputs: Audio waveform from step 1
    # Outputs: Text
    #
    # First, the code needs to set up your ASR package.
    # Next, pass the audio to ASR to get the user's transcribed speech.

## 3. Input to GPT-2 and get a response (text)
    # Inputs: Text (transcribed speech from user in step 2)
    # Outputs: Text
    #
    # Your script already does this.

## 4. Provide text and voice recording to Real-Time-Voice-Cloning to generate audio in the user's voice and play it on speakers or headphones
    # Inputs:  Audio waveform from step 1
    #.         Text (GPT output from step 3)
    #
    # Outputs: None
    #
    # Most of these functions are demonstrated in demo_rtvc.py
    # a. This section first creates a speaker embedding from the recording in step 1.
    # b. You may also want to break up the GPT text into individual sentences.
    # c. Input the embedding and text to the synthesizer, to get a mel spectrogram.
    # d. Give the mel spectrogram to the vocoder to get an audio waveform (as a numpy array)
    # e. Play back the audio through the computer's output sound device.
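
Putting the outline together, a very rough skeleton of the whole program could look like this. Every function name below is a placeholder for code you still have to write or adapt, not an existing API:

## Setup
# Imports, model loading, ASR setup, and GPT-2 session setup go here.

while True:
    ## 1. Record the user with the computer's microphone
    user_wav = record_from_microphone()              # placeholder, returns a numpy waveform

    ## 2. Use automatic speech recognition to determine what the user said
    user_text = speech_to_text(user_wav)             # placeholder, returns a string

    ## 3. Input to GPT-2 and get a response (text)
    reply_text = generate_gpt_response(user_text)    # placeholder for your existing GPT-2 code

    ## 4. Generate audio in the user's voice with Real-Time-Voice-Cloning and play it back
    speak_in_cloned_voice(reply_text, user_wav)      # placeholder, built from demo_rtvc.py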

My apologies for the many questions.

No need to apologize, but please understand that it takes away from my primary focus of improving the repo's codebase and model quality. I normally don't answer questions or provide help with personal projects, because I like to work on things that help all users. In this case, I made an exception because I remember what it is like to be new to programming and know that having a little guidance can make an impact.

Take some time to consider what I have provided here, and try to start building the program. Since I provided demo_rtvc.py, you can start at the end and work backwards. The development steps could look like this.

  1. Run demo_rtvc.py in Python to make sure it is set up properly.
  2. Add your GPT code, so the user will type in a prompt, and get an audio response played back through the computer speakers or headphones.
  3. Find an ASR package. Do not use the microphone yet. Instead, load a prerecorded wav file and use that to generate the text prompt for GPT (see the sketch after this list).
  4. Finally, record the user's voice using the microphone. Use the recording in place of the prerecorded file for ASR. At this point you will be done!
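
For step 3, if you use the speech_recognition package, transcribing a prerecorded wav file only takes a few lines. A sketch, where the file name and the use of Google's free web API are assumptions:

import speech_recognition as sr

r = sr.Recognizer()
# Load a prerecorded wav file instead of using the microphone for now
with sr.AudioFile("recorded_voice.wav") as source:  # file name is an assumption
    audio = r.record(source)  # read the entire file into an AudioData object

try:
    prompt = r.recognize_google(audio)  # Google Web Speech API, requires internet
    print("You said:", prompt)
except sr.UnknownValueError:
    print("I did not get that")
except sr.RequestError:
    print("Sorry, the service is down")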

Please try to make use of other support channels (Stack Overflow, etc.) whenever possible. And consider releasing your project as open source when it is done.

Andredenise commented 3 years ago

I cannot explain how much (thinking) work this saves me. This is very useful and clear! Thank you so much for all your help!! I am always willing to share my final result if it would be interesting and useful.

lavarith commented 3 years ago

Just wanted to +100 that this discussion has been a very useful read for me and my project as well. Thanks @blue-fish !

Andredenise commented 3 years ago

Hey,

I still have one structural problem with my project that I cannot solve. For the voice cloner to work, I have to provide a prerecorded wav sample in this code:

wav_path = "samples/p240_00000.mp3" original_wav, sampling_rate = librosa.load(wav_path) preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate) embed = encoder.embed_utterance(preprocessed_wav) print("Created the embedding")

Currently my code looks like this, but it doesn't work.

import speech_recognition as sr
import sounddevice as sd
import numpy as np
import os
from scipy.io.wavfile import write
from time import sleep
import soundfile as sf
import matplotlib.pyplot as plt
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.figure import Figure
from PyQt5.QtCore import Qt, QStringListModel
from PyQt5.QtWidgets import *
from pathlib import Path
from typing import List, Set
import umap
import sys
import pyaudio
import wave
import librosa
import argparse
import torch
from audioread.exceptions import NoBackendError

colormap = np.array([
    [0, 127, 70], [255, 0, 0], [255, 217, 38], [0, 135, 255], [165, 0, 165],
    [255, 167, 255], [97, 142, 151], [0, 255, 255], [255, 96, 38], [142, 76, 0],
    [33, 0, 127], [0, 0, 0], [183, 183, 183], [76, 255, 0],
], dtype=np.float) / 255

def record_one(self, sample_rate, duration):
    self.log("Recording %d seconds of audio" % duration)
    sd.stop()
    try:
        wav = sd.rec(duration * sample_rate, sample_rate, 1)
    except Exception as e:
        print(e)
        self.log("Could not record anything. Is your recording device enabled?")
        return None

    for i in np.arange(0, duration, 0.1):
        self.set_loading(i, duration)
        sleep(0.1)
    self.set_loading(duration, duration)
    sd.wait()

    self.log("Done recording.")

    wav_file_name = r"C:\Users\thoma\OneDrive\Documenten\Master\Scriptie\AI Writer\Final\Final_mapje\Speakerrecordings\recorded_voice.wav"
    if os.path.isfile(wav_file_name):
        expand = 1
        while True:
            expand += 1
            new_wav_file_name = wav_file_name.split(".wav")[0] + str(expand) + ".wav"
            if os.path.isfile(new_wav_file_name):
                continue
            else:
                wav_file_name = new_wav_file_name
                break

    voice_file = write(wav_file_name, sample_rate, wav.astype(np.float32))  # Save as WAV file

    #return wav.squeeze()
    return voice_file

while(1):
    voice_file = record_one()  # get the voice file

I was wondering whether it is possible to set the sample_rate and duration myself, or whether they depend on the other settings. Is it possible to help me fix this issue?

ghost commented 3 years ago

Here is the embedding code from demo_rtvc.py with comments added.

# This part loads a wav file from disk
wav_path = "samples/p240_00000.mp3"
original_wav, sampling_rate = librosa.load(wav_path)

# This part creates the embedding
preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
embed = encoder.embed_utterance(preprocessed_wav)

Now that you're ready to record the user through the microphone, the first part isn't needed. However, you still need to provide original_wav and sampling_rate. The sample rate is user-specified. The wav comes from sounddevice's rec() function.

For recording, you only need a single line of code. Here is a minimal example that uses it.

# Add these imports as needed at the top of your program
import sounddevice as sd
from tqdm import trange
from time import sleep, time

# Choose a sample rate that is compatible with your hardware
sampling_rate = 44100
duration = 5  # seconds

# Start recording the user
original_wav = sd.rec(duration*sampling_rate, sampling_rate, 1)

# Display a progress bar while recording
# This also blocks the next part of the code from running before recording is complete
for i in trange(duration):
    sleep(1)

# This part creates the embedding
preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
embed = encoder.embed_utterance(preprocessed_wav)

If you don't want to record for a fixed duration, you can replace the "display a progress bar" section with this code.

# Get current time to keep track of recording length
start_time = time()
print("Press enter to stop recording")
input()  # This blocks the program from continuing until user presses enter

# Trim wav to actual length of recording
recording_length = time() - start_time
if recording_length < duration:
    original_wav = original_wav[:int(recording_length*sampling_rate)]

Andredenise commented 3 years ago

First of all, thank you for replying so quickly!

If I understand correctly I should put this part in the beginning of my program:

original_wav = sd.rec(duration*sampling_rate, sampling_rate, 1)

, along with the values for duration and sampling_rate? I was wondering whether I need an if statement that checks whether the wav_path already exists and, if so, creates a new file, so that it can run inside the large while loop of the program? Or does the voice cloning only need one sample wav to reconstruct the voice?

Thank you in advance!

ghost commented 3 years ago

All 4 parts in the program outline can be inside the while loop. This will cause the speaker embedding to be generated on each iteration using the latest voice recording. It might be a nice feature so the program does not need to be restarted when switching between different human participants.

If this is not what you want, you can prevent the embed from being updated by initializing it to None and only running the speaker encoder when an embed doesn't already exist. It would look something like this:

# This goes in the setup portion of the script
embed = None

while program_is_running:
    ## 1. Record the user with the computer's microphone
    original_wav = sd.rec(duration*sampling_rate, sampling_rate, 1)

    ## 2. Use automatic speech recognition to determine what the user said
    ## 3. Input to GPT-2 and get a response (text)

    ## 4. Provide text and voice recording to Real-Time-Voice-Cloning to generate audio in the user's voice and play it on speakers or headphones
    if embed is None:    # This makes it run only on the first iteration of the while loop
        preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
        embed = encoder.embed_utterance(preprocessed_wav)

If you have been following the development steps in order, you should have a working speech recognition and voice cloning, which both start by loading a wav file from disk into memory. The code I provided records the user's voice directly into memory so you do not need to check for a wav_path or create any files.

Andredenise commented 3 years ago

Okay, super! I have currently chosen to put all 4 parts in the while loop, so I will keep working on that, as it is also interesting for multiple participants. Most of it is done, but I'm still thinking about a way to quit the program, since pressing a button doesn't feel natural in a conversation with the program. Huge thanks for the help!

AlexSteveChungAlvarez commented 3 years ago

I'm really interested in this project; I am actually trying to make RTVC work in Spanish first, in order to do exactly this next semester (I'm a bachelor student of computer science at Universidad Nacional de Ingenieria, Peru). I made a conversational web app last year, so maybe you could use something like this to quit the program:

while 1:
    user_input = ""
    user_input = vi.record_audio("")
    vi.respond(user_input)
    # Put all the options for saying goodbye here
    if any(phrase in user_input for phrase in ("bye", "goodbye", "see you later")):
        break

In this code, the user first says anything to the assistant and the assistant responds. When the user says "goodbye", the assistant answers one last time and then stops listening; at that point you can quit the program. The record_audio function is basically speech-to-text, so user_input is the text you feed to GPT. I hope you publish your work, so next semester I can use it as a reference!

Andredenise commented 3 years ago

It's great to hear someone else had the same idea. For closing the program I had thought of something similar.

I still have trouble converting my audio to text. Currently my code for the audio input and text conversion looks like this:

sampling_rate = 44100
duration = 5  # seconds

# Start recording the user
original_wav = sd.rec(duration*sampling_rate, sampling_rate, 1)

# Display a progress bar while recording
# Get current time to keep track of recording length
start_time = time()
print("Press enter to stop recording")
input()  # This blocks the program from continuing until user presses enter

# Trim wav to actual length of recording
recording_length = time() - start_time
if recording_length < duration:
    original_wav = original_wav[:int(recording_length*sampling_rate)]

def there_exists(terms):
    for term in terms:
        if term in voice_data:
            return True

r = sr.Recognizer() # initialise a recogniser
# listen for audio and convert it to text:
def audio_to_text(ask=False):
    with sd.play(original_wav, sampling_rate) as source: # wav as source
    #with sr.AudioFile('recording.wav') as source: # File as source
        if ask:
            print(ask)
        audio = r.listen(source)  # listen for the audio via source
        voice_data = ''
        try:
            voice_data = r.recognize_google(audio)  # convert audio to text
        except sr.UnknownValueError: # error: recognizer does not understand
            print('I did not get that')
        except sr.RequestError:
            print('Sorry, the service is down') # error: recognizer is not connected
        print(f">> {voice_data.lower()}") # print what user said
        return voice_data.lower()

voice_data = audio_to_text() # get the voice input
print(voice_data)

My program only has problems listening to the audio and converting it. Is it possible to help me with this? Thanks in advance!

AlexSteveChungAlvarez commented 3 years ago

Here is the code I used last year: https://github.com/AlexChungA/Jarvis/blob/master/voice_interact.py (I took it from another project, though I don't remember where right now). Your audio_to_text function is pretty similar to the record_audio function in that file, and I remember the speech-to-text worked fine. Maybe it helps. By the way, I will be back in 3 hours; if you still have problems then, I can work with you on it if needed.
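
For reference, a typical microphone-to-text helper with the speech_recognition package looks roughly like this (a sketch; it is not necessarily identical to the linked file):

import speech_recognition as sr

r = sr.Recognizer()

def record_audio(prompt=""):
    # Listen on the default microphone and return the transcribed text, lowercased
    if prompt:
        print(prompt)
    with sr.Microphone() as source:  # requires the pyaudio package
        audio = r.listen(source)
    try:
        return r.recognize_google(audio).lower()
    except sr.UnknownValueError:
        print("I did not get that")
    except sr.RequestError:
        print("Sorry, the service is down")
    return ""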

AlexSteveChungAlvarez commented 3 years ago

@Andredenise how can I contact you? I am ready to help.

Andredenise commented 3 years ago

First of all, thanks a lot for the link. You can always contact me at this email: r0673226@student.luca-arts.be. It is currently nighttime here, so if you send me a short message by email, I will get back to you tomorrow.