This is what you are looking for: demo_cli.py
Will running this in another script automatically start the real-time cloning without requiring any further input?
No, it will require modifications. Read the comments and adapt it to suit your needs.
okay thanks!
Is there someone who is able to make these modifications? I think this is out of my league.
Can you describe how you would want it to work? We can try to improve the interface.
For example:
import sounddevice as sd
from real_time_voice_cloning import RTVC
voice = RTVC() #performs initialization using config file
voice.clone("target_voice.wav")
audio = voice.text_to_speech(["Hello, this is your voice assistant.", "What can I do for you today?"])
sd.play(audio)
My project is actually a kind of speech assistant based on the OpenAI GPT-2 model, with which you can have a conversation about a specific topic. The problem for me is that the OpenAI script for the model is not text-to-speech software, so it responds with text. My idea was to combine this script with the model so that you can actually have a conversation "with yourself", thanks to the cloning tool. So the real-time voice cloning tool would actually be my way of outputting the results of the OpenAI model. I hope this is reasonably clear?
I think what you want is:
1. Record the user with the computer's microphone.
2. Use automatic speech recognition to determine what the user said.
3. Input the text to GPT-2 and get a response (text).
4. Provide the text and voice recording to Real-Time-Voice-Cloning to generate audio in the user's voice and play it on speakers or headphones.
My questions are:
1. What kind of project is this (for example a school project, hobby, or paid work)?
2. Do the 4 steps above cover everything you want to achieve?
3. Is this a private project, or could it become open source?
4. Do you need a GUI or web interface, or is running it from the prompt acceptable?
First of all, my apologies for the lack of clarity about my project. I am a master's student in graphic design at LUCA School of Arts in Ghent, Belgium. This project is actually my thesis, so it is not a paying project, but it is very important to me. The 4 steps are exactly what I want to achieve. At the moment this is a private project, but if it works, it can certainly become open source. Concerning the GUI or web interface: this is definitely not a priority for me. If the project works, it is fine for me to run it from the prompt, as the focus is on speaking and conversation. Thank you very much!
Here is a version of demo_cli.py that doesn't require any user interaction:
https://github.com/blue-fish/Real-Time-Voice-Cloning/blob/890_non_interactive_generation/demo_cli.py
You would run it like this:
python demo_cli.py --skip_tests --no_save original_voice.wav gpt_output.txt
The gpt_output.txt file can have multiple lines. It is recommended to add a line break after each sentence, because the synthesizer struggles when the inputs become too long.
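For illustration, if the GPT-2 response lives in a Python string, it could be written out one sentence per line before running that command (the naive split on periods below is just a sketch; the variable text is assumed to hold the GPT output):
# Assume `text` holds the GPT-2 response as a single string
with open("gpt_output.txt", "w") as f:
    for sentence in text.split("."):
        sentence = sentence.strip()
        if sentence:
            f.write(sentence + ".\n")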
If the rest of your project uses Python, it is possible to use parts of demo_cli.py to do the generation without having to use Real-Time-Voice-Cloning separately like this. However, that is something you will have to figure out on your own. Good luck with your project!
It's still a bit unclear to me. Below you will find the script for the interactive mode of the GPT-2 model that I use. How do I integrate the version you forwarded into it?
import fire
import json
import os
import numpy as np
import tensorflow as tf

import model, sample, encoder

def interact_model(
    model_name='124M',
    seed=None,
    nsamples=1,
    batch_size=1,
    length=None,
    temperature=1,
    top_k=0,
    top_p=1,
    models_dir='models',
):
    """
    Interactively run the model
    :model_name=124M : String, which model to use
    :seed=None : Integer seed for random number generators, fix seed to reproduce
     results
    :nsamples=1 : Number of samples to return total
    :batch_size=1 : Number of batches (only affects speed/memory). Must divide nsamples.
    :length=None : Number of tokens in generated text, if None (default), is
     determined by model hyperparameters
    :temperature=1 : Float value controlling randomness in boltzmann
     distribution. Lower temperature results in less random completions. As the
     temperature approaches zero, the model will become deterministic and
     repetitive. Higher temperature results in more random completions.
    :top_k=0 : Integer value controlling diversity. 1 means only 1 word is
     considered for each step (token), resulting in deterministic completions,
     while 40 means 40 words are considered at each step. 0 (default) is a
     special setting meaning no restrictions. 40 generally is a good value.
    :models_dir : path to parent folder containing model subfolders
     (i.e. contains the <model_name> folder)
    """
    enc = encoder.get_encoder(model_name, models_dir)
    hparams = model.default_hparams()
    with open(os.path.join(models_dir, model_name, 'hparams.json')) as f:
        hparams.override_from_dict(json.load(f))

    if length is None:
        length = hparams.n_ctx // 2
    elif length > hparams.n_ctx:
        raise ValueError("Can't get samples longer than window size: %s" % hparams.n_ctx)

    with tf.Session(graph=tf.Graph()) as sess:
        context = tf.placeholder(tf.int32, [batch_size, None])
        np.random.seed(seed)
        tf.set_random_seed(seed)
        output = sample.sample_sequence(
            hparams=hparams, length=length,
            context=context,
            batch_size=batch_size,
            temperature=temperature, top_k=top_k, top_p=top_p
        )

        saver = tf.train.Saver()
        ckpt = tf.train.latest_checkpoint(os.path.join(models_dir, model_name))
        saver.restore(sess, ckpt)

        while True:
            raw_text = input("Model prompt >>> ")
            while not raw_text:
                print('Prompt should not be empty!')
                raw_text = input("Model prompt >>> ")
            context_tokens = enc.encode(raw_text)
            generated = 0
            for _ in range(nsamples // batch_size):
                out = sess.run(output, feed_dict={
                    context: [context_tokens for _ in range(batch_size)]
                })[:, len(context_tokens):]
                for i in range(batch_size):
                    generated += 1
                    text = enc.decode(out[i])
                    print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
                    print(text)
            print("=" * 80)

if __name__ == '__main__':
    fire.Fire(interact_model)
Integrate this code with your script instead. It assumes you have the Real-Time-Voice-Cloning repository at the same level as your script.
https://gist.github.com/blue-fish/ecabbca4f1a69701d32852f9f446c077
I've only been working with Python for a few months, so I'm not very good at programming yet. You previously gave a good summary of what the program should look like as a whole (the four steps above).
Is it possible to give me an overview of how demo_cli.py (the new version you sent) and demo_rtvc.py should now work together with the GPT-2 model? What should the structure look like as a whole? My apologies for the many questions.
Is it possible to give me an overview of how demo_cli.py (the new version you sent) and demo_rtvc.py should now work together with the GPT-2 model?
The code in demo_rtvc.py is structured like this:
## Section 1: Setup
# Imports
# Load models
# Use encoder to make speaker embedding
## Section 2: Generation
# Use synthesizer to make mel spectrograms from text
# Use vocoder to make waveform audio from mels
# Play waveform audio
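For concreteness, here is a minimal sketch of what Section 1 could look like, following the repository's demo scripts. The saved_models paths and the reference sample are assumptions and may differ between repository versions:
from pathlib import Path

import librosa
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load the three pretrained models (paths are assumptions; adjust to your checkout)
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained/taco_pretrained"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

# Use encoder to make a speaker embedding from a reference recording
original_wav, sampling_rate = librosa.load("samples/p240_00000.mp3")
preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
embed = encoder.embed_utterance(preprocessed_wav)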
You put Section 1 into the beginning of your script.
You have a while loop which uses GPT to make a variable called text. This needs to be split into a list of sentences called texts. After that, you can insert the code from Section 2 to generate and play back the audio.
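A corresponding sketch of Section 2, assuming the models and embed come from Section 1 above and text holds the GPT output for one turn (the split on periods is a simplification):
import sounddevice as sd

# Split the GPT output into a list of sentences
texts = [s.strip() for s in text.split(".") if s.strip()]

# Use synthesizer to make mel spectrograms from text (one copy of the embedding per sentence)
specs = synthesizer.synthesize_spectrograms(texts, [embed] * len(texts))

# Use vocoder to make waveform audio from the mels, then play each sentence back
for spec in specs:
    generated_wav = vocoder.infer_waveform(spec)
    sd.play(generated_wav, synthesizer.sample_rate)
    sd.wait()  # block until playback finishes before starting the next sentence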
What should the structure look like as a whole?
To build the program I outlined, you have to figure out what each of the sections is doing. This will help you put the parts together.
## Setup
# Put all the imports here.
## 1. Record the user with the computer's microphone
# Inputs: None
# Outputs: An audio waveform (as a numpy array)
#
# The code needs to initialize the input audio device and capture the audio.
# This is how it is done in the toolbox:
# https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/toolbox/ui.py#L219
## 2. Use automatic speech recognition to determine what the user said
# Inputs: Audio waveform from step 1
# Outputs: Text
#
# First, the code needs to set up your ASR package.
# Next, pass the audio to ASR to get the user's transcribed speech.
## 3. Input to GPT-2 and get a response (text)
# Inputs: Text (transcribed speech from user in step 2)
# Outputs: Text
#
# Your script already does this.
## 4. Provide text and voice recording to Real-Time-Voice-Cloning to generate audio in the user's voice and play it on speakers or headphones
# Inputs: Audio waveform from step 1
#         Text (GPT output from step 3)
#
# Outputs: None
#
# Most of these functions are demonstrated in demo_rtvc.py
# a. This section first creates a speaker embedding from the recording in step 1.
# b. You may also want to break up the GPT text into individual sentences.
# c. Input the embedding and text to the synthesizer, to get a mel spectrogram.
# d. Give the mel spectrogram to the vocoder to get an audio waveform (as a numpy array)
# e. Play back the audio through the computer's output sound device.
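Putting the four steps together, the overall loop could be sketched like this. The transcribe() and gpt_respond() helpers are hypothetical placeholders for your own step 2 and step 3 code, and the encoder/synthesizer/vocoder are assumed to be loaded in the Setup section as in demo_rtvc.py:
import sounddevice as sd

sampling_rate = 44100  # choose a rate your hardware supports
duration = 5           # seconds to record the user

while True:
    ## 1. Record the user with the computer's microphone
    original_wav = sd.rec(duration * sampling_rate, sampling_rate, 1)
    sd.wait()  # block until the recording is finished

    ## 2. Use automatic speech recognition to determine what the user said
    user_text = transcribe(original_wav, sampling_rate)   # hypothetical helper

    ## 3. Input to GPT-2 and get a response (text)
    reply_text = gpt_respond(user_text)                   # hypothetical helper

    ## 4. Generate audio in the user's voice and play it on the speakers
    preprocessed_wav = encoder.preprocess_wav(original_wav.squeeze(), sampling_rate)
    embed = encoder.embed_utterance(preprocessed_wav)
    texts = [s.strip() for s in reply_text.split(".") if s.strip()]
    specs = synthesizer.synthesize_spectrograms(texts, [embed] * len(texts))
    for spec in specs:
        sd.play(vocoder.infer_waveform(spec), synthesizer.sample_rate)
        sd.wait()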
My apologies for the many questions.
No need to apologize, but please understand that it takes away from my primary focus of improving the repo's codebase and model quality. I normally don't answer questions or provide help with personal projects, because I like to work on things that help all users. In this case, I made an exception because I remember what it is like to be new to programming and know that having a little guidance can make an impact.
Take some time to consider what I have provided here, and try to start building the program. Since I provided demo_rtvc.py, you can start at the end and work backwards: first run demo_rtvc.py in Python to make sure it is set up properly, then build up the remaining steps from there.
Please try to make use of other support channels (Stack Overflow, etc.) whenever possible. And consider releasing your project as open source when it is done.
I cannot explain how much (thinking) work this saves me. This is very useful and clear! Thank you so much for all your help!! I am always willing to share my final result if it would be interesting and useful.
Just wanted to +100 that this discussion has been a very useful read for me and my project as well. Thanks @blue-fish !
Hey, I still have one structural problem with my project that I cannot solve. For the voice cloner to work, I have to provide a prerecorded wav sample to this code:
wav_path = "samples/p240_00000.mp3"
original_wav, sampling_rate = librosa.load(wav_path)
preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
embed = encoder.embed_utterance(preprocessed_wav)
print("Created the embedding")
Currently my code looks like this, but it doesn't work.
import speech_recognition as sr
import sounddevice as sd
import numpy as np
import os
from scipy.io.wavfile import write
from time import sleep
import soundfile as sf
import matplotlib.pyplot as plt
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.figure import Figure
from PyQt5.QtCore import Qt, QStringListModel
from PyQt5.QtWidgets import *
from pathlib import Path
from typing import List, Set
import umap
import sys
import pyaudio
import wave
from pathlib import Path
import numpy as np
import soundfile as sf
import librosa
import argparse
import torch
from audioread.exceptions import NoBackendError
colormap = np.array([ [0, 127, 70], [255, 0, 0], [255, 217, 38], [0, 135, 255], [165, 0, 165], [255, 167, 255], [97, 142, 151], [0, 255, 255], [255, 96, 38], [142, 76, 0], [33, 0, 127], [0, 0, 0], [183, 183, 183], [76, 255, 0], ], dtype=np.float) / 255
def record_one(self, sample_rate, duration):
    self.log("Recording %d seconds of audio" % duration)
    sd.stop()
    try:
        wav = sd.rec(duration * sample_rate, sample_rate, 1)
    except Exception as e:
        print(e)
        self.log("Could not record anything. Is your recording device enabled?")
        return None
    for i in np.arange(0, duration, 0.1):
        self.set_loading(i, duration)
        sleep(0.1)
    self.set_loading(duration, duration)
    sd.wait()
    self.log("Done recording.")
    wav_file_name = r"C:\Users\thoma\OneDrive\Documenten\Master\Scriptie\AI Writer\Final\Final_mapje\Speakerrecordings\recorded_voice.wav"
    if os.path.isfile(wav_file_name):
        expand = 1
        while True:
            expand += 1
            new_wav_file_name = wav_file_name.split(".wav")[0] + str(expand) + ".wav"
            if os.path.isfile(new_wav_file_name):
                continue
            else:
                wav_file_name = new_wav_file_name
                break
    voice_file = write(wav_file_name, sample_rate, wav.astype(np.float32))  # Save as WAV file
    #return wav.squeeze()
    return voice_file

while(1):
    voice_file = record_one()  # get the voice file
I was wondering if it is possible to set the sample_rate and duration yourself, or whether they depend on the other components? Is it possible to help me fix this issue?
Here is the embedding code from demo_rtvc.py with comments added.
# This part loads a wav file from disk
wav_path = "samples/p240_00000.mp3"
original_wav, sampling_rate = librosa.load(wav_path)
# This part creates the embedding
preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
embed = encoder.embed_utterance(preprocessed_wav)
Now that you're ready to record the user through the microphone, the first part isn't needed. However, you still need to provide original_wav and sampling_rate. The sample rate is user-specified. The wav comes from sounddevice's rec() function.
For recording, you only need a single line of code. Here is a minimal example that uses it.
# Add these imports as needed at the top of your program
import sounddevice as sd
from tqdm import trange
from time import sleep, time
# Choose a sample rate that is compatible with your hardware
sampling_rate = 44100
duration = 5 # seconds
# Start recording the user
original_wav = sd.rec(duration*sampling_rate, sampling_rate, 1)
# Display a progress bar while recording
# This also blocks the next part of the code from running before recording is complete
for i in trange(duration):
    sleep(1)
# This part creates the embedding
preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
embed = encoder.embed_utterance(preprocessed_wav)
If you don't want to record for a fixed duration, you can replace the "display a progress bar" section with this code.
# Get current time to keep track of recording length
start_time = time()
print("Press enter to stop recording")
input() # This blocks the program from continuing until user presses enter
# Trim wav to actual length of recording
recording_length = time() - start_time
if recording_length < duration:
    original_wav = original_wav[:int(recording_length*sampling_rate)]
First of all, thank you for replying so quickly!
If I understand correctly, I should put this part at the beginning of my program: original_wav = sd.rec(duration*sampling_rate, sampling_rate, 1), along with the values for duration and sampling_rate? I was wondering whether I need an if statement that checks if the wav_path already exists and, if so, creates a new file, so that it can run inside the large while loop of the program? Or does the voice cloning only need one sample wav to reconstruct the voice?
Thank you in advance!
All 4 parts in the program outline can be inside the while loop. This will cause the speaker embedding to be generated on each iteration using the latest voice recording. It might be a nice feature so the program does not need to be restarted when switching between different human participants.
If this is not what you want, you can prevent the embed from being updated by initializing it to None and running the speaker encoder only if it doesn't already exist. It would look something like this:
# This goes in the setup portion of the script
embed = None
while program_is_running:
    ## 1. Record the user with the computer's microphone
    original_wav = sd.rec(duration*sampling_rate, sampling_rate, 1)

    ## 2. Use automatic speech recognition to determine what the user said

    ## 3. Input to GPT-2 and get a response (text)

    ## 4. Provide text and voice recording to Real-Time-Voice-Cloning to generate audio in the user's voice and play it on speakers or headphones
    if embed is None: # This makes it run only on the first iteration of the while loop
        preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
        embed = encoder.embed_utterance(preprocessed_wav)
If you have been following the development steps in order, you should have a working speech recognition and voice cloning, which both start by loading a wav file from disk into memory. The code I provided records the user's voice directly into memory so you do not need to check for a wav_path or create any files.
Okay, super! I have currently chosen to put all 4 parts in the while loop, so I will continue working on that, as it is also interesting for multiple participants. Most of it is done, but I'm still thinking about a way to quit the program, since pressing a button doesn't feel very natural in a conversation with the program. Huge thanks for the help!
I'm really interested in this project. I'm actually trying to make RTVC work in Spanish first, in order to do exactly this next semester (I'm a bachelor student of computer science at Universidad Nacional de Ingenieria, Peru). I made a conversational web app last year, so maybe you could use something like this to quit the program:
while 1:
    user_input = ""
    user_input = vi.record_audio("")
    vi.respond(user_input)
    # here you may put all the options for saying goodbye
    if ("bye" in user_input) or ("goodbye" in user_input) or ("see you later" in user_input):
        break
In this code, the user first says something to the assistant and the assistant responds. When the user says "goodbye", the assistant answers and then stops listening; at that point you can quit the program. The record_audio function is basically speech-to-text, so user_input is the text you feed to GPT. I hope you publish your work, so next semester I can use it as a reference!
It's great to hear someone else had the same idea. I had thought of something similar for closing the program.
I still have trouble converting my audio to text. Currently my code for the audio input and conversion to text looks like this:
sampling_rate = 44100
duration = 5 # seconds
# Start recording the user
original_wav = sd.rec(duration*sampling_rate, sampling_rate, 1)
# Display a progress bar while recording
# Get current time to keep track of recording length
start_time = time()
print("Press enter to stop recording")
input() # This blocks the program from continuing until user presses enter
# Trim wav to actual length of recording
recording_length = time() - start_time
if recording_length < duration:
    original_wav = original_wav[:int(recording_length*sampling_rate)]

def there_exists(terms):
    for term in terms:
        if term in voice_data:
            return True

r = sr.Recognizer() # initialise a recogniser

# listen for audio and convert it to text:
def audio_to_text(ask=False):
    with sd.play(original_wav, sampling_rate) as source: # wav as source
    #with sr.AudioFile('recording.wav') as source: # File as source
        if ask:
            print(ask)
        audio = r.listen(source) # listen for the audio via source
        voice_data = ''
        try:
            voice_data = r.recognize_google(audio) # convert audio to text
        except sr.UnknownValueError: # error: recognizer does not understand
            print('I did not get that')
        except sr.RequestError:
            print('Sorry, the service is down') # error: recognizer is not connected
        print(f">> {voice_data.lower()}") # print what user said
        return voice_data.lower()

voice_data = audio_to_text() # get the voice input
print(voice_data)
My program only has problems listening to the audio and converting it. Is it possible to help me with this? Thanks in advance!
This is the code I used last year: https://github.com/AlexChungA/Jarvis/blob/master/voice_interact.py (I took it from another project, but I don't remember right now where it came from). Your audio_to_text function is pretty similar to the record_audio function in that code, and I remember the speech-to-text worked fine, so maybe it helps. By the way, I will be back in 3 hours; if you still have problems then, I can work with you on it if needed.
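For what it's worth, one likely issue in the snippet above is that sounddevice's play() does not return an audio source that speech_recognition can listen to. A minimal sketch of an alternative, assuming original_wav is the float32 numpy array returned by sd.rec(), is to hand the samples to speech_recognition directly as AudioData:
import numpy as np
import speech_recognition as sr

def audio_to_text(wav, sampling_rate):
    """Transcribe a float32 numpy recording with the free Google web API."""
    # speech_recognition expects 16-bit PCM bytes, so convert the float samples
    pcm16 = (np.clip(wav.squeeze(), -1.0, 1.0) * 32767).astype(np.int16)
    audio = sr.AudioData(pcm16.tobytes(), sampling_rate, 2)  # sample width = 2 bytes
    r = sr.Recognizer()
    try:
        return r.recognize_google(audio).lower()  # convert audio to text
    except sr.UnknownValueError:   # recognizer did not understand
        print("I did not get that")
    except sr.RequestError:        # recognizer could not reach the service
        print("Sorry, the service is down")
    return ""

voice_data = audio_to_text(original_wav, sampling_rate)
print(voice_data)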
@Andredenise how can I contact you? I am ready to help.
First of all, thanks a lot for the link. You can always contact me at this email: r0673226@student.luca-arts.be. It is currently nighttime here, so I will contact you tomorrow; feel free to already send me a short message by mail.
This is a lovely program, but I'm searching for a way to implement the voice cloning in my other script. It should run automatically, without using the toolbox, so that it can be used as a speech assistant that sounds the way I want. Is there a way to do this? Thanks!