DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0

Clone a Voice. How to improve? #148

Closed vikolaz closed 2 weeks ago

vikolaz commented 1 year ago
import os
import torch
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface

if __name__ == '__main__':
    tts = ToucanTTSInterface(device="cuda" if torch.cuda.is_available() else "cpu", tts_model_path="Meta", language="it")

    input_text = "my text to say"

    # Loop through the speaker reference audio files in the folder
    speaker_reference_folder = "input/folder"
    for file_name in os.listdir(speaker_reference_folder):
        if file_name.endswith('.wav'):
            speaker_reference = os.path.join(speaker_reference_folder, file_name)

            # Set the speaker embedding to clone the voice
            tts.set_utterance_embedding(speaker_reference)

            # Synthesize speech with the cloned voice; derive the output name
            # from the reference file, otherwise every iteration overwrites
            # the same "audios/cloned_voice.wav"
            output_file_name = os.path.join("audios", f"cloned_voice_{file_name}")
            tts.read_to_file(text_list=[input_text], file_location=output_file_name)

    del tts

I used this method to clone a voice. The result is somewhat similar to the original voice, but I guess it can improve.

Is there a different/better approach to do this? How big should my dataset be?

So far I've used about 7 samples of 1 minute each.

Thanks

vikolaz commented 1 year ago

Actually, it looks like it only takes one file into consideration, even if I give it more as input.

Ca-ressemble-a-du-fake commented 1 year ago

If set_utterance_embedding does not give you satisfactory results, I think you have to fine-tune the model (Meta) properly on your dataset.

If you only want to use a speaker reference, 6 to 12 seconds is enough (a single one per speaker, as you already guessed).
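To stay in that 6-to-12-second range, a long reference recording can simply be trimmed before it is passed to set_utterance_embedding. Here is a minimal sketch using only Python's standard wave module; the trim_reference helper and the file paths are my own illustration, not part of IMS-Toucan:

```python
import wave

def trim_reference(in_path, out_path, max_seconds=12):
    """Copy at most max_seconds of audio from in_path to out_path."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        max_frames = int(params.framerate * max_seconds)
        frames = src.readframes(min(params.nframes, max_frames))
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)  # nframes is corrected on close by writeframes
        dst.writeframes(frames)

# e.g. trim_reference("input/folder/speaker.wav", "input/folder/speaker_12s.wav")
```

The trimmed file can then be used as the single speaker reference in the script above.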

vikolaz commented 1 year ago

Do you know of any specific guides or tutorials that provide step-by-step instructions on how to perform fine-tuning with ToucanTTS?

I'm a beginner at coding :)

Ca-ressemble-a-du-fake commented 1 year ago

I am currently writing one, but it is in French and still not finished, since it is one of my many side projects!

Yet if you carefully follow this project's readme file "quietly" 😉 https://github.com/DigitalPhonetics/IMS-Toucan#build-a-toucantts-pipeline you will be able to fine-tune the Meta model on your dataset (you'll probably need more data than 7 minutes of audio). The instructions are really sufficient, and there are already examples in the files to be modified to guide you.

Please mind that you need to create your dataset with a transcription of each audio sample (10 sec max per sample). Ask ChatGPT how to generate a dataset for TTS training; it will give you advice if you need some. You'll also need a GPU for the training.
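As a rough illustration of the kind of audio-to-transcript mapping the pipeline recipes work with, here is a sketch that builds a path-to-transcript dictionary from a pipe-separated metadata file. The metadata.csv layout and the function name are my assumptions for illustration, not a format mandated by IMS-Toucan; check the example recipe files in the repo for the exact expected structure:

```python
import os

def build_path_to_transcript_dict(dataset_root):
    """Map each audio file to its transcript, read from a
    pipe-separated metadata file: <wav filename>|<transcript>."""
    path_to_transcript = {}
    metadata_file = os.path.join(dataset_root, "metadata.csv")
    with open(metadata_file, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            file_name, transcript = line.split("|", 1)
            wav_path = os.path.join(dataset_root, "wavs", file_name)
            # skip entries whose audio file is missing on disk
            if os.path.exists(wav_path):
                path_to_transcript[wav_path] = transcript
    return path_to_transcript
```

Keeping the transcripts in one metadata file like this makes it easy to spot missing or mislabeled samples before starting a long training run.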

thoraxe commented 1 year ago

@vikolaz I have not officially published this yet, but:

https://github.com/OpenShiftDemos/ToucanTTS-RHODS-voice-cloning

Flux9665 commented 11 months ago

A small update on this: Zero-shot voice cloning is being worked on right now. It does not sound good yet and I've already put multiple months into this. But hopefully with the next version, the model can be used to speak in an unseen voice much better even without finetuning and everything will be a bit simpler.

adhikjoshi commented 1 week ago

> A small update on this: Zero-shot voice cloning is being worked on right now. It does not sound good yet and I've already put multiple months into this. But hopefully with the next version, the model can be used to speak in an unseen voice much better even without finetuning and everything will be a bit simpler.

Is it fixed in new release?

Flux9665 commented 1 week ago

In the new release, voice cloning is definitely much better than it was, but there is still plenty of room for improvement. I'll make an English-only checkpoint in the next few weeks that's going to be focused on speaker adaptation.

adhikjoshi commented 1 week ago

> In the new release, voice cloning is definitely much better than it was, but there is still plenty of room to improve. I'll make an English-Only checkpoint in the next few weeks that's going to be focussed on speaker adaptation.

Can you also share training info afterwards? I would like to train voice cloning for other languages.