152334H / tortoise-tts-fast

Fast TorToiSe inference (5x or your money back!)
GNU Affero General Public License v3.0

American accent for an English accent speaker? #56

Open JL17779 opened 1 year ago

JL17779 commented 1 year ago

Is there a way to amend this?

eloop001 commented 1 year ago

Hi

Yes, I believe there is. My hypothesis is that the conditioning latents you generate (the .pth file you put in the voices folder instead of just 2-4 voice samples) "default" back to the pre-configured latents the model was trained on, which are probably American. I have a perfectly accurate British-speaking voice, but it was "conditioned" on over 8,000 2-5 second snippets of audio from a professional narrator. It really makes a big difference. Simply create a folder in the "voices" directory and put all your audio clips there. The script only takes the first 4.1 seconds of each clip, so there is no need to make them longer. Then go to the scripts directory and run python get_conditioning_latents.py --voice [your voice folder name]

I just tried with a person with an Australian accent and a very distinct intonation of words, in particular at the end of a sentence. 378 clips were not nearly enough!

Maybe you can find a free audiobook with, say, 9-13 hours of audio you can chop up. I can recommend using librosa's split function (librosa.effects.split) for that.
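If you want to see what that silence-based splitting actually does, here is a rough pure-NumPy stand-in for librosa.effects.split (not its real implementation; the frame size and dB threshold below are illustrative, not librosa's defaults):

```python
import numpy as np

def split_on_silence(y, top_db=40.0, frame_len=2048, hop=512):
    """Return (start, end) sample indices of non-silent regions,
    detected by frame-wise RMS energy relative to the loudest frame."""
    n_frames = max(1, 1 + (len(y) - frame_len) // hop)
    rms = np.array([
        np.sqrt(np.mean(y[i * hop : i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])
    # energy in dB relative to the loudest frame
    db = 20 * np.log10(np.maximum(rms, 1e-10) / max(rms.max(), 1e-10))
    loud = db > -top_db
    # contiguous runs of loud frames -> (start, end) sample ranges
    edges = np.diff(loud.astype(int))
    starts = list(np.where(edges == 1)[0] + 1)
    ends = list(np.where(edges == -1)[0] + 1)
    if loud[0]:
        starts = [0] + starts
    if loud[-1]:
        ends = ends + [len(loud)]
    return [(s * hop, e * hop) for s, e in zip(starts, ends)]

# Demo: 1 s of tone, 1 s of silence, 1 s of tone at 22050 Hz
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
y = np.concatenate([tone, np.zeros(sr), tone])
segments = split_on_silence(y)
print(len(segments))  # two non-silent regions
```

Each returned (start, end) range can then be written out as its own clip; since only the first 4.1 seconds are used anyway, longer segments can simply be truncated.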

NOTE: If there is a GPU on your machine and you have thousands of clips, it can make the GPU run out of memory. In that case, use a machine without a GPU and lots of memory. It will take longer, but it will run. On a 30 CPU datacenter server with 200GB of memory (not all of it was used), it took around 30 minutes to generate the latents for my 8k files, so if you are doing it on your own PC, let it run overnight.
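If your machine does have a GPU but you want to force the CPU-only path anyway, one general workaround (not specific to this repo) is to hide the GPU from CUDA before any CUDA-using library is imported:

```python
import os

# Hide all CUDA devices so any CUDA-using library (e.g. PyTorch) falls
# back to the CPU. This must be set before that library is imported,
# since CUDA device visibility is read at initialization time.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))
```

Equivalently, `CUDA_VISIBLE_DEVICES="" python get_conditioning_latents.py --voice [your voice]` sets it from the shell for a single run.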

JoseEliel commented 1 year ago

Hi! I'm travelling a similar path, trying to capture accents. I've managed to produce conditioning latents, but I'm unsure how to use them in generation. Do we just put the .pth file alone in a new voice folder? I tried this without great results.

blasphemousjohn commented 1 year ago

I have noticed that setting the diffusion sampler to "p", as well as adjusting the preset to very fast, fast, or standard, is a quick method for working with accents (of any kind). DDIM seems to work much better for "American"-style accents.

tanfarou commented 1 year ago

I use DLAS and it outputs a pth file. I then put the pth file in a directory within TTS and refer to that checkpoint. It seems to capture accents pretty well. Is your method different from what DLAS is doing?

eloop001 commented 1 year ago

Hi guys. I just went through the simple steps: place a ridiculous amount of clips in a new folder named after the speaker I want to use, then run python get_conditioning_latents.py --voice [the voice name] Again, I had thousands of clips, and it really made a difference. It works with all the different kinds, e.g. the original and BigVGAN, as well as on all qualities from ultra_fast to high_quality. Perhaps the key is that my speaker, a professional narrator, was very consistent in their use of language.

I'm sorry I cannot share. I used licensed material (audiobooks). It is a private test, but it would not be appropriate to share it, perhaps not even legal. Going down the path of using non-licensed material for training (like "some currently very popular GPT-driven app" is doing) is not cool. I have run into a very real dilemma. The voice is very clearly the speaker's voice (95% there). As narrators make their money narrating (obviously), "stealing" their voice and style may directly affect them financially. Taking their voice is a bit like someone creating a mask of my face, wearing it around, and claiming to be called iloop001 instead of eloop001. I would not like that. It would feel like someone stole something very personal from me.

I would still advise to just use a LOT of consistent material and wait for the training to complete. As I said, it runs perfectly well on CPU only, but it will exhaust the GPU's memory if you have a GPU in your system.

@JoseEliel : Simple steps:

  1. Get LOTS of samples of the voice in max 4.1 second duration.
  2. Create a new folder under /tortoise/voices/[name of your voice]
  3. Place the clips in that directory.
  4. Run /scripts/get_conditioning_latents.py --voice [your voice]
  5. Wait a long time. No progress bar will appear.
  6. There will be a file called [name of your voice].pth in the scripts/results/latents/ folder. This folder will be created automatically.
  7. Delete all the audio files in the folder from step 2.
  8. Move the .pth file to the folder.

that's it..
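As far as I can tell, the .pth file from step 6 is just a torch-saved tuple of the two conditioning tensors (the autoregressive latent and the diffusion latent), which is why dropping it alone into the voice folder is enough. A quick round-trip sketch, where the shapes are placeholders for illustration and not the real model dimensions:

```python
import torch

# Assumption: the .pth produced by get_conditioning_latents.py is a
# torch-saved tuple (autoregressive_latent, diffusion_latent).
# The shapes below are made up for illustration only.
auto_latent = torch.zeros(1, 1024)
diffusion_latent = torch.zeros(1, 2048)
torch.save((auto_latent, diffusion_latent), "fancyname.pth")

# Loading it back the way a consumer would:
auto_loaded, diff_loaded = torch.load("fancyname.pth", map_location="cpu")
print(auto_loaded.shape, diff_loaded.shape)
```

If generation ignores your new latents (as reported below), printing the shapes of the loaded tuple like this is a cheap sanity check that the file is being found and parsed at all.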

bluusun commented 1 year ago

Great tutorial :)

I have trouble at step 4 - any ideas?

```
root@C.6077885:~/tortoise-tts-fast/scripts$ python3 get_conditioning_latents.py --voice clinton
Traceback (most recent call last):
  File "/root/tortoise-tts-fast/scripts/get_conditioning_latents.py", line 48, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/simple_parsing/decorators.py", line 130, in _wrapper
    return function(*positionals, **keywords)
  File "/root/tortoise-tts-fast/scripts/get_conditioning_latents.py", line 36, in main
    cond_paths = voices[voice]
KeyError: 'clinton'

root@C.6077885:~/tortoise-tts-fast/scripts$ ls -lh ../tortoise/voices/clinton/
total 2.2M
-rw-r--r-- 1 root root 431K Mar 29 23:07 output81.wav
-rw-r--r-- 1 root root 431K Mar 29 23:07 output82.wav
-rw-r--r-- 1 root root 431K Mar 29 23:07 output83.wav
-rw-r--r-- 1 root root 431K Mar 29 23:07 output84.wav
-rw-r--r-- 1 root root 430K Mar 29 23:07 output85.wav
```

bluusun commented 1 year ago

Oh, and when I use one of the existing directories (and swap in my files) and run get_conditioning_latents.py, a small (13 kB) .pth file is produced, but it makes no difference. The old model is still being used for any generation (I use ultra_fast).

bluusun commented 1 year ago

Last note: (slow) Tortoise uses these files just fine, but I haven't been able to get the latents script to work there either.

eloop001 commented 1 year ago
  1. cd /tortoise-tts-fast/tortoise/voices
  2. mkdir fancyname
  3. cd fancyname
  4. Place all your files here. Remember that they all have to fit into GPU memory, so I usually grab a cheap 30 CPU machine with 128GB of memory on Google Cloud, because when there is no GPU, the script will just use (ALL) of the CPUs and most of the memory.
  5. cd /tortoise-tts-fast/tortoise/scripts
  6. python get_conditioning_latents.py --voice fancyname (after some hours a .pth file will appear in /tortoise-tts-fast/tortoise/scripts/results/fancyname)
  7. rm /tortoise-tts-fast/tortoise/voices/fancyname/*
  8. mv /tortoise-tts-fast/tortoise/scripts/results/fancyname/fancyname.pth /tortoise-tts-fast/tortoise/voices/fancyname/

That should be it. Please don't take what I write verbatim, but I hope you get the point. I had great results with around 12 hours of audio.

cd /tortoise-tts-fast/tortoise/scripts

then try these: python3 tortoise_tts.py --preset high_quality --voice fancyvoice --sampler ddim --vocoder Univnet --voicefixer false --cond_free true <some_doc_with_text.txt

or if you want to run ultra fast:

python3 tortoise_tts.py --preset ultra_fast --voice fancyvoice --sampler p --vocoder BigVGAN --voicefixer true --cond_free true --top_p 1 --diffusion_temperature 0.95 <some_doc_with_text.txt

bluusun commented 1 year ago

I ended up printing the voices dictionary in get_conditioning_latents.py. It shows the path information; for me the file path was something strange with /../ in it, so any changes in /tortoise-tts-fast/tortoise/voices would be ignored. Maybe it's my file system or installation, or someone made changes.

I ended up training a few new voices with 5 audio files (as suggested in original Tortoise). That worked without OoM issues, but I got OoM at 10 files (despite 24GB of VRAM).

Using CPU only and plenty RAM is a great workaround but seems silly for these tiny files and small model sizes.

eloop001 commented 1 year ago

It sounds wrong that it would take up that much VRAM. If you look in the source, you will see you can set the option --latent_averaging_mode. It should be set to 0, which is the default if you don't set it, and then it will only take the first 4.1 seconds of audio from each file. Anyway, it just took me around 2 hours with full throttle on CPUs and 128 GB of memory. IDK, maybe if you dig deeper into the call stack you may find it is able to identify available memory and batch the audio in smaller chunks. If not, that might be a change request you could make (or implement something yourself?).
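The "batch the audio in smaller chunks" idea can be sketched without the model at all: instead of stacking thousands of per-clip embeddings in memory and averaging once at the end, keep a running mean so memory stays constant in the number of clips. In this sketch each "embedding" is a random vector standing in for one clip's conditioning encoding (the dimensions and count are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 16   # placeholder, not the real latent dimension
n_clips = 1000       # placeholder clip count

# Running (Welford-style) mean: memory use is O(embedding_dim),
# independent of how many clips are processed.
running_mean = np.zeros(embedding_dim)
for i in range(n_clips):
    emb = rng.normal(size=embedding_dim)  # stand-in for encoding one clip
    running_mean += (emb - running_mean) / (i + 1)

print(running_mean.shape)
```

The result is numerically the same as averaging the full stack, so a change along these lines would not alter the produced latents, only the peak memory needed to compute them.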