Open JL17779 opened 1 year ago
Hi
Yes, I believe there is. My hypothesis is that the conditioning-latents model you train, which outputs the .pth file you put in the voices folder (instead of just 2-4 voice samples), "defaults" back to the pre-configured latents it was trained on, which are probably American. I have a perfectly accurate British-speaking voice, but it was "conditioned" on over 8,000 2-5 second snippets of audio from a professional narrator. It really makes a big difference. Simply create a folder in the "voices" directory and put all your audio clips there. It will only take the first 4.1 seconds of each clip, so there is no need to make them longer. Then go to the scripts directory and run `python get_conditioning_latents.py --voice [your voice folder name]`.
I just tried with a person with an Australian accent and a very distinct intonation of words, in particular at the end of a sentence. 378 clips were not nearly enough!
Maybe you can find a free audiobook with, say, 9-13 hours of audio you can chop up. I can recommend librosa's `split` function for this.
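For reference, librosa's splitter (`librosa.effects.split`) works by treating frames whose energy falls far enough below the loudest frame as silence. Here is a minimal pure-NumPy stand-in that sketches the same idea, so you can see what the chopping step does; the function name and thresholds are mine, not from librosa:

```python
import numpy as np

def split_on_silence(y, top_db=40, frame=2048, hop=512):
    """Return (start, end) sample intervals of non-silent audio.

    A rough stand-in for librosa.effects.split: frames more than
    `top_db` dB below the loudest frame are treated as silence.
    """
    n_frames = max(1, 1 + (len(y) - frame) // hop)
    rms = np.array([np.sqrt(np.mean(y[i * hop:i * hop + frame] ** 2))
                    for i in range(n_frames)])
    ref = max(rms.max(), 1e-10)
    db = 20 * np.log10(np.maximum(rms, 1e-10) / ref)
    loud = db > -top_db

    intervals, start = [], None
    for i, flag in enumerate(loud):
        if flag and start is None:
            start = i                      # a non-silent run begins
        elif not flag and start is not None:
            intervals.append((start * hop, i * hop + frame))
            start = None
    if start is not None:                  # run extends to end of signal
        intervals.append((start * hop, len(y)))
    return intervals
```

With a real audiobook you would then slice `y[s:e]` for each interval and write each chunk out as its own short clip.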
NOTE: If there is a GPU on your machine and you have thousands of clips, it can make the GPU run out of memory. In that case, use a machine without a GPU and lots of memory. It will take longer, but it will run. On a 30-CPU datacenter server with 200 GB of memory (not all was used), it took my 8k-file model around 30 minutes to generate, so if you are doing it on your own PC, let it run overnight.
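If you want the CPU-only path on a machine that does have a GPU, one common trick (not specific to Tortoise) is to hide the GPUs from CUDA before the script initializes it:

```python
import os

# Hide all GPUs from CUDA-aware frameworks (PyTorch, etc.) so the
# latents script falls back to CPU. This must be set before the
# framework initializes CUDA, i.e. at the very top of the script.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```

Equivalently, from the shell: `CUDA_VISIBLE_DEVICES="" python get_conditioning_latents.py --voice [your voice folder name]`.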
Hi! I'm travelling a similar path, trying to capture accents. I've managed to produce conditioning latents, but I'm unsure how to use them in generation. Do we just put the .pth file alone in a new voice folder? I tried this, with not-great results.
I have noticed that setting the diffusion sampler to "p", as well as adjusting the preset to very fast, fast, or standard, is a quick method for working with accents (of any kind). DDIM seems to work much better for "American"-style accents.
I use DLAS and it outputs a pth file. I then put the pth file in a directory within TTS and refer to that checkpoint. It seems to capture accents pretty well. Is your method different from what DLAS is doing?
Hi guys. I just went through the simple steps: place a ridiculous number of clips in a new folder, named after the speaker I want to use, and run `python get_conditioning_latents.py --voice [the voice name]`. Again, I had thousands of clips. It really made a difference. It works with all the different vocoders, e.g. the original and BigVGAN, as well as on all quality presets from ultra_fast to high_quality. Perhaps the key is that my speaker, a professional narrator, was very consistent in their use of language.

I'm sorry I cannot share it. I used licensed material (audiobooks). It is a private test, and it would not be appropriate to share it, perhaps not even legal. Going down the path of using non-licensed material for training (like "some currently very popular GPT-driven app" is doing) is not cool. I have run into a very real dilemma: the voice is very clearly the speaker's voice (95% there). As narrators make their money narrating (obviously), "stealing" their voice and style may directly affect them financially, and taking their voice is a bit like someone creating a mask of my face, wearing it around, and claiming they were called iloop001 instead of eloop001. I would not like that. It would feel like someone stole something very personal from me.

I would still advise just using a LOT of consistent material and waiting for the run to complete. As I said, it runs perfectly well on CPU only, but will exhaust the GPU's memory if you have a GPU in your system.
@JoseEliel : Simple steps:
That's it.
Great tutorial :)
I have trouble at step 4 - any ideas?
root@C.6077885:~/tortoise-tts-fast/scripts$ python3 get_conditioning_latents.py --voice clinton
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/tortoise-tts-fast/scripts/get_conditioning_latents.py:48 in
Oh, and when I use one of the existing directories (and swap the files) and run get_conditioning_latents.py, a small (13 kB) .pth file is produced, but it makes no difference: the old latents are still being used for any generation (I use ultra_fast).
Last note: (slow) Tortoise uses these files just fine, but I haven't been able to get the latents script to work there either.
That should be it. Please don't take what I write verbatim, but I hope you get the point. I had great results with around 12 hours of audio.
cd /tortoise-tts-fast/tortoise/scripts
then try these: python3 tortoise_tts.py --preset high_quality --voice fancyvoice --sampler ddim --vocoder Univnet --voicefixer false --cond_free true <some_doc_with_text.txt
or if you want to run ultra fast:
python3 tortoise_tts.py --preset ultra_fast --voice fancyvoice --sampler p --vocoder BigVGAN --voicefixer true --cond_free true --top_p 1 --diffusion_temperature 0.95 <some_doc_with_text.txt
I ended up printing the dictionary in get_conditioning_latents.py. It shows the path information; for me the file path was something strange containing /../, and any changes in /tortoise-tts-fast/tortoise/voices were being ignored. Maybe it's my file system or installation, or someone made changes.
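To sanity-check where a path with `/../` in it actually points, you can normalize it yourself with the standard library (the example path below is illustrative, not taken from my install):

```python
import os

# An odd-looking path containing "/../", as printed from the dictionary.
p = "/root/tortoise-tts-fast/scripts/../tortoise/voices"

# normpath collapses the "../" purely lexically, without touching the disk,
# so you can see which directory the script will really read from.
print(os.path.normpath(p))
```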
I ended up generating latents for a few new voices with 5 audio files each (as suggested in original Tortoise). That worked without OoM issues, but I got OoM at 10 files (despite 24 GB of VRAM).

Using CPU only with plenty of RAM is a great workaround, but it seems silly for such tiny files and small model sizes.
It sounds wrong that it would take up that much VRAM. If you look in the source you will see you can set the option --latent_averaging_mode. It should be set to 0, which is the default if you don't set it, and then it will only take the 4-something seconds of audio from each file. Anyway, it just took me around 2 hours at full throttle on CPUs with 128 GB of memory. IDK, maybe if you dig deeper into the call stack you'll find it is able to identify available memory and batch the audio in smaller chunks. If not, that might be a change request you can file (or implement yourself?).
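The "batch the audio in smaller chunks" idea could look roughly like this. It is only a sketch of the technique, not the repo's actual code; `encode` stands in for whatever per-clip encoding the script does:

```python
def batched(items, batch_size=64):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def average_latents(paths, encode, batch_size=64):
    """Average per-clip encodings without holding all clips in memory.

    Only `batch_size` encoded clips exist at once; a running sum is
    kept instead of stacking every clip's tensor before averaging.
    `encode` is a hypothetical per-file encoder.
    """
    total, count = None, 0
    for chunk in batched(paths, batch_size):
        latents = [encode(p) for p in chunk]
        s = sum(latents)
        total = s if total is None else total + s
        count += len(latents)
    return total / count
```

The running-sum design is what keeps peak memory bounded by the batch size rather than by the total number of clips.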
Is there a way to amend this?