JarodMica / StyleTTS-WebUI


Functional multi-file combination method for longer prompts. #28

Closed: JonSingleton closed this issue 1 month ago

JonSingleton commented 2 months ago

EDIT: I created a pull request that implements this a little more cleanly (or, at the very least, Jarod can rework it and implement it however he sees best).

This code is adapted from the NeuralVox readme's approach to longer prompts and uses TortoiseTTS's split_and_recombine_text function. I sloppily copied the tortoise site-package from my copy of your ai-voice-cloning v3 repo into the StyleTTS2 WebUI venv's site-packages (I didn't install it, since I didn't want to deal with dependency conflicts, and I'm only using it for split_and_recombine_text) and worked it into your generate_audio function in webui.py. Ugly code below; it's a direct replacement for the generate_audio function.

If someone who is better with Python could do a cleaner/prettier job of implementing this (or just fold the specific split function into your code directly?) and open a pull request, that would be great (or Jarod himself, of course).

That said, I've tested it and was able to generate 3 minutes of audio in 16 seconds. That's just the longest I tried; I couldn't tell you what limit there would be, if any.

Feel free to move this to discussions if you think it belongs there instead.
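For reference, here's a quick standalone sketch of what split_and_recombine_text does on its own. The desired_length/max_length values below are, as far as I know, the TortoiseTTS defaults, written out explicitly just for illustration:

```python
from tortoise.utils.text import split_and_recombine_text

long_text = (
    "This is the first sentence of a long prompt. Here is another one. "
    "More sentences follow, and once the running length passes the desired "
    "chunk size, the text is split on sentence boundaries into separate chunks."
)

# Split into sentence-aligned chunks of roughly desired_length characters,
# never exceeding max_length.
chunks = split_and_recombine_text(long_text, desired_length=200, max_length=300)
for i, chunk in enumerate(chunks):
    print(i, len(chunk), repr(chunk[:40]))
```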

```python
from tortoise.utils.text import split_and_recombine_text
import numpy as np
from scipy.io.wavfile import write

original_seed = int(seed)
reference_audio_path = os.path.join(voices_root, voice, reference_audio_file)
reference_dicts = {f'{voice}': f"{reference_audio_path}"}
# noise = torch.randn(1, 1, 256).to(device)
start = time.time()
if original_seed == -1:
    seed_value = random.randint(0, 2**32 - 1)
else:
    seed_value = original_seed
set_seeds(seed_value)
for k, path in reference_dicts.items():
    mean, std = -4, 4
    # print(f'model:{model}')
    ref_s = compute_style(path, model, to_mel, mean, std, device)

    # Split the prompt into sentence-sized chunks and run inference on each,
    # collecting the per-chunk waveforms.
    texts = split_and_recombine_text(text)
    audios = []

    # wav1 = inference(text, ref_s, model, sampler, textcleaner, to_mel, device, model_params, global_phonemizer=global_phonemizer, alpha=alpha, beta=beta, diffusion_steps=diffusion_steps, embedding_scale=embedding_scale)
    for t in texts:
        audios.append(inference(t, ref_s, model, sampler, textcleaner, to_mel, device, model_params, global_phonemizer=global_phonemizer, alpha=alpha, beta=beta, diffusion_steps=diffusion_steps, embedding_scale=embedding_scale))

    # Elapsed wall-clock time for the whole prompt (printed as RTF below).
    rtf = (time.time() - start)
    # Debug: echo the inference call and its arguments.
    print(f'inference({text}, {ref_s}, {model}, {sampler}, {textcleaner}, {to_mel}, {device}, {model_params}, global_phonemizer={global_phonemizer}, alpha={alpha}, beta={beta}, diffusion_steps={diffusion_steps}, embedding_scale={embedding_scale})')
    print(f"RTF = {rtf:5f}")
    print(f"{k} Synthesized:")

    # Stitch the chunk waveforms back together and write a single output file.
    os.makedirs("results", exist_ok=True)
    audio_opt_path = os.path.join("results", f"{voice}_output.wav")
    write(audio_opt_path, 24000, np.concatenate(audios))
```
Denshirenji-san commented 2 months ago

Hey, just want to say that this would be a really big improvement. I find StyleTTS2 really useful for my purposes, and the biggest downside is the length of audio it can generate. The ability to change the output style from the reference file is so good.

JonSingleton commented 2 months ago

I've been using the implementation in my pull request for ~2 days and haven't noticed any issues so far. It's a band-aid fix; it doesn't make StyleTTS take the full prompt into account as a single input, but if the model file is well trained the output should be consistent and you won't notice where the chunks are concatenated.
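If the seams ever do become audible, one option (not part of the PR, just a hypothetical sketch; join_with_pause is a name I made up here) would be to insert a short silence between chunks before writing the file:

```python
import numpy as np

# Hypothetical helper, not in the PR: concatenate per-chunk waveforms with a
# brief pause between them so sentence boundaries don't butt up against each other.
def join_with_pause(chunks, sr=24000, pause_s=0.15):
    pause = np.zeros(int(sr * pause_s), dtype=chunks[0].dtype)
    pieces = []
    for i, c in enumerate(chunks):
        pieces.append(c)
        if i < len(chunks) - 1:
            pieces.append(pause)
    return np.concatenate(pieces)

# Instead of write(audio_opt_path, 24000, np.concatenate(audios)):
# write(audio_opt_path, 24000, join_with_pause(audios))
```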

JarodMica commented 1 month ago

PR added