PlayVoice / whisper-vits-svc

Core Engine of Singing Voice Conversion & Singing Voice Clone
https://huggingface.co/spaces/maxmax20160403/sovits5.0
MIT License

Increasing SVC inference speed #193

Open KaikeWesleyReis opened 1 month ago

KaikeWesleyReis commented 1 month ago

Hi, I'm developing a personal project: a conversational chatbot. The idea is quite simple: have a chat with Harbinger, the first Reaper (from the Mass Effect series). I found an optimal solution for generating his voice from text: use vits-ljspeech-base from Coqui TTS (without any fine-tuning) to generate the audio, then use your SVC model, fine-tuned, to apply his voice to the generated audio. For example, given this sentence:

Organic intellect, fascinated by the patterns of the universe. I, Harbinger, have witnessed the harmony of numbers governing the cosmos. The intricate dance of primes, the elegance of elliptic curves, and the recursion of Fibonacci's sequence all resonate with my being. Which aspect of number theory would you like to dissect, researcher?

I measured the following time for each step: [image: time]
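A minimal sketch of how per-step timings like these could be collected, assuming the pipeline is two Python calls (TTS, then SVC). The functions `fake_tts` and `fake_svc` are hypothetical placeholders, not the Coqui TTS or whisper-vits-svc APIs; the real calls would go in their place.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per pipeline stage.
timings = {}

@contextmanager
def timed(stage):
    """Measure the wall-clock time of the enclosed block and store it under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)

# Hypothetical stand-ins for the real TTS and SVC calls (assumptions, not real APIs).
def fake_tts(text):
    time.sleep(0.01)  # pretend to synthesize speech
    return [0.0] * 16

def fake_svc(audio):
    time.sleep(0.02)  # pretend to convert the voice
    return audio

with timed("tts"):
    audio = fake_tts("Organic intellect, fascinated by the patterns of the universe.")
with timed("svc"):
    converted = fake_svc(audio)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f} s")
```

Wrapping each sub-step of the SVC inference (feature extraction, pitch extraction, synthesis) in its own `timed(...)` block would show which one dominates before trying to optimize anything.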

Now I'm studying the inference code of your model and so far I have the following ideas:

Is it possible to skip or pre-generate any of the feature vectors to reduce the inference time of the other models (Whisper, HuBERT, pitch extraction, and so on) and thus the overall SVC inference time?
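One way to explore the pre-generation idea, sketched under assumptions: cache each extractor's output on disk, keyed by a hash of the input audio, so that an input seen before skips extraction. The helper `cached_features`, the cache location, and `dummy_extractor` are all hypothetical and not part of this repository; a real Whisper, HuBERT, or pitch extractor would be passed in as `extract_fn`.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def cached_features(audio_bytes, extractor_name, extract_fn, cache_dir):
    """Return extract_fn(audio_bytes), reusing a cached result for identical audio."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(audio_bytes).hexdigest()
    path = cache_dir / f"{extractor_name}_{key}.pkl"
    if path.exists():
        # Cache hit: load the stored features instead of running the extractor.
        with path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: run the (expensive) extractor once and store its output.
    features = extract_fn(audio_bytes)
    with path.open("wb") as f:
        pickle.dump(features, f)
    return features

# Hypothetical extractor that counts how often it actually runs.
calls = {"count": 0}

def dummy_extractor(audio_bytes):
    calls["count"] += 1
    return [b / 255 for b in audio_bytes[:4]]

demo_dir = Path(tempfile.mkdtemp())  # fresh cache directory for the demo
audio = b"\x00\x7f\xff\x10 fake waveform bytes"
first = cached_features(audio, "dummy", dummy_extractor, demo_dir)
second = cached_features(audio, "dummy", dummy_extractor, demo_dir)
print(calls["count"])  # the extractor ran only for the first call
```

A cache like this only helps when the same audio is converted more than once. For a chatbot that synthesizes a new sentence on every turn, the features have to be extracted each time, so the gain would come from elsewhere, for example keeping the models loaded between requests.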

Btw, thanks for your repository: it is the easiest "prepare your data and run" project I have found so far in the deep learning field.

Cheers from Brazil,

ShadowLoveElysia commented 1 month ago

Hey, I understand your thinking, and what you're doing is totally fine. But I have to give you a reality check. Yes, SVC can be used for audio replacement, but it seems like you're over-engineering it. You could just use TTS projects like GPT-Sovits instead of converting an existing TTS.

ShadowLoveElysia commented 1 month ago

If you want to do voice conversion, then this project is definitely fine, but if your goal is just TTS, then GPT-Sovits is sufficient.

KaikeWesleyReis commented 1 month ago

@ShadowLoveElysia

If you want to do voice conversion, then this project is definitely fine, but if your goal is just TTS, then GPT-Sovits is sufficient.

Does GPT-Sovits follow the same idea as the VITS fine-tuning that I have already done? Don't you think I'll run into the same problems I had with VITS fine-tuning?

The target voice is this: https://www.youtube.com/watch?v=YZt6NKrkdzQ&

Given the nature of this voice, do you believe it is possible to fine-tune GPT-Sovits on it?