KevinWang676 / Bark-Voice-Cloning

Bark Voice Cloning and Voice Cloning for Chinese Speech
MIT License

Please share your dataset, and make an entry on Mylo's repo #15

Closed · Subarasheese closed this issue 11 months ago

Subarasheese commented 1 year ago

Greetings,

Mylo, the author of the quantizer repo (https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/), is also collecting the datasets that people used to train Bark HuBERT quantizers for multiple languages, as you can see here:

https://github.com/gitmylo/Voice-cloning-quantizers/

Could you please make a pull request in that repo and add the Chinese model and the dataset you trained it on?

This is important because there are plans to merge all the datasets in the future and train a single multi-language voice cloning model. With Chinese, for example, the multilanguage tokenizer could make an English voice speak Chinese and vice versa, or make a Spanish voice speak Chinese, and so on.
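The merging step could be sketched roughly as below. This assumes each language's dataset ships a simple JSON manifest of `{"wav": ..., "text": ...}` entries named after its language code; the actual layout in the Voice-cloning-quantizers repo may differ, so treat the format as a placeholder.

```python
import json
from pathlib import Path

def merge_manifests(manifest_paths, out_path):
    """Combine per-language quantizer manifests into one multilingual manifest.

    Each input file is assumed (hypothetically) to be a JSON list of entries
    like {"wav": "clips/0001.wav", "text": "..."}; the language tag is taken
    from the file name (e.g. "zh.json" -> "zh").
    """
    merged = []
    for path in manifest_paths:
        lang = Path(path).stem  # "zh.json" -> "zh"
        for entry in json.loads(Path(path).read_text(encoding="utf-8")):
            merged.append(dict(entry, lang=lang))  # tag every clip with its language
    Path(out_path).write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8")
    return merged
```

A single manifest with per-entry language tags would let one training run see all languages at once instead of one quantizer per language.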

KevinWang676 commented 1 year ago

Hi, thanks for your interest. The voice cloning for Chinese speech in this repo is not based on Bark's HuBERT quantizers; it's based on SambertHifigan, developed by Alibaba. I know Mylo's work, which greatly improved the quality of Bark voice cloning. If you have any ideas on training Mylo's model on Chinese datasets, feel free to contact me.

Subarasheese commented 1 year ago

Is the dataset it was trained on available? Is it a model, or does it use one, and if so, how similar is it to Wav2Vec or HuBERT? Can the weights be merged?

Regarding training with Mylo's solution: I had some success training the regular HuBERT model with a lower learning rate and a larger dataset, in another language. But from what I heard from Chinese users, the Chinese TTS itself from Bark is pretty bad. Is that true? Does your solution produce better outputs? If so, you could use it as a starting point to generate a better dataset for Mylo's repo.

KevinWang676 commented 1 year ago

But from what I heard from Chinese users, the Chinese TTS itself from Bark is pretty bad. Is that true?

Yes, and that's why I adopted a different approach from Bark TTS. I believe the voice cloning method developed by Alibaba is based on this pre-trained model. But I don't think they have released the dataset that they trained the model on.

Using SambertHifigan for voice cloning is easy; it's like a finetuning process, so it doesn't take much time. A similar project is VITS-fast-fine-tuning. For now, I don't see a close connection between SambertHifigan and Bark's HuBERT quantizers.
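As a rough illustration of what such a few-shot finetuning process consumes, here is a minimal sketch that pairs a handful of recordings with their transcripts into a manifest. The tab-separated layout is a generic assumption for illustration, not SambertHifigan's actual input format.

```python
from pathlib import Path

def build_finetune_manifest(wav_dir, transcripts, out_path):
    """Pair a few recordings with their transcripts, the typical input for
    few-shot TTS finetuning. `transcripts` maps utterance id -> text.
    NOTE: this generic layout is an assumption, not SambertHifigan's format.
    """
    lines = []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        utt_id = wav.stem
        if utt_id not in transcripts:
            continue  # skip recordings that have no transcript
        lines.append(f"{utt_id}\t{wav}\t{transcripts[utt_id]}")
    Path(out_path).write_text("\n".join(lines), encoding="utf-8")
    return lines
```

The point of the comparison with VITS-fast-fine-tuning is that both only need a small set of (recording, transcript) pairs like this, rather than the large token-level datasets a quantizer needs.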

Subarasheese commented 1 year ago

Since Bark's native TTS capability for Chinese is poor, trying the approach from Mylo's repo directly is indeed a wasted effort.

Does your solution improve Chinese TTS overall for Bark? I don't mean the voice cloning itself; I mean producing more accurate audio outputs.

If so, here is something that could be tried, @KevinWang676: do you think that, if one wrote a custom script for the prepare step from Mylo's repo that uses your solution for Chinese TTS, particularly the "create_wavs" step, we would be able to train a Chinese HuBERT quantizer? Since the wav outputs would be more accurate, I believe we could get better results.
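A minimal sketch of that swapped-in prepare step is below. Both the external Chinese TTS and the audio content are stubbed out (the stub just writes silence so the sketch stays runnable); the real "create_wavs" step in Mylo's repo has its own interface, which this does not reproduce.

```python
import wave
from pathlib import Path

def synthesize_zh(text):
    """Placeholder for an external Chinese TTS (e.g. a SambertHifigan-based
    model); returns one second of 16 kHz 16-bit mono silence so the sketch
    stays runnable. In real use this would return synthesized speech."""
    return b"\x00\x00" * 16000

def create_wavs(prompts, out_dir):
    """Variant of the 'create_wavs' prepare step: generate each training clip
    with the external TTS instead of Bark, and record (wav, text) pairs for
    the later quantizer-training stages."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest = []
    for i, text in enumerate(prompts):
        wav_path = out / f"{i:05d}.wav"
        with wave.open(str(wav_path), "wb") as w:
            w.setnchannels(1)    # mono
            w.setsampwidth(2)    # 16-bit PCM
            w.setframerate(16000)
            w.writeframes(synthesize_zh(text))
        manifest.append((str(wav_path), text))
    return manifest
```

With `synthesize_zh` replaced by the stronger Chinese TTS, the generated clips would be cleaner than Bark's own Chinese output, which is the premise of the proposal above.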

KevinWang676 commented 1 year ago

Does your solution improve Chinese TTS overall for Bark?

Yes, the new solution is more stable and sounds more native.

do you think that, if one wrote a custom script for the prepare step from Mylo's repo that uses your solution for Chinese TTS

Maybe, yes; I think that step is just a Chinese TTS process.

would we be able to train a Chinese HuBERT quantizer

However, Bark's native TTS capability for Chinese is poor, so we still wouldn't want to use Bark even if we had a Chinese HuBERT quantizer. Is there any other way to get around Bark? Thanks.