PlayVoice / vits_chinese

Best-practice TTS based on BERT and VITS, with some NaturalSpeech features from Microsoft; supports ONNX streaming output!
https://huggingface.co/spaces/maxmax20160403/vits_chinese
MIT License
1.16k stars, 168 forks

Hoping for your result #1

Closed: lucasjinreal closed this issue 1 year ago

lucasjinreal commented 3 years ago

Hoping to see your results from training VITS on Chinese.

MaxMax2016 commented 3 years ago

I've updated it.

lucasjinreal commented 3 years ago

Thanks, I'll try training it. Have you compared the speed of VITS and Tacotron 2? Which do you think is better in terms of speed and quality?

MaxMax2016 commented 3 years ago

Of course, VITS is better. It is amazing.

lucasjinreal commented 3 years ago

How about inference speed?

lucasjinreal commented 3 years ago

Do you think it is worth deploying (is that reasonable)? If so, I can help deploy it to TensorRT and make a C++ inference demo. TVM is also an option if the speed is good.

MaxMax2016 commented 3 years ago

vits_样本.wav is about 100 seconds long, and inference takes about 800 ms on a 1080 GPU. If you need it faster, you can change the decoder from HiFi-GAN to MB-MelGAN.

MaxMax2016 commented 3 years ago

The techniques in VITS: VAE, normalizing flow, GAN, MAS (monotonic alignment search), multi-task training, adaptability to long sentences, etc. I think it is a general framework for TTS going forward.

lucasjinreal commented 3 years ago

@dtx525942103 800 ms for 100 s means 20-30 s of audio needs only about 200 ms, which is tolerable. With TensorRT acceleration it could be about 3x faster on average. It seems it could even run on low-end compute devices such as a Raspberry Pi.
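The arithmetic behind that estimate can be sketched as below. It assumes inference time scales linearly with audio length, which the thread does not confirm; the function names and the `speedup` parameter are hypothetical, and the 3x figure is the commenter's guess, not a measurement.

```python
# Figures quoted in the thread: about 100 s of audio in about 800 ms on a 1080 GPU.
AUDIO_SECONDS = 100.0
INFERENCE_MS = 800.0


def real_time_factor(inference_ms, audio_seconds):
    """Compute time divided by audio duration (lower means faster)."""
    return (inference_ms / 1000.0) / audio_seconds


def estimated_latency_ms(audio_seconds, speedup=1.0):
    """Linear extrapolation from the quoted figures.

    `speedup` is a hypothetical acceleration factor (e.g. 3.0 for the
    TensorRT guess in the thread); it is an assumption, not a benchmark.
    """
    ms_per_audio_second = INFERENCE_MS / AUDIO_SECONDS
    return audio_seconds * ms_per_audio_second / speedup


if __name__ == "__main__":
    print(f"RTF: {real_time_factor(INFERENCE_MS, AUDIO_SECONDS):.3f}")
    for seconds in (20, 25, 30):
        print(f"{seconds} s of audio -> {estimated_latency_ms(seconds):.0f} ms")
```

Under these assumptions, 20-30 s of audio lands between 160 ms and 240 ms, which is where the "about 200 ms" figure comes from.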

MaxMax2016 commented 3 years ago

@jinfagang Is your WAV 16 kHz? train.log
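A quick way to answer that question is to read the sample rate from the WAV header. This is a minimal sketch using Python's standard `wave` module; it assumes "16K" means a 16 kHz sample rate and that the files are uncompressed PCM WAV. The helper names and the default `expected_hz` value are hypothetical, so check them against your own training config.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def matches_expected_rate(path, expected_hz=16000):
    """True if the file's sample rate equals `expected_hz` (assumed 16 kHz)."""
    return wav_sample_rate(path) == expected_hz
```

For example, `matches_expected_rate("sample.wav")` (a placeholder path) returns `False` for a 22.05 kHz recording, which would need resampling before training with a 16 kHz config.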

MaxMax2016 commented 3 years ago

You can train on LJSpeech with the official VITS first.

HallidayReadyOne commented 3 years ago

OK, I will try. Thank you.

HallidayReadyOne commented 3 years ago

I think I have solved the problem, which was caused by the compilation of monotonic align. Thanks for the help.
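For anyone hitting the same issue, one way to check whether a compiled extension actually built is to try importing it before starting training. This is a minimal sketch; the module path `monotonic_align.core` is an assumption about where the compiled extension lives and may differ in your checkout, and `extension_importable` is a hypothetical helper name.

```python
import importlib


def extension_importable(module_name):
    """Return True if `module_name` imports cleanly, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Hypothetical module path; adjust it to match your checkout.
MODULE = "monotonic_align.core"

if __name__ == "__main__":
    if extension_importable(MODULE):
        print(f"{MODULE} is importable")
    else:
        print(f"{MODULE} failed to import; the extension may need to be rebuilt")
```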

MaxMax2016 commented 3 years ago

Mm-hm.