Closed lucasjinreal closed 1 year ago
I updated it.
Thanks, I'll try training it. Have you compared the speed of VITS and Tacotron 2? Which do you think is better in terms of speed and quality?
Of course VITS is better. It is amazing.
How about inference speed?
Do you think it is worth deploying? If so, I can help port it to TensorRT and build a C++ inference demo; TVM is also an option if the speed holds up.
The vits_样本.wav sample is about 100 seconds long, and inference takes about 800 ms on a 1080 GPU. If you need it faster, you can swap the decoder from HiFi-GAN to multi-band MelGAN.
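Timings like the 800 ms figure are easy to get wrong without warmup and (on GPU) synchronization. A minimal measurement sketch; `fake_tts` is a hypothetical stand-in, substitute your model's inference call:

```python
import time

def measure_latency(synthesize, text, n_warmup=2, n_runs=5):
    """Average wall-clock inference time in seconds.
    Note: on GPU, the synthesize callable should synchronize
    (e.g. torch.cuda.synchronize()) before returning, otherwise
    asynchronous kernel launches make timings look too fast."""
    for _ in range(n_warmup):          # warm up caches / JIT / cudnn autotune
        synthesize(text)
    start = time.perf_counter()
    for _ in range(n_runs):
        synthesize(text)
    return (time.perf_counter() - start) / n_runs

# Hypothetical stand-in for a real TTS inference call.
fake_tts = lambda text: [0.0] * len(text)
print(f"avg latency: {measure_latency(fake_tts, 'hello') * 1000:.2f} ms")
```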
The techniques in VITS (VAE, normalizing flows, GAN, MAS, multi-task training, robustness to long sentences, etc.) make it, I think, a general TTS framework for the future.
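Of the components listed, MAS (monotonic alignment search) is the least standard: it finds the monotonic text-to-frame alignment that maximizes total likelihood via dynamic programming. A simplified pure-Python sketch (the official repo uses a batched Cython version; this is an illustration, not that implementation):

```python
def monotonic_alignment_search(log_p):
    """log_p: [T_text][T_mel] log-likelihood grid.
    Returns the monotonic path (i, j) pairs that maximizes the total
    score, where each mel frame j is assigned to one text token i and
    i advances by 0 or 1 per frame."""
    T_text, T_mel = len(log_p), len(log_p[0])
    NEG = float("-inf")
    # Q[i][j] = best total score of a path ending at text i, mel j.
    Q = [[NEG] * T_mel for _ in range(T_text)]
    Q[0][0] = log_p[0][0]
    for j in range(1, T_mel):
        for i in range(T_text):
            stay = Q[i][j - 1]                      # keep the same token
            move = Q[i - 1][j - 1] if i > 0 else NEG  # advance one token
            best = max(stay, move)
            if best > NEG:
                Q[i][j] = best + log_p[i][j]
    # Backtrack from the final cell (last token, last frame).
    path, i = [], T_text - 1
    for j in range(T_mel - 1, -1, -1):
        path.append((i, j))
        if i > 0 and j > 0 and Q[i - 1][j - 1] >= Q[i][j - 1]:
            i -= 1
    path.reverse()
    return path

# Tiny example: 2 text tokens, 3 mel frames.
print(monotonic_alignment_search([[0.0, -1.0, -5.0],
                                  [-5.0, 0.0, 0.0]]))
```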
@dtx525942103 800 ms for 100 s means 20-30 s of audio needs only about 200 ms, which is tolerable. With TensorRT acceleration it could be roughly 3x faster on average; it might even run on low-end devices such as a Raspberry Pi.
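The scaling argument above is just a constant real-time factor (RTF): processing time divided by audio duration, assumed roughly independent of utterance length. A quick sketch of the arithmetic:

```python
def rtf(infer_seconds, audio_seconds):
    """Real-time factor: processing time / audio duration.
    Values below 1.0 mean faster than real time."""
    return infer_seconds / audio_seconds

# 800 ms to synthesize 100 s of audio.
print(f"RTF = {rtf(0.8, 100.0):.3f}")           # RTF = 0.008
# At that RTF, a 25 s utterance takes about 0.2 s.
print(f"25 s clip: {rtf(0.8, 100.0) * 25:.2f} s")
```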
@jinfagang Is your wav 16 kHz? train.log
You can train on LJSpeech with the official VITS first.
OK, I will try. Thank you.
I think I have solved the problem; it was caused by the compilation of the monotonic_align extension. Thanks for the help.
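For reference, the `monotonic_align` module in the official VITS repo is a Cython extension that must be built in place before training, as described in that repo's README:

```shell
cd monotonic_align
python setup.py build_ext --inplace
```

If this step is skipped or fails, imports of `monotonic_align` break at training time.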
Yep.
Hoping to see your results from training VITS on Chinese.