Closed DonkeyHang closed 2 months ago
We keep exploring it, and we have set real-time streaming inference as one of our final objectives. Theoretically it is possible.
That's cool!
I have some questions about the basic components of voice conversion (VC) and I am seeking your opinions and suggestions.
From my current understanding, most streaming-mode VC implementations pre-process a reference timbre, such as Xmel, to obtain a speaker embedding g. The real-time input signal y is first passed through a module like HuBERT to obtain content features c. Then the usual VITS modules (the Encoder and Flow), conditioned on the previously obtained reference embedding g, produce a latent variable z, and the output Yhat is finally generated by the Decoder. This workflow is fine for offline processing, but in real time, modules like HuBERT depend on a minimum input length: inputs that are too short (or even somewhat short) yield very poor recognition. Moreover, in streaming mode the overlapping process introduces "repeated sounds"; I suspect this is caused by overlaps and omissions when the ASR results are stitched together. Do you have any good solutions for handling such issues?
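One way to address both problems at once is to feed each chunk with some already-seen left context (so HuBERT-like models get enough input) and then discard the context-aligned portion of the output instead of emitting it twice. A minimal sketch, assuming raw NumPy arrays and illustrative names (`stream_chunks`, `chunk`, `context` are not from any real library):

```python
import numpy as np

def stream_chunks(signal, chunk, context):
    """Yield (padded_chunk, n_new) pairs. Each chunk is prepended with up to
    `context` samples of audio the model has already seen, so short chunks
    still give the model enough input; `n_new` tells the caller how many
    trailing samples of the corresponding output are actually new (the
    context-aligned part should be dropped, which avoids repeated sounds)."""
    for start in range(0, len(signal), chunk):
        left = max(0, start - context)
        padded = signal[left:start + chunk]
        yield padded, min(chunk, len(signal) - start)

# Reconstructing the input from only the "new" tail of each padded chunk
# recovers the signal exactly once, with no repeats:
sig = np.arange(10.0)
out = np.concatenate([p[-n:] for p, n in stream_chunks(sig, chunk=4, context=2)])
```

This only demonstrates the bookkeeping on raw samples; with a real HuBERT front end you would discard context frames in the feature domain, accounting for its downsampling factor.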
Some of the repetitions can be mitigated by crossfading between adjacent chunks; this works for most LTI models (purely convolution-based). For transformer models like HuBERT, I have no idea whether the same method works.
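For reference, the crossfading idea mentioned above can be sketched as a linear fade over the overlapping region of two adjacent output chunks. This is a minimal illustration assuming NumPy arrays; the function name is illustrative:

```python
import numpy as np

def crossfade(prev_tail, cur_head):
    """Blend the overlapping region of two adjacent chunks with a linear
    crossfade: the previous chunk fades out while the current fades in."""
    n = len(prev_tail)
    fade = np.linspace(0.0, 1.0, n)  # 0 -> 1 across the overlap
    return prev_tail * (1.0 - fade) + cur_head * fade
```

At the boundaries the result equals the previous chunk's first overlap sample and the current chunk's last one, so the join is continuous; an equal-power (cosine) fade is a common alternative when the two chunks are uncorrelated.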
OK, thanks bro. I tried the crossfading approach with a half-window overlap, but the results are bad with HuBERT and with the VITS TextEncoder, emmmmm.
Hi there, the offline results are good, so is it possible to achieve this in real-time streaming mode?