Hiroshiba / realtime-yukarin

An application for real-time voice conversion
MIT License

Questions and documentation #7

Open BradKML opened 3 years ago

BradKML commented 3 years ago
  1. Which model does Yukarin use for its training?
  2. Are there any documented specifications for the target-voice training data?
  3. Would public voice datasets help with training?
  4. Does this project work with English datasets?
  5. Why is the example page's voice so "robotic"/"compressed"?
BradKML commented 3 years ago

k2kobayashi's toolkit seemed the most similar.

Other Repos

BradKML commented 3 years ago

And regarding speech upsampling or speech super-resolution:

SinisterSpatula commented 3 years ago
  1. Which model does Yukarin use for its training?

I'm not sure, but I'm guessing it's a GAN: there's a generator and a discriminator trained adversarially. I'm new to this stuff, though, and just playing with it as a hobby and learning.
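To illustrate what that generator/discriminator setup looks like, here is a toy PyTorch sketch. It is purely illustrative and not Yukarin's actual architecture; the feature dimension and network shapes are made up.

```python
# Toy GAN training loop (NOT Yukarin's actual model), showing the
# generator/discriminator adversarial setup described above.
import torch
import torch.nn as nn

FEAT_DIM = 80  # hypothetical acoustic-feature dimension

generator = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(),
                          nn.Linear(256, FEAT_DIM))
discriminator = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(),
                              nn.Linear(256, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(source_feats, target_feats):
    batch = source_feats.size(0)
    fake = generator(source_feats)

    # Discriminator step: score real target features as 1, fakes as 0.
    d_loss = (bce(discriminator(target_feats), torch.ones(batch, 1))
              + bce(discriminator(fake.detach()), torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step (the adversary): try to make fakes score as 1.
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```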

  2. Are there any documented specifications for the target-voice training data?

It's working for me with 24,000 Hz, 16-bit WAVs made in Audacity. The audio pairs should each be around 15 seconds or less (it seems okay to go slightly over that, as long as your system has enough RAM).
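As a sketch of that preparation step, converting a clip to that format could look like the following. librosa and soundfile are my own tool choices, and the paths are hypothetical; nothing here is part of realtime-yukarin itself.

```python
# Sketch: resample an arbitrary clip to a 24 kHz / 16-bit mono WAV,
# the format reported to work above.
import librosa
import soundfile as sf

TARGET_SR = 24000   # 24,000 Hz
MAX_SECONDS = 15    # pairs reportedly behave best at ~15 s or less

def prepare_clip(in_path: str, out_path: str) -> None:
    audio, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)  # resample
    duration = len(audio) / TARGET_SR
    if duration > MAX_SECONDS:
        print(f"warning: {in_path} is {duration:.1f}s; consider splitting it")
    sf.write(out_path, audio, TARGET_SR, subtype="PCM_16")  # 16-bit PCM WAV

prepare_clip("raw/take_001.wav", "dataset/own_voice/take_001.wav")
```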

  3. Would public voice datasets help with training?

You could use those if you like. I tried the JVS corpus (I think that's what it's called) and it worked well; I just removed any very short clips. Eventually I switched to using audiobooks, and used Audacity to label segments with a minimum length of 6 seconds, since short clips can cause the process to crash. You just need to build a parallel dataset of audio from your own voice and the target voice (see the sketch below).
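Here is a minimal sketch of that filtering step, assuming a hypothetical folder layout where paired clips share a filename across own_voice/ and target_voice/:

```python
# Sketch: keep only pairs where both clips exist and each is at least
# ~6 s long, per the crash-avoidance tip above.
from pathlib import Path
import soundfile as sf

MIN_SECONDS = 6.0
own_dir = Path("dataset/own_voice")
target_dir = Path("dataset/target_voice")

def long_enough(path: Path) -> bool:
    info = sf.info(str(path))                 # reads the header only
    return info.frames / info.samplerate >= MIN_SECONDS

kept = 0
for own_clip in sorted(own_dir.glob("*.wav")):
    target_clip = target_dir / own_clip.name  # parallel data: same filename
    if not target_clip.exists():
        continue                              # unpaired clip, skip it
    if long_enough(own_clip) and long_enough(target_clip):
        kept += 1
    else:
        print(f"skipping short pair: {own_clip.name}")
print(f"{kept} usable pairs")
```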

  4. Does this project work with English datasets?

Yes, I can confirm it does. If you want to hear a sample, I'll be sharing my English results in the Yukarin Discord. I had decent results with 212 audio pairs (some phonemes were silent or missing, and the audio was more wobbly), and very good, noticeably better results with 512. I might try 1,000 in the future.

  5. Why is the example page's voice so "robotic"/"compressed"?

It might be because the page only shows output from stage-1 training; I'm unsure. To me, the second stage of training (pix2pix, I think, where it generates higher-quality sound by turning the audio into a picture) seems to really bring the quality and naturalness back. I learned not to judge it too much on the stage-1 quality; wait for the second stage to truly appreciate what it can do. It's very impressive, IMO.

I haven't tried the real-time conversion yet, but I'm going to soon. It could be that the real-time conversion lowers quality to speed up processing. I'm hoping I can achieve the quality I've seen in my test output WAVs without too much delay, but I'll find out soon.
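That "turning the audio into a picture" step sounds like a spectrogram being fed to an image-to-image model. As a rough illustration of the idea only (the parameter values are my guesses, not the project's actual settings):

```python
# Sketch: a mel spectrogram is a 2-D array, i.e. the kind of "picture"
# that image-to-image models like pix2pix can operate on.
import librosa
import numpy as np

audio, sr = librosa.load("dataset/own_voice/take_001.wav", sr=24000)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, image-like values

# Shape (80, n_frames): frequency on one axis, time on the other, which is
# exactly what an image-translation model can enhance.
print(mel_db.shape)
```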

Those repositories you linked are all very cool and interesting; however, this was the only series of projects that seemed to offer real-time conversion. Does anyone know if it's possible to adapt any of those other projects to run in real time? Or did I miss one that actually does offer real-time conversion?