bshall / knn-vc

Voice Conversion With Just Nearest Neighbors
https://bshall.github.io/knn-vc/

Conversion output has very strong similarity to source audio. #18

Closed Sassun closed 1 year ago

Sassun commented 1 year ago

I have experimented quite a lot and I have to acknowledge that the output audio is very similar in tone to the source audio.

What strategies or changes would you suggest to increase the similarity to the references?

bshall commented 1 year ago

Hi @Sassun, it's not entirely clear from your description what the exact problem you're experiencing is. Could you give us some reproducible examples so we can understand the issue better?

Sassun commented 1 year ago

My question is basically: how can I improve the conversion quality so that the output (converted file) sounds more similar to the target voice? I am specifically testing with target audio files where the speaker has an accent (e.g. a German accent when speaking English).

What's the best way to provide you with some concrete examples privately?

bshall commented 1 year ago

Unfortunately, I think it will be difficult to get kNN-VC to convert between accents. Disentangling accent and content is still an unsolved problem in text-free voice conversion, so the converted speech generally retains the accent of the source speech. I think current approaches to accent conversion either rely on accent labelled speech or text transcriptions for training. I'm not too familiar with the literature but it might be possible to incorporate some of these ideas into the kNN-VC pipeline.

If you want to share some results with us I think the best way is through google drive (or some other shareable cloud drive). You can invite me through my email address: benjamin.l.van.niekerk@gmail.com

chiaki-luo commented 4 months ago

> Unfortunately, I think it will be difficult to get kNN-VC to convert between accents. Disentangling accent and content is still an unsolved problem in text-free voice conversion, so the converted speech generally retains the accent of the source speech. I think current approaches to accent conversion either rely on accent labelled speech or text transcriptions for training. I'm not too familiar with the literature but it might be possible to incorporate some of these ideas into the kNN-VC pipeline.
>
> If you want to share some results with us I think the best way is through google drive (or some other shareable cloud drive). You can invite me through my email address: benjamin.l.van.niekerk@gmail.com

Dear author, you mentioned that you replace each query frame with the average of its k-nearest neighbors in the matching set. Intuitively, the converted speech should then match the accent and intonation of the matching set. However, in my tests the converted speech still retains the accent and intonation of the query set, and only the timbre changes. This is quite puzzling to me.
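
For reference, here is a minimal sketch of the frame-level kNN matching step described above, in plain PyTorch. The function name and the cosine-similarity choice are illustrative assumptions, not the repository's exact implementation. Note that the output frames are produced one per query frame and in query order, so the timing, intonation, and pronunciation patterns of the source sequence (and hence its accent) largely survive the matching; only the frame contents are swapped for the target speaker's.

```python
import torch


def knn_match(query: torch.Tensor, matching_set: torch.Tensor, topk: int = 4) -> torch.Tensor:
    """Replace each query frame with the mean of its k nearest neighbours.

    query:        (T_q, D) feature frames of the source utterance.
    matching_set: (T_m, D) feature frames pooled from the target speaker.
    Returns:      (T_q, D) converted frames.

    Illustrative sketch only: cosine similarity is assumed here, and the
    real repository may differ in metric and implementation details.
    """
    # Normalise so a dot product equals cosine similarity.
    q = torch.nn.functional.normalize(query, dim=-1)
    m = torch.nn.functional.normalize(matching_set, dim=-1)

    # Similarity of every query frame to every matching-set frame: (T_q, T_m).
    sims = q @ m.T

    # Indices of the k most similar matching-set frames per query frame.
    _, idx = sims.topk(k=topk, dim=-1)  # (T_q, k)

    # Average the selected target frames. The output keeps the query's
    # frame order and length, which is why prosody follows the source.
    return matching_set[idx].mean(dim=1)  # (T_q, D)
```

In kNN-VC the frames are self-supervised speech features and the averaged sequence is passed to a vocoder, so any accent information already encoded in the query's frame sequence is carried through to the waveform.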