bshall / soft-vc

Soft speech units for voice conversion
https://bshall.github.io/soft-vc/
MIT License
398 stars 33 forks

Difference between SSL and PPG-based methods? #6

Open Kristopher-Chen opened 2 years ago

Kristopher-Chen commented 2 years ago

Hi, I really appreciate your work; the demo sounds great. I have also read papers on PPG-based VC, which uses an ASR model to extract phonetic posteriorgrams (PPGs). What is the difference between self-supervised learning (SSL) and PPG-based methods? Both seem to extract linguistic information. Have you ever compared them? Thank you!

bshall commented 1 year ago

Hi @Kristopher-Chen, thanks for the feedback!

There are some definite similarities between PPGs and the Soft Speech Units we proposed. The main difference is that soft units don't require text transcriptions to train. This can be useful for training VC systems in languages without large corpora of annotated speech. Additionally, non-linguistic sounds such as laughter and breathing may be captured better by soft units than by PPGs. Unfortunately, I haven't compared the approaches directly yet. I think it would be a useful benchmark, but I haven't had the chance to look into it.
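
To make the training-label difference concrete, here is a toy sketch, not the soft-vc implementation. It assumes both representations can be viewed as a per-frame probability distribution over a set of classes, and that the classes for soft units come from clustering unlabeled acoustic features while the classes for PPGs are phonemes obtained from transcribed, aligned speech. All feature values, centroids, phoneme ids, and function names below are hypothetical.

```python
import math


def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nearest_centroid(frame, centroids):
    """Return the index of the closest centroid (squared Euclidean distance)."""
    dists = [
        sum((f - c) ** 2 for f, c in zip(frame, centroid))
        for centroid in centroids
    ]
    return dists.index(min(dists))


# Hypothetical 2-D features for three audio frames.
frames = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]]

# Soft-unit-style targets: cluster ids derived from the audio alone.
# No transcript is involved at any point (assumed clustering step).
centroids = [[0.0, 0.0], [1.0, 1.0]]
unit_targets = [nearest_centroid(f, centroids) for f in frames]

# PPG-style targets: phoneme ids per frame. Producing these requires
# transcribed speech plus an alignment, which is the annotation cost
# mentioned above. The ids here are made up for illustration.
phoneme_targets = [0, 1, 1]

# In both cases a model is trained to output a per-frame distribution
# over the classes; only the source of the targets differs.
example_logits = [2.0, 0.5]  # hypothetical model output for one frame
posterior = softmax(example_logits)
```

In this sketch the two target lists happen to coincide, but only `phoneme_targets` depends on annotated data. Sounds with no phoneme label, such as laughter, can still fall into their own cluster in the unsupervised case.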