bshall / acoustic-model

Acoustic models for: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
https://bshall.github.io/soft-vc/
MIT License

MultiSpeaker setup #6

Open rishikksh20 opened 2 years ago

rishikksh20 commented 2 years ago

Have you tried this in a multi-speaker setup?

rishikksh20 commented 2 years ago

@bshall Can we also replace the Encoder and Decoder with Transformers?

bshall commented 2 years ago

Hi @rishikksh20, sorry about the delay on this. I only noticed this issue now.

I have tried a multi-speaker setup (about 10 speakers) using one-hot codes for each speaker. It works pretty well, but I think there is a small degradation compared to the single-speaker model. In my experience, fine-tuning the acoustic model on a small amount of target data seems to work better. I haven't experimented with using speaker embeddings for a zero-shot model though, so I can't comment on how well it performs in that setting.
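
For readers unfamiliar with one-hot speaker conditioning, here is a minimal sketch of the general idea. It is not the code used in the experiments above: the helper names `one_hot` and `condition_on_speaker`, the toy 4-dimensional units, and the choice to append the code to every frame are all assumptions made for illustration. A real model would do the same thing with tensors.

```python
from typing import List


def one_hot(speaker_id: int, num_speakers: int) -> List[float]:
    """Return a one-hot speaker code of length num_speakers."""
    if not 0 <= speaker_id < num_speakers:
        raise ValueError("speaker_id out of range")
    return [1.0 if i == speaker_id else 0.0 for i in range(num_speakers)]


def condition_on_speaker(
    units: List[List[float]], speaker_id: int, num_speakers: int
) -> List[List[float]]:
    """Append the same one-hot speaker code to every frame of soft units."""
    code = one_hot(speaker_id, num_speakers)
    return [frame + code for frame in units]


# Toy input: 3 frames of 4-dimensional "soft units" (hypothetical sizes).
units = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
conditioned = condition_on_speaker(units, speaker_id=2, num_speakers=10)
```

Each frame grows by `num_speakers` values, so the first layer of the acoustic model would need a correspondingly wider input.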

I'd imagine that using Transformers would be fine. I don't think such heavy machinery is required though. I have done some experiments training the Hifi-GAN directly on the soft units (augmented with the pitch contours) and this seems to work well. It also simplifies the pipeline since it makes the acoustic model unnecessary.
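
The thread does not say how the pitch contours were combined with the soft units. As one possible reading only, the sketch below concatenates a log-F0 value and a voiced flag onto each frame. The function names, the use of log-F0, the voiced flag, and the assumption that the F0 contour is already at the same frame rate as the units are all illustrative choices, not details confirmed here.

```python
import math
from typing import List


def log_f0_features(f0_hz: List[float]) -> List[List[float]]:
    """Map F0 in Hz to [log-F0, voiced flag]; unvoiced frames (F0 <= 0) become [0.0, 0.0]."""
    return [[math.log(f), 1.0] if f > 0 else [0.0, 0.0] for f in f0_hz]


def augment_with_pitch(
    units: List[List[float]], f0_hz: List[float]
) -> List[List[float]]:
    """Concatenate per-frame pitch features onto each frame of soft units."""
    if len(units) != len(f0_hz):
        raise ValueError("units and F0 contour must have the same number of frames")
    return [frame + pitch for frame, pitch in zip(units, log_f0_features(f0_hz))]


# Toy input: 3 frames of 4-dimensional units and a matching F0 contour.
units = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
f0_hz = [220.0, 0.0, 110.0]  # the middle frame is unvoiced
augmented = augment_with_pitch(units, f0_hz)
```

With this layout the vocoder input would be two channels wider than the soft units alone. Adding a learned pitch embedding to the units instead of concatenating is an equally plausible alternative.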

juliakorovsky commented 1 year ago

@bshall Could you please tell us more about the HuBERT-to-HifiGAN experiments? Which HifiGAN parameters should be changed? Did you use 256 dimensions, as in HuBERT, or did you retrain HuBERT with 128 dimensions? How did you augment the soft units with pitch contours: somewhere in the DataLoader, or in the Generator or Discriminator, with the pitch passed through nn.Embedding? Did you concatenate the pitch contours with the soft units or add them?

seastar105 commented 1 year ago

@rishikksh20 Hi, did you try soft units in a multi-speaker setup for any-to-many voice conversion? If so, did you succeed? I'm currently trying a multi-speaker setup using just one-hot codes, but I'm seeing speaker identity degradation, even though the resulting speech is quite audible.

rishikksh20 commented 1 year ago

Yes, I see the same thing in my training.

MuruganR96 commented 1 year ago

@rishikksh20 @seastar105 Have you tried VITS/YourTTS as the acoustic model + vocoder in the multi-speaker setting?