rishikksh20 opened this issue 2 years ago
@bshall Can we also replace the Encoder and Decoder with Transformers?
Hi @rishikksh20, sorry about the delay on this. I only noticed this issue now.
I have tried a multi-speaker setup (about 10 speakers) using one-hot codes for each speaker. It works pretty well but I think there is a small degradation compared to the single speaker model. In my experience fine-tuning the acoustic model on a small amount of target data seems to work better. I haven't experimented with using speaker embeddings for a zero-shot model though so can't comment on how well it performs in that setting.
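A one-hot multi-speaker setup like the one described is commonly implemented by concatenating a one-hot speaker code to every frame of the soft units before they enter the acoustic model. A minimal NumPy sketch, with made-up dimensions and not the exact configuration used above:

```python
import numpy as np

def add_speaker_code(units, speaker_id, n_speakers):
    """Concatenate a one-hot speaker code to each frame of soft units.

    units: (T, D) array of soft HuBERT units.
    Returns a (T, D + n_speakers) array to feed the acoustic model.
    """
    one_hot = np.zeros(n_speakers, dtype=units.dtype)
    one_hot[speaker_id] = 1.0
    tiled = np.tile(one_hot, (units.shape[0], 1))  # repeat the code for every frame
    return np.concatenate([units, tiled], axis=1)

units = np.random.randn(100, 256).astype(np.float32)  # 100 frames of 256-dim soft units
conditioned = add_speaker_code(units, speaker_id=3, n_speakers=10)
print(conditioned.shape)  # (100, 266)
```

At conversion time the same function is called with the target speaker's id, which is what makes the model any-to-many.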
I'd imagine that using Transformers would be fine, but I don't think such heavy machinery is required. I have done some experiments training HiFi-GAN directly on the soft units (augmented with the pitch contours) and this seems to work well. It also simplifies the pipeline, since it makes the acoustic model unnecessary.
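Training the vocoder directly on the units implies widening HiFi-GAN's input from the usual 80 mel channels to the unit dimension plus the pitch features. A rough sketch of building such an input; the log-F0/voicing-flag encoding and the dimensions are assumptions for illustration, not details confirmed above:

```python
import numpy as np

def vocoder_input(units, f0):
    """Augment soft units with a pitch contour for direct vocoder training.

    units: (T, 256) soft HuBERT units; f0: (T,) F0 in Hz, 0 where unvoiced.
    Returns (T, 258): units + log-F0 + voicing flag. HiFi-GAN's first conv
    would then take 258 input channels instead of 80 mel bins.
    """
    voiced = (f0 > 0).astype(units.dtype)
    log_f0 = np.where(f0 > 0, np.log(np.maximum(f0, 1e-5)), 0.0)  # 0 for unvoiced frames
    return np.concatenate([units, log_f0[:, None], voiced[:, None]], axis=1)

units = np.random.randn(50, 256).astype(np.float32)
f0 = np.random.uniform(80, 300, size=50).astype(np.float32)
f0[::5] = 0.0  # mark some frames as unvoiced
x = vocoder_input(units, f0)
print(x.shape)  # (50, 258)
```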
@bshall Could you please tell us more about the HuBERT-to-HiFi-GAN experiments? Which HiFi-GAN parameters need to be changed? Did you keep the 256-dimensional units from HuBERT, or did you retrain HuBERT with 128 dimensions? Where did you augment the soft units with the pitch contours: in the DataLoader, or inside the Generator or Discriminator, with the pitch passed through an nn.Embedding? And did you concatenate the pitch contours with the soft units, or add them?
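For concreteness, one of the options asked about (quantized pitch passed through an embedding table, then concatenated with the units) can be sketched as follows. An nn.Embedding is just a learnable lookup table, emulated here in NumPy with fixed weights; the bin count, embedding size, and F0 range are arbitrary choices, not anything confirmed in this thread:

```python
import numpy as np

n_bins, emb_dim = 64, 32
rng = np.random.default_rng(0)
pitch_table = rng.standard_normal((n_bins, emb_dim)).astype(np.float32)  # ~ nn.Embedding(64, 32).weight

def embed_pitch(units, f0, f0_min=60.0, f0_max=400.0):
    """Quantize F0 into bins, look up embeddings, concatenate with units.

    units: (T, 256); f0: (T,) in Hz. Returns (T, 256 + emb_dim).
    Adding instead of concatenating would require emb_dim == units.shape[1].
    """
    bins = np.clip(((f0 - f0_min) / (f0_max - f0_min) * (n_bins - 1)).astype(int),
                   0, n_bins - 1)
    pitch_emb = pitch_table[bins]  # (T, emb_dim), the embedding lookup
    return np.concatenate([units, pitch_emb], axis=1)

units = np.random.randn(40, 256).astype(np.float32)
f0 = np.random.uniform(80, 300, size=40).astype(np.float32)
out = embed_pitch(units, f0)
print(out.shape)  # (40, 288)
```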
@rishikksh20 Hi, did you try soft units in a multi-speaker setup for any-to-many voice conversion? If so, did it work? I'm trying one-hot codes for a multi-speaker setup now, but I'm suffering from speaker identity degradation, even though the resulting speech is quite intelligible.
Yes, I see the same in my training.
@rishikksh20 @seastar105 Have you tried VITS/YourTTS as the acoustic model + vocoder in the multi-speaker setting?
Have you tried this in a multi-speaker setting?