anonymous-pits / pits

PITS: Variational Pitch Inference for End-to-end Pitch-controllable TTS without External Pitch Predictor
https://anonymous-pits.github.io/pits/
MIT License

voice conversion #25

Open p0p4k opened 1 year ago

p0p4k commented 1 year ago

I tried, but the output is not satisfactory (the voice doesn't change much). Am I doing anything wrong? Thanks.

    def voice_conversion(self, spec, spec_length, ying, ying_length, g_src, g_tgt):
        # encode the spectrogram and the yingram with the source speaker embedding
        z_spec, m_spec, logs_spec, spec_mask = self.net_g.enc_spec(spec, spec_length, g=g_src)
        z_yin, m_yin, logs_yin, yin_mask = self.net_g.enc_pitch(ying, ying_length, g=g_src)
        z_yin_crop, logs_yin_crop, m_yin_crop = self.net_g.crop_scope(
                    [z_yin, logs_yin, m_yin], scope_shift=0)
        z = torch.cat([z_spec, z_yin], dim=1)
        y_mask = spec_mask

        # map the source latent to the prior, then back with the target speaker
        z_p = self.flow(z, y_mask, g=g_src)
        z_hat = self.flow(z_p, y_mask, g=g_tgt, reverse=True)
        z_spec, z_yin = torch.split(z, self.inter_channels - self.yin_channels, dim=1)
        z_yin_crop = self.crop_scope([z_yin], 0)[0]
        z_crop = torch.cat([z_spec, z_yin_crop], dim=1)
        decoder_inputs = z_crop * y_mask

        # decode with the target speaker embedding
        o_hat = self.dec(decoder_inputs, g=g_tgt)

        return o_hat, y_mask, (z, z_p, z_hat)
meriamOu commented 1 year ago

I think you are not feeding the flow output into the decoder. Maybe decoder_inputs = z_hat * y_mask ?
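
For what it's worth, a minimal sketch of how the tail of voice_conversion could look with that change, reusing the attribute names from the snippet above (self.crop_scope, self.inter_channels, self.yin_channels, self.dec follow p0p4k's code, not necessarily the repo's exact API):

        # ... keep everything up to z_hat = self.flow(z_p, y_mask, g=g_tgt, reverse=True)
        # split the converted latent z_hat instead of the source latent z
        z_spec_hat, z_yin_hat = torch.split(
            z_hat, self.inter_channels - self.yin_channels, dim=1)
        z_yin_hat_crop = self.crop_scope([z_yin_hat], 0)[0]
        z_crop = torch.cat([z_spec_hat, z_yin_hat_crop], dim=1)

        # feed the flow output (mapped to the target speaker) to the decoder
        decoder_inputs = z_crop * y_mask
        o_hat = self.dec(decoder_inputs, g=g_tgt)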

anonymous-pits commented 1 year ago

For better output, I recommend three things.

  1. You need to find the text alignment so you can sample the aligned prior, as in the algorithm; you need to provide the text condition to find it.

    (attached image: algorithm excerpt)
  2. You need to match the target speaker's mean pitch. For male-to-female conversion, for example, you need to find an optimal scope-shift s and shift by a larger value instead of zero as in self.crop_scope([z_yin], 0)[0] (see the sketch after this list). That is why the scope-shift s is mentioned in the algorithm.

    (attached image: algorithm excerpt)
  3. You need the iteration in the algorithm. It provides more stable output.

    (attached image: algorithm excerpt)
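
On point 2, one way to pick a non-zero shift is to derive it from the mean F0 of the two speakers. This is only a rough sketch, not the repo's API: estimate_scope_shift and bins_per_semitone are hypothetical names, and the yingram resolution and the sign convention of scope_shift have to be checked against your crop_scope / Yingram setup.

    import math

    def estimate_scope_shift(mean_f0_src, mean_f0_tgt, bins_per_semitone):
        # mean pitch difference between the speakers in semitones (log scale)
        semitones = 12.0 * math.log2(mean_f0_tgt / mean_f0_src)
        # convert semitones to yingram bins and round to an integer shift;
        # flip the sign if the converted pitch moves the wrong way
        return round(semitones * bins_per_semitone)

    # e.g. with mean F0 measured on voiced frames of reference audio:
    # scope_shift = estimate_scope_shift(120.0, 220.0, bins_per_semitone)
    # z_yin_crop = self.crop_scope([z_yin], scope_shift)[0]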