Open p0p4k opened 1 year ago
I think you are not feeding the flow output into the decoder. Maybe it should be `decoder_inputs = z_hat * y_mask`?
For better output, I recommend three things.
1. You need to find the text alignment in order to sample the aligned prior, as in the algorithm. That requires giving the text as a condition.
2. You need to fit the speaker's mean pitch. For male-to-female conversion, for example, you need to find the optimal scope-shift `s` and shift by that value instead of the zero in `self.crop_scope([z_yin], 0)[0]`. That is why the scope-shift `s` is mentioned in the algorithm.
3. You need the iteration in the algorithm. It gives more stable output.
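As a rough illustration of the second point, here is a minimal sketch of how a scope-shift could be estimated from the two speakers' mean pitch. It is not taken from this repository: the helper names, the `bins_per_semitone` parameter, and the assumption that one scope bin corresponds to a fixed fraction of a semitone are all hypothetical, and the sign convention expected by `crop_scope` may be the opposite, so check it against the actual implementation.

```python
import math


def semitone_shift(src_mean_f0_hz, tgt_mean_f0_hz):
    """Pitch interval from source to target speaker, in semitones."""
    if src_mean_f0_hz <= 0 or tgt_mean_f0_hz <= 0:
        raise ValueError("mean F0 values must be positive")
    return 12.0 * math.log2(tgt_mean_f0_hz / src_mean_f0_hz)


def scope_shift_bins(src_mean_f0_hz, tgt_mean_f0_hz,
                     bins_per_semitone=1, max_shift=None):
    """Hypothetical helper: convert the mean-pitch gap into an integer shift.

    Assumes (not confirmed by the source) that the scope axis is
    log-spaced with `bins_per_semitone` bins per semitone.
    """
    shift = round(semitone_shift(src_mean_f0_hz, tgt_mean_f0_hz)
                  * bins_per_semitone)
    if max_shift is not None:
        # Clamp so the shifted crop stays inside the available range.
        shift = max(-max_shift, min(max_shift, shift))
    return shift


# Example: a source speaker around 110 Hz and a target around 220 Hz
# are one octave apart.
s = scope_shift_bins(110.0, 220.0)
```

The resulting `s` would then replace the `0` in the `crop_scope` call, assuming the argument is an integer bin offset.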
I tried, but the output is not satisfactory (the voice doesn't change that much). Am I doing anything wrong? Thanks.