OlaWod / FreeVC

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
MIT License

Why is the speaker embedding g used to condition the Posterior Encoder and the Decoder? #88

Open st-vincent1 opened 10 months ago

st-vincent1 commented 10 months ago

I am confused about why the speaker embedding g is used to condition multiple model components (Posterior Encoder, Decoder, Flow) rather than just the Flow.

From the model diagram in Fig. 1 (a) (Training procedure), the speaker embedding g is used to condition the normalising Flow. This makes sense: at inference time, this information is used in the reversed Flow to map the z' distribution back to a speaker-informed z, which during training was modelled on the real data x_lin by the Posterior Encoder.
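A minimal sketch of the mechanism described above, assuming a single affine coupling layer whose shift and scale depend on the speaker embedding g. This is a toy illustration in plain Python, not FreeVC's actual code: `shift_and_scale`, `flow_forward`, and `flow_reverse` are hypothetical names, and the fixed arithmetic inside `shift_and_scale` stands in for a learned network.

```python
import math

def shift_and_scale(z_a, g):
    # Hypothetical stand-in for the learned coupling network:
    # the shift m and log-scale log_s depend on half of z and on g.
    m = [0.5 * a + 0.1 * g_i for a, g_i in zip(z_a, g)]
    log_s = [math.tanh(0.3 * a - 0.2 * g_i) for a, g_i in zip(z_a, g)]
    return m, log_s

def flow_forward(z, g):
    # Training direction: z (Posterior Encoder output) -> z'.
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    m, log_s = shift_and_scale(z_a, g)
    z_b_out = [m_i + b * math.exp(ls) for b, m_i, ls in zip(z_b, m, log_s)]
    return z_a + z_b_out

def flow_reverse(z_prime, g):
    # Inference direction: z' -> speaker-informed z, given g.
    half = len(z_prime) // 2
    z_a, z_b = z_prime[:half], z_prime[half:]
    m, log_s = shift_and_scale(z_a, g)
    z_b_out = [(b - m_i) * math.exp(-ls) for b, m_i, ls in zip(z_b, m, log_s)]
    return z_a + z_b_out

z = [0.3, -1.2, 0.8, 2.0]   # toy latent
g_src = [1.0, -0.5]         # toy source-speaker embedding
g_tgt = [-0.7, 0.9]         # toy target-speaker embedding

z_prime = flow_forward(z, g_src)
z_same = flow_reverse(z_prime, g_src)   # inverts the forward pass
z_conv = flow_reverse(z_prime, g_tgt)   # a different z for a different speaker
```

Reversing with the same g undoes the forward pass, while reversing with a different g yields a different z. In this toy setting, that is the sense in which the Flow alone already injects speaker information.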

To me this seems like enough supervision, and I am confused why g is used in other places too: