MLSpeech / FormantsTracker


How are the encoder and decoder trained? #1

Open Shuo-H opened 1 year ago

Shuo-H commented 1 year ago

We are running some experiments with this formant estimation model, and we observe that the latent space contains formants other than the first three (see the attached clean_latent_formants image). Are these formants also trained?

yosishrem commented 1 year ago

The encoder weights are trained by back-propagating gradients from the decoders; the decoders' outputs are the only ones connected to the loss function. Since the encoder is based on 2D convolutions, each kernel is invariant to its location within the spectrogram, so while it captures information related to the first 3 formants, it also captures information about any formant in the spectrogram. Overall, the encoder tries to sharpen the spectrogram by marking all the formants, and the decoders, which are 1D convolutions that see the entire spectral range (location-aware), are the ones that output each of the first 3 formants.
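To make the description above concrete, here is a minimal sketch (not the repository's actual code, with hypothetical layer sizes) of that layout: a shared 2D-conv encoder that is location-invariant over the spectrogram, and per-formant 1D-conv decoders that fold the whole frequency axis into channels and are the only outputs tied to the loss, so encoder gradients arrive purely by back-propagation from the decoders.

```python
import torch
import torch.nn as nn

class Encoder2D(nn.Module):
    """Shared encoder: 2D convolutions over (freq, time) sharpen the spectrogram
    and can mark formant-like energy ridges anywhere in the spectral range."""
    def __init__(self, channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, spec):            # spec: (batch, 1, n_freq, n_frames)
        return self.net(spec)           # latent "sharpened" spectrogram, same shape

class FormantDecoder1D(nn.Module):
    """Per-formant decoder: 1D convolutions along time, with the full frequency
    axis treated as input channels, so the decoder is location-aware in frequency."""
    def __init__(self, n_freq, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_freq, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),   # one value per time frame
        )

    def forward(self, latent):          # latent: (batch, 1, n_freq, n_frames)
        x = latent.squeeze(1)           # (batch, n_freq, n_frames): freq -> channels
        return self.net(x).squeeze(1)   # (batch, n_frames) predicted formant track

# Only the decoder outputs enter the loss; the encoder is trained solely through
# gradients back-propagated from the three formant decoders.
n_freq, n_frames = 257, 100
encoder = Encoder2D()
decoders = nn.ModuleList(FormantDecoder1D(n_freq) for _ in range(3))

spec = torch.randn(8, 1, n_freq, n_frames)    # dummy spectrogram batch
targets = torch.randn(8, 3, n_frames)         # dummy F1-F3 ground-truth tracks

latent = encoder(spec)
preds = torch.stack([dec(latent) for dec in decoders], dim=1)
loss = nn.functional.mse_loss(preds, targets)
loss.backward()                               # gradients flow back into the encoder
```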

Shuo-H commented 1 year ago

Hi Yosi,

Appreciate your swift response! The rationale behind using a 2D convolution to ensure the encoder generalizes across varying datasets seems logical. I was intrigued by Figure 2 in your publication: it does an excellent job of illustrating the need for a shared encoder, namely the similarity of formant shapes on the spectrogram across different speakers.

However, I noticed that the Vocal Tract Resonance (VTR) dataset you employed for training only provides ground truth for the first four formants. Given that the shared encoder has only been trained to recognize the distribution of energy at lower frequencies, I'm wondering how it copes with the lack of ground truth for formants at higher frequencies. Would it still be capable of learning their distribution?