Question about the Sequence Discriminator

Hello. I read your paper on speech-driven facial animations, and it's good to see your code that explains the overall architecture of the model.

I have some questions about the sequence discriminator described in your paper. The paper says that the frame at each time step is encoded using a CNN and then fed into a two-layer GRU. Is this CNN identical to the Identity Encoder used in the Generator?

You also mentioned adding the audio as a conditional input to the network. How is this audio encoded, and how is it added to the input?
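
To make the question concrete, here is a minimal PyTorch sketch of how I currently picture the sequence discriminator. Everything beyond "per-frame CNN followed by a two-layer GRU" is my own assumption: the class name, the layer sizes, the small stand-in CNN, the pre-encoded `audio_feats` input, and especially the per-time-step concatenation of audio and frame features are guesses, not something I found in the paper or the code.

```python
import torch
import torch.nn as nn


class SequenceDiscriminatorSketch(nn.Module):
    """Hypothetical layout; sizes are placeholders, not taken from the paper."""

    def __init__(self, frame_feat=64, audio_dim=32, hidden=64):
        super().__init__()
        # Stand-in per-frame CNN (question 1: is this the Identity Encoder?)
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, frame_feat),
        )
        # Two-layer GRU over the per-time-step features
        self.gru = nn.GRU(frame_feat + audio_dim, hidden,
                          num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, frames, audio_feats):
        # frames: (B, T, 3, H, W)
        # audio_feats: (B, T, audio_dim), assumed to be encoded already
        # (question 2: how is the audio actually encoded?)
        b, t = frames.shape[:2]
        flat = frames.reshape(b * t, *frames.shape[2:])
        f = self.frame_cnn(flat).reshape(b, t, -1)
        # Assumed conditioning: concatenate audio and frame features per step
        x = torch.cat([f, audio_feats], dim=-1)
        out, _ = self.gru(x)
        # One real/fake score per sequence, read from the last time step
        return self.classifier(out[:, -1])
```

If the audio is instead injected somewhere else (for example into the GRU's initial state, or through its own recurrent encoder), I would appreciate a pointer to where that happens.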

Thanks

DinoMan / speech-driven-animation
