This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs.
I was wondering if there is a way to train SyncNet with a larger context window, specifically 25 frames and 80 mel steps (80 mel steps correspond to 1 second of audio). It seems major changes to the architecture would be needed. Also, if you look closely, the Wav2Lip generator's speech encoder shares the same architecture as the SyncNet speech encoder, so would the generator need to output 25 frames before they are fed into the lip-sync discriminator?
Any tips on this would be appreciated. I think a larger context window could achieve even better sync.
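For reference, a minimal sketch of how the SyncNet input shapes would change when scaling the window from Wav2Lip's default (5 frames, 16 mel steps) to the proposed 25 frames / 80 mel steps. This assumes the standard Wav2Lip setup: 25 fps video, 96x96 face crops with the lower half fed to SyncNet, T frames concatenated channel-wise (3*T channels), and 80 mel frequency bins; the `syncnet_shapes` helper is hypothetical, not part of the repo.

```python
# Sketch (assumptions noted above): input-shape arithmetic for a
# SyncNet-style model at a larger temporal context window.

FPS = 25                # video frame rate assumed by Wav2Lip
MEL_STEPS_PER_SEC = 80  # 80 mel steps ~= 1 second of audio

def syncnet_shapes(n_frames, img_size=96):
    """Return (face_shape, mel_shape) as (C, H, W) tuples, no batch dim.

    Face crops are concatenated channel-wise (3 * n_frames channels) and
    only the lower half of each crop is used; the mel window spans the
    same time interval as the video frames.
    """
    mel_steps = round(n_frames / FPS * MEL_STEPS_PER_SEC)
    face = (3 * n_frames, img_size // 2, img_size)
    mel = (1, 80, mel_steps)  # 80 mel frequency bins
    return face, mel

# Default Wav2Lip window: 5 frames -> 16 mel steps
print(syncnet_shapes(5))    # ((15, 48, 96), (1, 80, 16))
# Proposed 1-second window: 25 frames -> 80 mel steps
print(syncnet_shapes(25))   # ((75, 48, 96), (1, 80, 80))
```

The jump from a 16-step to an 80-step mel window (and 15 to 75 face channels) is why the conv stacks would need restriding or extra layers: the existing encoders are sized to reduce exactly those default spatial/temporal extents down to a single embedding.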