Closed rizwanishaq closed 3 years ago
During training, the forward() function automatically reshapes the (B, 5, 1, 80, 16) audio input here: https://github.com/Rudrabha/Wav2Lip/blob/deeec76ee8dba10cad6ef133e068659faf707f1e/models/wav2lip.py#L93
audio_sequences = torch.cat([audio_sequences[:, i] for i in range(audio_sequences.size(1))], dim=0) — but this produces (B*5, 1, 80, 16), not (B, 1, 80, 16), during training. At inference, however, the audio input is (B, 1, 80, 16) and the image input is (B, 96, 96, 6)???
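The shape effect of that torch.cat line can be verified in isolation (a minimal sketch using dummy tensors, not the actual Wav2Lip inputs):

```python
import torch

B, T = 2, 5  # batch size and number of mel chunks per sample (illustrative values)
audio_sequences = torch.zeros(B, T, 1, 80, 16)

# Same operation as in wav2lip.py#L93: slice out each of the T time steps
# (each slice is (B, 1, 80, 16)) and stack them along the batch dimension.
folded = torch.cat(
    [audio_sequences[:, i] for i in range(audio_sequences.size(1))], dim=0
)

print(folded.shape)  # torch.Size([10, 1, 80, 16]) -> (B*T, 1, 80, 16)
```

So the time dimension is folded into the batch dimension, giving B*5 samples of shape (1, 80, 16) rather than collapsing to (B, 1, 80, 16).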
Why don't we use indiv_mels with shape (B, 5, 1, 80, 16) at inference?