Closed rizwanishaq closed 3 years ago
During training, the forward() function automatically reshapes the (B, 5, 1, 80, 16) audio input here: https://github.com/Rudrabha/Wav2Lip/blob/deeec76ee8dba10cad6ef133e068659faf707f1e/models/wav2lip.py#L93
audio_sequences = torch.cat([audio_sequences[:, i] for i in range(audio_sequences.size(1))], dim=0) — but this produces (B*5, 1, 80, 16), not (B, 1, 80, 16), during training. At inference, however, the audio input is (B, 1, 80, 16) and the image input is (B, 96, 96, 6)???
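The shape effect of that torch.cat line can be verified in isolation (a minimal sketch using dummy tensors, not the actual Wav2Lip inputs):

```python
import torch

B, T = 2, 5  # batch size and number of mel chunks per sample (illustrative values)
audio_sequences = torch.zeros(B, T, 1, 80, 16)

# Same operation as in wav2lip.py#L93: slice out each of the T time steps
# (each slice is (B, 1, 80, 16)) and stack them along the batch dimension.
folded = torch.cat(
    [audio_sequences[:, i] for i in range(audio_sequences.size(1))], dim=0
)

print(folded.shape)  # torch.Size([10, 1, 80, 16]) -> (B*T, 1, 80, 16)
```

So the time dimension is folded into the batch dimension, giving B*5 samples of shape (1, 80, 16) rather than collapsing to (B, 1, 80, 16).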
Why don't we use indiv_mels with shape (B, 5, 1, 80, 16) at inference?