This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs.
def get_segmented_mels(self, spec, start_frame):
    mels = []
    assert syncnet_T == 5
    start_frame_num = self.get_frame_id(start_frame) + 1  # 0-indexing ---> 1-indexing
    # Each window starts 2 frames earlier, so the first two frames
    # of a video cannot anchor a full set of windows.
    if start_frame_num - 2 < 0:
        return None
    for i in range(start_frame_num, start_frame_num + syncnet_T):
        # Crop a mel window beginning two frames before frame i
        # (roughly centered on frame i).
        m = self.crop_audio_window(spec, i - 2)
        if m.shape[0] != syncnet_mel_step_size:
            # Not enough audio left for a full window.
            return None
        mels.append(m.T)
    mels = np.asarray(mels)
    return mels
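For reference, crop_audio_window (called above but not shown here) maps a video frame number to a slice of the mel spectrogram. A sketch of that helper as it appears in the repository's dataset class, assuming the default hparams.fps = 25 so that one second of video spans 80 mel steps:

def crop_audio_window(self, spec, start_frame):
    # start_frame is either a frame index (int) or a frame image path.
    if type(start_frame) == int:
        start_frame_num = start_frame
    else:
        start_frame_num = self.get_frame_id(start_frame)
    # 80 mel steps per second of audio; map the video frame number
    # to the corresponding rows of the spectrogram.
    start_idx = int(80. * (start_frame_num / float(hparams.fps)))
    return spec[start_idx : start_idx + syncnet_mel_step_size, :]

Since get_segmented_mels passes i - 2, each 16-step window (about 0.2 s, i.e. 5 video frames at 25 fps) begins two frames before frame i and is therefore roughly centered on it.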
Why do you use mel windows surrounding the center frame as conditioning guidance for generating synced lips? Training takes 5 windows of size 16. At the inference stage, you use only the center one of the 5 windows. Isn't there an inconsistency between training and inference? Why not use, at training time, the exact mel window (start_frame + 16) that corresponds to the target?
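To make the window arithmetic concrete, here is a minimal standalone sketch (hypothetical frame numbers; it assumes fps = 25 and 80 mel steps per second, so a 16-step window covers about 5 video frames):

syncnet_T = 5
syncnet_mel_step_size = 16
fps = 25.0

def mel_start_idx(frame_num):
    # Same frame-to-mel mapping as crop_audio_window: 80 mel steps per second.
    return int(80.0 * (frame_num / fps))

start_frame_num = 10  # hypothetical 1-indexed start of the 5-frame target window
for i in range(start_frame_num, start_frame_num + syncnet_T):
    s = mel_start_idx(i - 2)  # each window begins 2 frames before frame i
    print(f"frame {i}: mel[{s}:{s + syncnet_mel_step_size}]")
# Prints five overlapping ~0.2 s windows, one roughly centered on each
# of the 5 target frames.

In other words, training conditions each of the 5 target frames on an audio window roughly centered on that frame, rather than on a single window anchored at start_frame.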