This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs.
def get_segmented_mels(self, spec, start_frame):
    mels = []
    assert syncnet_T == 5
    start_frame_num = self.get_frame_id(start_frame) + 1  # 0-indexing ---> 1-indexing
    # Each window starts 2 frames earlier, so the first two frames
    # of a video cannot anchor a full set of windows.
    if start_frame_num - 2 < 0:
        return None
    for i in range(start_frame_num, start_frame_num + syncnet_T):
        # Crop a mel window beginning two frames before frame i
        # (roughly centered on frame i).
        m = self.crop_audio_window(spec, i - 2)
        if m.shape[0] != syncnet_mel_step_size:
            # Not enough audio left for a full window.
            return None
        mels.append(m.T)
    mels = np.asarray(mels)
    return mels
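For reference, crop_audio_window (called above but not shown here) maps a video frame number to a slice of the mel spectrogram. A sketch of that helper as it appears in the repository's dataset class, assuming the default hparams.fps = 25 so that one second of video spans 80 mel steps:

def crop_audio_window(self, spec, start_frame):
    # start_frame is either a frame index (int) or a frame image path.
    if type(start_frame) == int:
        start_frame_num = start_frame
    else:
        start_frame_num = self.get_frame_id(start_frame)
    # 80 mel steps per second of audio; map the video frame number
    # to the corresponding rows of the spectrogram.
    start_idx = int(80. * (start_frame_num / float(hparams.fps)))
    return spec[start_idx : start_idx + syncnet_mel_step_size, :]

Since get_segmented_mels passes i - 2, each 16-step window (about 0.2 s, i.e. 5 video frames at 25 fps) begins two frames before frame i and is therefore roughly centered on it.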
Why do you use mel windows surrounding the center frame as conditioning guidance for generating synced lips? Training takes 5 windows of size 16. At the inference stage, you use only the center one of the 5 windows. Isn't there an inconsistency between training and inference? Why not use, at training time, the exact mel window (start_frame + 16) that corresponds to the target?
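To make the window arithmetic concrete, here is a minimal standalone sketch (hypothetical frame numbers; it assumes fps = 25 and 80 mel steps per second, so a 16-step window covers about 5 video frames):

syncnet_T = 5
syncnet_mel_step_size = 16
fps = 25.0

def mel_start_idx(frame_num):
    # Same frame-to-mel mapping as crop_audio_window: 80 mel steps per second.
    return int(80.0 * (frame_num / fps))

start_frame_num = 10  # hypothetical 1-indexed start of the 5-frame target window
for i in range(start_frame_num, start_frame_num + syncnet_T):
    s = mel_start_idx(i - 2)  # each window begins 2 frames before frame i
    print(f"frame {i}: mel[{s}:{s + syncnet_mel_step_size}]")
# Prints five overlapping ~0.2 s windows, one roughly centered on each
# of the 5 target frames.

In other words, training conditions each of the 5 target frames on an audio window roughly centered on that frame, rather than on a single window anchored at start_frame.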