Video-frame/audio-feature alignment index looks inconsistent between inference and training code

TMElyralab / MuseTalk

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

Video-frame/audio-feature alignment index looks inconsistent between inference and training code #108

Closed gobigrassland closed 1 month ago

gobigrassland commented 1 month ago

The inference code extracts audio features in musetalk/whisper/audio2feature.py, where the audio feature index is located with center_idx = int(vid_idx*50/fps):

    def get_sliced_feature(self,
                           feature_array, 
                           vid_idx, 
                           audio_feat_length=[2,2],
                           fps=25):

        center_idx = int(vid_idx*50/fps) 
        left_idx = center_idx-audio_feat_length[0]*2
        right_idx = center_idx + (audio_feat_length[1]+1)*2

The training code on the train_codes branch, like the Wav2Lip code, instead locates the audio index with start_idx = int(80. * (start_frame_num / float(hparams.fps))):

    def crop_audio_window(self, spec, start_frame):
        if type(start_frame) == int:
            start_frame_num = start_frame
        else:
            start_frame_num = self.get_frame_id(start_frame)
        start_idx = int(80. * (start_frame_num / float(hparams.fps)))

        end_idx = start_idx + syncnet_mel_step_size

        return spec[start_idx : end_idx, :]

One coefficient is 50 and the other is 80. Why are they not consistent?
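To make the mismatch concrete, here is a minimal sketch that evaluates both index formulas for the same video frame. The function names `whisper_window` and `mel_window` are hypothetical wrappers around the two quoted snippets, and `syncnet_mel_step_size=16` is assumed from Wav2Lip's defaults. The quoted snippets do not show how out-of-range indices are handled, so the sketch leaves negative indices as they are.

```python
def whisper_window(vid_idx, fps=25, audio_feat_length=(2, 2)):
    """Index range computed by the inference-side formula (coefficient 50)."""
    center_idx = int(vid_idx * 50 / fps)
    left_idx = center_idx - audio_feat_length[0] * 2
    right_idx = center_idx + (audio_feat_length[1] + 1) * 2
    return left_idx, right_idx


def mel_window(start_frame_num, fps=25, syncnet_mel_step_size=16):
    """Index range computed by the Wav2Lip-style formula (coefficient 80).

    syncnet_mel_step_size=16 is an assumption taken from Wav2Lip's defaults.
    """
    start_idx = int(80.0 * (start_frame_num / float(fps)))
    return start_idx, start_idx + syncnet_mel_step_size


if __name__ == "__main__":
    # Same video frames, two different feature index spaces.
    for frame in (0, 10, 25):
        print(frame, whisper_window(frame), mel_window(frame))
```

The two functions index different feature arrays, so the ranges are not expected to agree; the sketch only shows how far apart they land for the same frame.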

gobigrassland commented 1 month ago

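I compared Wav2Lip with the current inference code. Wav2Lip extracts spectrogram features with its audio module, while the current inference code extracts features with whisper. These two kinds of features should differ. What I don't understand is why the train_codes branch is written this way: its audio extraction does not match the inference code.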
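A likely explanation for the two coefficients is that they are simply the frame rates of the two feature types. The sketch below derives them under stated assumptions that are not confirmed in this thread: 16 kHz audio, Whisper's log-mel hop of 160 samples followed by a stride-2 convolution in its encoder, and Wav2Lip's default mel hop size of 200 samples.

```python
SAMPLE_RATE = 16_000  # Hz, assumed for both pipelines

# Whisper (assumed): hop of 160 samples gives 100 mel frames per second,
# and the encoder's stride-2 convolution halves that to 50 features per second.
WHISPER_HOP = 160
WHISPER_ENCODER_STRIDE = 2
whisper_rate = SAMPLE_RATE // WHISPER_HOP // WHISPER_ENCODER_STRIDE

# Wav2Lip (assumed): mel spectrogram hop size of 200 samples.
WAV2LIP_HOP = 200
wav2lip_rate = SAMPLE_RATE // WAV2LIP_HOP


def features_per_video_frame(rate, fps=25):
    """How many audio feature rows correspond to one video frame."""
    return rate / fps


if __name__ == "__main__":
    print("whisper features/s:", whisper_rate)
    print("wav2lip mel frames/s:", wav2lip_rate)
    print("per video frame at 25 fps:",
          features_per_video_frame(whisper_rate),
          features_per_video_frame(wav2lip_rate))
```

Under these assumptions each coefficient is correct for its own feature type, so the inconsistency would only matter if one formula were applied to the other kind of feature.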

jinqinn commented 1 month ago

@gobigrassland See https://github.com/TMElyralab/MuseTalk/pull/62

gobigrassland commented 1 month ago

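@gobigrassland See #62

I had already noticed the problem raised in that issue: after chunking, the number of audio feature chunks does not necessarily match the number of video frames. I filter for this during preprocessing. After converting the video to 25 fps and the audio to 16 kHz, I discard any sample where the two counts differ by more than 3.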
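The filtering rule described above could be sketched as follows. This is only an illustration of the stated threshold, not the commenter's actual preprocessing code; `keep_sample`, `max_diff`, and the clip names are hypothetical.

```python
def keep_sample(num_video_frames, num_audio_chunks, max_diff=3):
    """Keep a clip only if the frame and audio-chunk counts differ by at most max_diff."""
    return abs(num_video_frames - num_audio_chunks) <= max_diff


if __name__ == "__main__":
    # Hypothetical clips: (name, video frame count, audio feature chunk count).
    clips = [("clip_a", 250, 250), ("clip_b", 250, 253), ("clip_c", 250, 258)]
    kept = [name for name, n_frames, n_chunks in clips
            if keep_sample(n_frames, n_chunks)]
    print(kept)
```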

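What I mainly want to confirm is the line in the inference code

 center_idx = int(vid_idx*50/fps)

versus the line in Dataloader.py on train_codes

 start_idx = int(80. * (start_frame_num / float(hparams.fps)))

The difference between them confuses me.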

czk32611 commented 1 month ago

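@gobigrassland See #62

I had already noticed the problem raised in that issue: after chunking, the number of audio feature chunks does not necessarily match the number of video frames. I filter for this during preprocessing. After converting the video to 25 fps and the audio to 16 kHz, I discard any sample where the two counts differ by more than 3.

What I mainly want to confirm is the line in the inference code
 center_idx = int(vid_idx*50/fps)
versus the line in Dataloader.py on train_codes
 start_idx = int(80. * (start_frame_num / float(hparams.fps)))
The difference between them confuses me.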

I checked: crop_audio_window is not actually used. The code just was not cleaned up properly.

In practice, we save the same whisper features used by the inference code ahead of time and load them in the dataloader. See here.