Position embedding handling is inconsistent between training and inference

TMElyralab / MuseTalk

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

Position embedding handling is inconsistent between training and inference #136

Closed: gobigrassland closed this issue 1 month ago

gobigrassland commented 3 months ago

(1) The inference code, inference.py, adds a positional encoding to the audio features:

audio_feature_batch = torch.from_numpy(whisper_batch)
audio_feature_batch = audio_feature_batch.to(device=unet.device, dtype=unet.model.dtype) # torch, B, 5*N,384
audio_feature_batch = pe(audio_feature_batch)

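(2) In the train_codes branch, neither the training nor the validation code does this, so training and inference are not guaranteed to be consistent. Why is it done this way? Positional encoding is meant for sequence modeling, whereas the current framework works on single frames. The audio feature for each frame does cover a window of audio, but generation is still essentially per frame. What is the purpose of the positional encoding here?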

Could the authors please help clarify this?
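
For readers unfamiliar with the operation in (1), here is a minimal sketch of what a call like `pe(audio_feature_batch)` typically does. The thread does not show how `pe` is defined, so the class name `SinusoidalPE`, the buffer name `table`, and the choice of a fixed sinusoidal encoding with `max_len=5000` are assumptions for illustration, not the repository's confirmed implementation.

```python
import math

import torch
from torch import nn


class SinusoidalPE(nn.Module):
    """Fixed sinusoidal positional encoding (hypothetical stand-in for `pe`)."""

    def __init__(self, d_model: int = 384, max_len: int = 5000):
        super().__init__()
        # One row per position, one column per channel; d_model must be even.
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32)
            * (-math.log(10000.0) / d_model)
        )
        table = torch.zeros(max_len, d_model)
        table[:, 0::2] = torch.sin(position * div_term)  # even channels
        table[:, 1::2] = torch.cos(position * div_term)  # odd channels
        # A buffer follows the module across devices but is not trained.
        self.register_buffer("table", table.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model). Add the first T rows of the table to every sample.
        seq_len = x.size(1)
        return x + self.table[:, :seq_len].to(dtype=x.dtype)


pe = SinusoidalPE(d_model=384)
audio_feature_batch = torch.zeros(2, 50, 384)  # dummy (B, 5*N, 384) with N = 10
out = pe(audio_feature_batch)
print(out.shape)
```

Because the encoding is purely additive and has no learned parameters, applying it only at inference shifts the conditioning input away from what the model saw during training, which is the mismatch raised in (2).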

liuzysy commented 3 months ago

Could we compare notes? In my reproduction the mouth opening comes out poorly, and I would like to understand what causes this.

xiankgx commented 2 months ago

Positional encoding is usually used with transformers because their attention mechanism does not, by itself, take the order of the input tokens into account. The encoding is added so that the order of elements in a sequence matters. It is true that this is a single-frame prediction problem, but the voice features from Whisper form a sequence.
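
The point about order can be checked with a small sketch. It assumes the audio tokens act as keys and values in a cross-attention layer; the function `cross_attention`, the tensor sizes, and the toy position offset `pos` are illustrative choices, not code from the repository. Without any positional signal, reversing the audio tokens leaves the attention output unchanged; once a position-dependent offset is added, the two orderings give different results.

```python
import torch


def cross_attention(q, k, v):
    """Plain scaled dot-product attention, no learned projections."""
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
q = torch.randn(1, 4, 16, dtype=torch.float64)       # queries, e.g. image latents
audio = torch.randn(1, 10, 16, dtype=torch.float64)  # keys/values, e.g. audio tokens
reversed_audio = torch.flip(audio, dims=[1])         # same tokens, reversed order

# No positional signal: the output does not depend on token order.
same_without_pe = torch.allclose(
    cross_attention(q, audio, audio),
    cross_attention(q, reversed_audio, reversed_audio),
)

# Toy positional signal (not sinusoidal): add 0.1 * index to every channel.
pos = 0.1 * torch.arange(10, dtype=torch.float64).view(1, 10, 1)
same_with_pe = torch.allclose(
    cross_attention(q, audio + pos, audio + pos),
    cross_attention(q, reversed_audio + pos, reversed_audio + pos),
)

print(same_without_pe, same_with_pe)
```

The first comparison should report equal outputs and the second should not, which is why a positional encoding on the Whisper features can matter even though each forward pass generates a single frame.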

gobigrassland commented 2 months ago

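That's right. However, the train_codes branch does not add the position embedding to audio_feature. In principle it should be added in both places so that training and inference stay consistent, but currently training omits it while inference adds it. That is my main question.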

Also, I removed pe at inference time and ran a quick test on a few cases; the visual results did not change much.

czk32611 commented 2 months ago

In principle it should be added in both. It was changed by mistake while the code was being cleaned up...
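
One possible way to keep the two code paths from drifting apart again is to route both through a single helper. This is only a sketch: `prepare_audio_features` and `toy_pe` are hypothetical names, and `toy_pe` is a deliberately simple stand-in for the real positional encoding module, not the repository's code.

```python
import numpy as np
import torch


def toy_pe(x: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the real positional encoding: adds the position index."""
    positions = torch.arange(x.size(1), device=x.device, dtype=x.dtype).view(1, -1, 1)
    return x + positions


def prepare_audio_features(whisper_batch: np.ndarray, pe, device, dtype) -> torch.Tensor:
    """Convert a Whisper feature batch to a tensor and apply the positional encoding.

    Calling this one function from both the training loop and inference.py
    means the encoding cannot be applied in one place and forgotten in the other.
    """
    feats = torch.from_numpy(whisper_batch).to(device=device, dtype=dtype)
    return pe(feats)


whisper_batch = np.zeros((2, 50, 384), dtype=np.float32)  # dummy (B, 5*N, 384)

# The same call would be used in the training step and at inference time.
train_feats = prepare_audio_features(whisper_batch, toy_pe, "cpu", torch.float32)
infer_feats = prepare_audio_features(whisper_batch, toy_pe, "cpu", torch.float32)
print(torch.equal(train_feats, infer_feats))
```

Whether the existing checkpoints should then be used with or without the encoding depends on how they were actually trained, which the thread does not settle.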

leeguandong commented 2 months ago

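It looks like the Stable Diffusion (SD) framework was reused directly, with the text-side embedding swapped for audio, so the positional encoding should be added.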