TMElyralab / MuseTalk

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting
Position embedding is inconsistent between training and inference #136

Closed gobigrassland closed 1 month ago

gobigrassland commented 3 months ago

(1) The inference code, inference.py, adds positional encoding to the audio features:

audio_feature_batch = torch.from_numpy(whisper_batch)
audio_feature_batch = audio_feature_batch.to(device=unet.device, dtype=unet.model.dtype) # torch, B, 5*N,384
audio_feature_batch = pe(audio_feature_batch)

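(2) In the train_codes branch, neither the training nor the validation code does this, so training and inference are not guaranteed to be consistent. Why is it done this way? Positional encoding is meant for sequence modeling. The current framework works on single frames: although the audio features for each frame cover a window of audio, generation is still essentially per frame. What is the purpose of the positional encoding here?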
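
For readers unfamiliar with what a call like `pe(audio_feature_batch)` typically does, here is a minimal sketch of a standard sinusoidal positional encoding being added to one sample of audio features. It is written in plain Python to stay self-contained; the function names `sinusoidal_pe` and `add_pe` are hypothetical, and the repository's own `pe` is a PyTorch module whose details may differ.

```python
import math

def sinusoidal_pe(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_pe(features):
    """Add position encodings to one sample shaped [seq_len][d_model]."""
    pe = sinusoidal_pe(len(features), len(features[0]))
    return [[x + p for x, p in zip(f_row, p_row)]
            for f_row, p_row in zip(features, pe)]

# Toy stand-in for one sample of Whisper features (real shape: 5*N x 384).
features = [[0.0] * 8 for _ in range(4)]
encoded = add_pe(features)
```

Because the encoding is simply added to the features, a model trained on features without it sees shifted inputs at inference time if it is applied only there, which is the inconsistency raised in this issue.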

I would appreciate it if the authors could clarify.

liuzysy commented 3 months ago

Could we exchange notes? In my reproduction the mouth opens very poorly, and I would like to understand why.

xiankgx commented 2 months ago

Positional encoding is usually used with transformers because of their attention mechanism: on its own, attention ignores token order, so the encoding is added to make the order within a sequence matter. It is true that this is a single-frame prediction problem, but the voice features from Whisper are a sequence.
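
The point about attention and order can be illustrated with a toy sketch. It assumes a single-head dot-product cross-attention with no learned projections, and a hypothetical `add_positions` helper standing in for a real positional encoding; it is not the model's actual attention layer.

```python
import math

def cross_attention(query, tokens):
    """Single-head dot-product attention; tokens act as both keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
            for d in range(dim)]

def add_positions(tokens):
    """Add a position-dependent offset (hypothetical stand-in for pe)."""
    return [[x + math.sin(pos + d) for d, x in enumerate(tok)]
            for pos, tok in enumerate(tokens)]

query = [1.0, 0.5, -0.5, 0.25]
audio = [[0.1, 0.2, 0.3, 0.4], [0.9, -0.3, 0.0, 0.5], [-0.2, 0.7, 0.6, -0.1]]
reordered = audio[::-1]  # same audio tokens, reversed in time

plain_a = cross_attention(query, audio)
plain_b = cross_attention(query, reordered)
pe_a = cross_attention(query, add_positions(audio))
pe_b = cross_attention(query, add_positions(reordered))
```

Without position information, `plain_a` and `plain_b` agree up to floating-point rounding, so the attention output cannot tell the two orderings apart. With positions added, `pe_a` and `pe_b` differ, so the temporal order of the audio window becomes visible to the model.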

gobigrassland commented 2 months ago

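That is true. However, the train_codes branch does not add a position embedding to audio_feature. In principle it should be added in both places so that training and inference stay consistent, but currently training omits it while inference adds it. That is my main question.

Also, I removed pe at inference time and ran a quick test on a few cases; the visual results did not change much.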

czk32611 commented 2 months ago

In principle it should be added in both. It was changed by mistake when the code was being cleaned up.

leeguandong commented 2 months ago

It looks like the Stable Diffusion (SD) framework was reused directly, with the text-side embedding swapped for audio, so it should be added.