Closed: gobigrassland closed this 1 month ago
Could we compare notes? In my reproduction the mouth opening is very poor, and I would like to understand what causes this.
Positional encoding is usually used with transformers because of their attention mechanism. Attention by itself ignores token order, so the encoding is added to make the order within a sequence matter. It is true that this is a single-frame prediction problem, but the voice features from Whisper are a sequence.
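To make the point above concrete, here is a minimal sketch of the fixed sinusoidal encoding from the original transformer design, added element-wise to a short window of audio tokens. It is not taken from this repository: the function names, the window length of 4, and the feature size of 6 are all illustrative assumptions, and the project may build its encoding differently.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal encodings."""
    table = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # i walks over the even feature indices; each pair (i, i + 1)
        # shares one frequency and stores sin and cos of the same angle.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table


def add_positional_encoding(features):
    """Add the encoding element-wise to a list of feature vectors."""
    seq_len, d_model = len(features), len(features[0])
    pe = sinusoidal_positional_encoding(seq_len, d_model)
    return [[x + p for x, p in zip(row, pe_row)]
            for row, pe_row in zip(features, pe)]


# Toy window: 4 identical audio tokens of dimension 6 (hypothetical sizes).
window = [[0.5] * 6 for _ in range(4)]
encoded = add_positional_encoding(window)
```

Before the encoding is added, the four tokens are indistinguishable. Afterwards each row carries a different offset, which is the only way attention can tell them apart by position.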
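That's right. However, the train_codes branch does not add a position embedding to audio_feature. In principle it should be added in both places so that training and inference stay consistent, but currently training omits it while inference adds it. That is the main thing I am unsure about.
In addition, I removed the PE at inference and ran a quick test on a few cases; the visual results did not change much.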
In principle it should be added in both. I got this wrong when tidying up the code...
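One way to avoid this kind of drift, sketched under assumptions: route both code paths through a single helper controlled by one config flag, so the encoding cannot be applied in one place and forgotten in the other. `AudioCondConfig`, `prepare_audio_cond`, `training_inputs`, `inference_inputs`, and the toy `pe_table` are hypothetical names and values for illustration, not the repository's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioCondConfig:
    # Single switch shared by training and inference (hypothetical).
    use_pe: bool = True


def prepare_audio_cond(features, pe_table, cfg):
    """Apply (or skip) positional encoding in exactly one place."""
    if not cfg.use_pe:
        return [list(row) for row in features]
    return [[x + p for x, p in zip(row, pe_row)]
            for row, pe_row in zip(features, pe_table)]


def training_inputs(features, pe_table, cfg):
    # The training path delegates to the shared helper.
    return prepare_audio_cond(features, pe_table, cfg)


def inference_inputs(features, pe_table, cfg):
    # The inference path delegates to the same helper.
    return prepare_audio_cond(features, pe_table, cfg)


cfg = AudioCondConfig(use_pe=True)
features = [[1.0, 2.0], [3.0, 4.0]]    # toy audio tokens
pe_table = [[0.0, 1.0], [0.5, 0.25]]   # toy stand-in for a real PE table
train_x = training_inputs(features, pe_table, cfg)
infer_x = inference_inputs(features, pe_table, cfg)
```

With this layout, switching the encoding on or off is a single config change that affects both paths identically.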
It looks like the SD framework was reused directly, with the text-side embedding swapped for audio, so it should be added.
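(1) The inference code, inference.py, adds positional encoding to the audio features.
(2) In the train_codes branch, neither the training nor the validation code does this. As a result, training and inference are not consistent. Why is it done this way? Positional encoding is meant for sequence modeling, while the current framework operates on single frames. Although the corresponding audio feature covers a window of audio, generation is still essentially single-frame. What is the purpose of the positional encoding here?
I would appreciate it if the author could help clarify this.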
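As a small illustration of what positional encoding could contribute even for single-frame generation, the sketch below shows that plain dot-product attention over a window of audio tokens gives the same output when the tokens are reordered, and a different output once a position table is added. This is a simplification under stated assumptions: one query, keys equal to values, no learned projections, and a made-up `pe_table`. It does not reproduce the model's actual cross-attention layers.

```python
import math


def attend(query, tokens):
    """Dot-product attention of one query over tokens (keys = values)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale
              for tok in tokens]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
            for d in range(dim)]


def add_pe(tokens, pe_table):
    """Add a per-position offset to each token."""
    return [[x + p for x, p in zip(tok, pe_row)]
            for tok, pe_row in zip(tokens, pe_table)]


query = [0.2, -0.1]                                # toy image-side query
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # toy audio window
reversed_tokens = tokens[::-1]                     # same audio, order flipped
pe_table = [[0.0, 1.0], [0.8, 0.5], [0.9, -0.4]]   # hypothetical position table

no_pe_a = attend(query, tokens)
no_pe_b = attend(query, reversed_tokens)
pe_a = attend(query, add_pe(tokens, pe_table))
pe_b = attend(query, add_pe(reversed_tokens, pe_table))
```

Here `no_pe_a` and `no_pe_b` agree up to floating-point rounding, while `pe_a` and `pe_b` differ. Without the encoding, the model sees the audio window as an unordered set. Whether that ordering matters enough to change the visual result is an empirical question, which fits the report above that removing the PE at inference changed little.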