Closed: cavalleria closed this issue 2 years ago.
I tested my own wav file; the command is:
python demo.py --model_name vocaset --wav_path "demo/wav/fb6d70e9cd7b2bed30fa1504330180f3.wav" --dataset vocaset --vertice_dim 15069 --feature_dim 64 --period 30 --fps 30 --train_subjects "FaceTalk_170728_03272_TA FaceTalk_170904_00128_TA FaceTalk_170725_00137_TA FaceTalk_170915_00223_TA FaceTalk_170811_03274_TA FaceTalk_170913_03279_TA FaceTalk_170904_03276_TA FaceTalk_170912_03278_TA" --test_subjects "FaceTalk_170809_00138_TA FaceTalk_170731_00024_TA" --condition FaceTalk_170913_03279_TA --subject FaceTalk_170809_00138_TA
but it fails with the error below:
Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "demo.py", line 204, in <module>
    main()
  File "demo.py", line 200, in main
    test_model(args)
  File "demo.py", line 57, in test_model
    prediction = model.predict(audio_feature, template, one_hot)
  File "/evo_860/yaobin.li/workspace/FaceFormer/faceformer.py", line 157, in predict
    vertice_out = self.transformer_decoder(vertice_input, hidden_states, tgt_mask=tgt_mask, memory_mask=memory_mask)
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 248, in forward
    output = mod(output, memory, tgt_mask=tgt_mask,
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 451, in forward
    x = self.norm1(x + self._sa_block(x, tgt_mask, tgt_key_padding_mask))
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 460, in _sa_block
    x = self.self_attn(x, x, x,
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1003, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/home/yaobin.li/soft/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/nn/functional.py", line 5016, in multi_head_attention_forward
    raise RuntimeError(f"The shape of the 3D attn_mask is {attn_mask.shape}, but should be {correct_3d_size}.")
RuntimeError: The shape of the 3D attn_mask is torch.Size([4, 600, 600]), but should be (4, 601, 601)
and I checked my wav file info:
ffmpeg -i ~/wks/FaceFormer/demo/wav/fb6d70e9cd7b2bed30fa1504330180f3.wav
ffmpeg version 4.3 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7.3.0 (crosstool-NG 1.23.0.449-a04d0)
  configuration: --prefix=/home/yaobin.li/soft/miniconda3/envs/wenet --cc=/opt/conda/conda-bld/ffmpeg_1597178665428/_build_env/bin/x86_64-conda_cos6-linux-gnu-cc --disable-doc --disable-openssl --enable-avresample --enable-gnutls --enable-hardcoded-tables --enable-libfreetype --enable-libopenh264 --enable-pic --enable-pthreads --enable-shared --disable-static --enable-version3 --enable-zlib --enable-libmp3lame
  libavutil      56. 51.100 / 56. 51.100
  libavcodec     58. 91.100 / 58. 91.100
  libavformat    58. 45.100 / 58. 45.100
  libavdevice    58. 10.100 / 58. 10.100
  libavfilter     7. 85.100 /  7. 85.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  7.100 /  5.  7.100
  libswresample   3.  7.100 /  3.  7.100
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from '/home/yaobin.li/wks/FaceFormer/demo/wav/fb6d70e9cd7b2bed30fa1504330180f3.wav':
  Metadata:
    encoder         : Lavf58.45.100
  Duration: 00:01:37.94, bitrate: 256 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
So is it an audio issue?
The default value of max_seq_len is 600, which your clip exceeds: 97.94 s at 30 fps corresponds to roughly 2,939 output frames. If you'd like to use longer audio, e.g., 1~3 min, please add the following in the "demo.py" file:
from faceformer import PeriodicPositionalEncoding, init_biased_mask

# Rebuild the positional encoding and the biased attention mask with a larger
# max_seq_len so they cover clips longer than 600 frames (20 s at 30 fps).
model.PPE = PeriodicPositionalEncoding(args.feature_dim, period=args.period, max_seq_len=6000)
model.biased_mask = init_biased_mask(n_head=4, max_seq_len=6000, period=args.period)
after line 27: model.load_state_dict(.....).
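As a quick sanity check, you can compute how many output frames a clip needs and compare that against max_seq_len. A minimal sketch, assuming the soundfile package is available (frames_needed is a hypothetical helper, not part of the repo; any wav reader works):

import soundfile as sf  # assumed dependency; librosa or the stdlib wave module also work

def frames_needed(wav_path, fps=30):
    # FaceFormer emits one vertex frame per video frame, so frames = duration * fps.
    return int(sf.info(wav_path).duration * fps)

print(frames_needed("demo/wav/fb6d70e9cd7b2bed30fa1504330180f3.wav"))
# ~2938 for the 97.94 s clip above, well over the default max_seq_len of 600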
Thanks for your reply! After making the change you suggested, another problem occurred: the process gets killed by the system:
[1] 5863 killed python demo.py --model_name vocaset --wav_path "demo/wav/qq.wav" --dataset
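A kill like this usually comes from the kernel's OOM killer when the process runs out of RAM. A back-of-envelope estimate of one contributor, the enlarged biased mask (an assumption, not a confirmed diagnosis; the O(max_seq_len^2) attention score tensors during decoding grow the same way and dominate on long clips):

# Hypothetical sizing check: memory held by the enlarged biased attention
# mask alone, assuming float32 entries.
n_head, max_seq_len = 4, 6000
mask_bytes = n_head * max_seq_len * max_seq_len * 4  # bytes
print(f"{mask_bytes / 2**20:.0f} MiB")  # ~549 MiB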
I solved this problem by following issue #2.
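For anyone landing here later: a common workaround for OOM on long clips is to split the wav into segments short enough for the default buffers, run the demo on each segment, and concatenate the outputs. A minimal sketch of that idea (assuming the soundfile package; this may not be the exact fix described in issue #2):

import soundfile as sf  # assumed dependency

def split_wav(wav_path, out_pattern="chunk_{:03d}.wav", chunk_sec=19.0):
    # 19 s x 30 fps = 570 frames, safely under the default max_seq_len of 600.
    audio, sr = sf.read(wav_path)
    step = int(chunk_sec * sr)
    paths = []
    for i, start in enumerate(range(0, len(audio), step)):
        path = out_pattern.format(i)
        sf.write(path, audio[start:start + step], sr)
        paths.append(path)
    return paths  # run demo.py on each chunk, then concatenate the predictions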