Fictionarry / ER-NeRF

[ICCV'23] Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis
https://fictionarry.github.io/ER-NeRF/
MIT License
1.02k stars 133 forks

Given groups=1, weight of size [32, 29, 3], expected input[8, 1024, 2] to have 29 channels, but got 1024 channels instead #28

Open angelandy opened 1 year ago

angelandy commented 1 year ago

Hi, thank you for your project. Here is my error; I don't know which step went wrong.

Here is my command: python main.py data/obama/ --workspace trial_obama/ -O --test --test_train --aud data/1_hu.npy

And this is the message I got:

root@d8e5bdfb3898:/data/ER-NeRF-main# python main.py data/obama/ --workspace trial_obama/ -O --test --test_train --aud data/1_hu.npy
Namespace(H=450, O=True, W=450, amb_aud_loss=1, amb_dim=2, amb_eye_loss=1, asr=False, asr_model='deepspeech', asr_play=False, asr_save_feats=False, asr_wav='', att=2, aud='data/1_hu.npy', bg_img='', bound=1, ckpt='latest', color_space='srgb', cuda_ray=True, data_range=[0, -1], density_thresh=10, density_thresh_torso=0.01, dt_gamma=0.00390625, emb=False, exp_eye=True, fbg=False, finetune_lips=False, fix_eye=-1, fovy=21.24, fp16=True, fps=50, gui=False, head_ckpt='', ind_dim=4, ind_dim_torso=8, ind_num=10000, init_lips=False, iters=200000, l=10, lambda_amb=0.0001, lr=0.01, lr_net=0.001, m=50, max_ray_batch=4096, max_spp=1, max_steps=16, min_near=0.05, num_rays=65536, num_steps=16, offset=[0, 0, 0], part=False, part2=False, patch_size=1, path='data/obama/', preload=0, r=10, radius=3.35, scale=4, seed=0, smooth_eye=False, smooth_lips=False, smooth_path=False, smooth_path_window=7, test=True, test_train=True, torso=False, torso_shrink=0.8, train_camera=False, unc_loss=1, update_extra_interval=16, upsample_steps=0, warmup_step=10000, workspace='trial_obama/')
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/opt/conda/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/opt/conda/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Loading model from: /opt/conda/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/opt/conda/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/opt/conda/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Loading model from: /opt/conda/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
[INFO] Trainer: ngp | 2023-08-25_10-10-16 | cuda | fp16 | trial_obama/
[INFO] #parameters: 587989
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
[INFO] load 7272 train frames.
[INFO] load data/1_hu.npy aud_features: torch.Size([371, 1024, 2])
Loading train data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7272/7272 [00:01<00:00, 6587.96it/s]
[INFO] eye_area: 0.0 - 1.0
==> Start Test, save results to trial_obama/results
  0% 0/371 [00:00<?, ?it/s]Traceback (most recent call last):
  File "main.py", line 206, in <module>
    trainer.test(test_loader)
  File "/data/ER-NeRF-main/nerf_triplane/utils.py", line 1023, in test
    preds, preds_depth = self.test_step(data)
  File "/data/ER-NeRF-main/nerf_triplane/utils.py", line 939, in test_step
    outputs = self.model.render(rays_o, rays_d, auds, bg_coords, poses, eye=eye, index=index, staged=True, bg_color=bg_color, perturb=perturb, **vars(self.opt))
  File "/data/ER-NeRF-main/nerf_triplane/renderer.py", line 675, in render
    results = _run(rays_o, rays_d, auds, bg_coords, poses, **kwargs)
  File "/data/ER-NeRF-main/nerf_triplane/renderer.py", line 188, in run_cuda
    enc_a = self.encode_audio(auds) # [1, 64]
  File "/data/ER-NeRF-main/nerf_triplane/network.py", line 232, in encode_audio
    enc_a = self.audio_net(a) # [1/8, 64]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/ER-NeRF-main/nerf_triplane/network.py", line 64, in forward
    x = self.encoder_conv(x).squeeze(-1)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 309, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 305, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [32, 29, 3], expected input[8, 1024, 2] to have 29 channels, but got 1024 channels instead
  0% 0/371 [00:00<?, ?it/s]
Galaxy-AI commented 1 year ago

+1, my message:

python main.py data/obama/ --workspace /train/trial_obama/ -O --iters 100000
Namespace(path='data/obama/', O=True, test=False, test_train=False, data_range=[0, -1], workspace='/train/trial_obama/', seed=0, iters=100000, lr=0.01, lr_net=0.001, ckpt='latest', num_rays=65536, cuda_ray=True, max_steps=16, num_steps=16, upsample_steps=0, update_extra_interval=16, max_ray_batch=4096, warmup_step=10000, amb_aud_loss=1, amb_eye_loss=1, unc_loss=1, lambda_amb=0.0001, fp16=True, bg_img='', fbg=False, exp_eye=True, fix_eye=-1, smooth_eye=False, torso_shrink=0.8, color_space='srgb', preload=0, bound=1, scale=4, offset=[0, 0, 0], dt_gamma=0.00390625, min_near=0.05, density_thresh=10, density_thresh_torso=0.01, patch_size=1, init_lips=False, finetune_lips=False, smooth_lips=False, torso=False, head_ckpt='', gui=False, W=450, H=450, radius=3.35, fovy=21.24, max_spp=1, att=2, aud='', emb=False, ind_dim=4, ind_num=10000, ind_dim_torso=8, amb_dim=2, part=False, part2=False, train_camera=False, smooth_path=False, smooth_path_window=7, asr=False, asr_wav='', asr_play=False, asr_model='yy', asr_save_feats=False, fps=50, l=10, m=50, r=10)
[INFO] load 7272 train frames.
[INFO] load aud_features: torch.Size([7999, 29, 16])
Loading train data: 100%|██████████████████████████████████████| 7272/7272 [00:00<00:00, 9923.96it/s]
[INFO] eye_area: 0.0 - 1.0
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/root/anaconda3/envs/geneface/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/root/anaconda3/envs/geneface/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=AlexNet_Weights.IMAGENET1K_V1. You can also use weights=AlexNet_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
Loading model from: /root/anaconda3/envs/geneface/lib/python3.9/site-packages/lpips/weights/v0.1/alex.pth
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
Loading model from: /root/anaconda3/envs/geneface/lib/python3.9/site-packages/lpips/weights/v0.1/alex.pth
[INFO] Trainer: ngp | 2023-09-02_05-36-15 | cuda | fp16 | /train/trial_obama/
[INFO] #parameters: 588277
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
[INFO] load 100 val frames.
[INFO] load aud_features: torch.Size([7999, 29, 16])
Loading val data: 100%|█████████████████████████████████████████| 100/100 [00:00<00:00, 10408.48it/s]
[INFO] eye_area: 0.0 - 0.8050000071525574
[INFO] max_epoch = 14
==> Start Training Epoch 1, lr=0.001000 ...
  0% 0/7272 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/www/wwwroot/AI/geneface/ER-NeRF-main/main.py", line 248, in <module>
    trainer.train(train_loader, valid_loader, max_epochs)
  File "/www/wwwroot/AI/geneface/ER-NeRF-main/nerf_triplane/utils.py", line 983, in train
    self.train_one_epoch(train_loader)
  File "/www/wwwroot/AI/geneface/ER-NeRF-main/nerf_triplane/utils.py", line 1241, in train_one_epoch
    self.model.update_extra_state()
  File "/root/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/www/wwwroot/AI/geneface/ER-NeRF-main/nerf_triplane/renderer.py", line 432, in update_extra_state
    enc_a = self.encode_audio(auds)
  File "/www/wwwroot/AI/geneface/ER-NeRF-main/nerf_triplane/network.py", line 232, in encode_audio
    enc_a = self.audio_net(a) # [1/8, 64]
  File "/root/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/www/wwwroot/AI/geneface/ER-NeRF-main/nerf_triplane/network.py", line 64, in forward
    x = self.encoder_conv(x).squeeze(-1)
  File "/root/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/root/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [32, 32, 3], expected input[8, 29, 16] to have 32 channels, but got 29 channels instead
  0% 0/7272 [00:00<?, ?it/s]

jacqueline-weng commented 1 year ago

I think it is because you used a different audio feature extraction method. HuBERT gives 1024-dimensional features, while DeepSpeech gives something shaped like [x, 29, 16].
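
A quick way to check which extractor produced a given feature file is a minimal sketch like the one below (the path is the one from the first post; the 29/1024 channel counts are simply the shapes reported in this thread):

```python
# Inspect a saved audio feature file to see which extractor produced it.
# The path and the 29/1024 channel counts come from this thread.
import numpy as np

feats = np.load('data/1_hu.npy')
print('aud_features shape:', feats.shape)

if feats.shape[1] == 29:
    # DeepSpeech-style features, e.g. [7999, 29, 16]
    print('DeepSpeech features -> use a model trained with the default --asr_model deepspeech')
elif feats.shape[1] == 1024:
    # HuBERT-style features, e.g. [371, 1024, 2]
    print('HuBERT features -> use --asr_model hubert and a model trained on HuBERT features')
else:
    print('Unrecognized feature layout')
```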

baijiesong commented 1 year ago

(Quotes the full command, log, and traceback from the first post; see above.)

Can you solve this problem?

baijiesong commented 1 year ago

Hi, did you manage to solve this problem?

Song950106 commented 1 year ago

It seems like you are missing the argument `--asr_model hubert`.
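
If the features in data/1_hu.npy really are HuBERT features, the test command from the first post would then look like this (untested, just adding that flag): python main.py data/obama/ --workspace trial_obama/ -O --test --test_train --aud data/1_hu.npy --asr_model hubert. Note that the checkpoint in trial_obama/ also has to be one that was trained on HuBERT features.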

danyow-cheung commented 10 months ago

The pretrained model was trained with the DeepSpeech audio extractor, which is why it expects 29 channels, whereas HuBERT features have 1024 channels. If you want to use HuBERT audio, you probably have to train the model from scratch.
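
For illustration only, here is a standalone sketch (not the actual ER-NeRF AudioNet code) of why this fails: the first Conv1d is built with in_channels fixed to the feature dimension used at training time, so feeding features from a different extractor reproduces exactly this RuntimeError.

```python
# Standalone sketch of the channel mismatch (illustrative, not ER-NeRF code).
import torch
import torch.nn as nn

deepspeech_dim = 29   # per-frame channels of DeepSpeech features (as in this thread)
hubert_dim = 1024     # per-frame channels of HuBERT features (as in this thread)

# A conv encoder built for DeepSpeech features: its weight shape is [32, 29, 3].
encoder_conv = nn.Conv1d(deepspeech_dim, 32, kernel_size=3, padding=1)

print(encoder_conv(torch.randn(8, deepspeech_dim, 16)).shape)  # works: [8, 32, 16]

try:
    encoder_conv(torch.randn(8, hubert_dim, 2))  # HuBERT-shaped input
except RuntimeError as e:
    # "Given groups=1, weight of size [32, 29, 3], expected input[8, 1024, 2]
    #  to have 29 channels, but got 1024 channels instead"
    print(e)
```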

XinBow99 commented 6 months ago

Run process.py with --asr_model hubert and train again.
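
In other words, the extractor has to match at every stage: re-run process.py with --asr_model hubert to regenerate the training features, retrain from scratch with the same flag (for example python main.py data/obama/ --workspace trial_obama_hubert/ -O --iters 200000 --asr_model hubert, where trial_obama_hubert/ is just a placeholder workspace name), and then test with the same flag and the HuBERT .npy, e.g. python main.py data/obama/ --workspace trial_obama_hubert/ -O --test --test_train --aud data/1_hu.npy --asr_model hubert.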