NVIDIA / vid2vid

Pytorch implementation of our method for high-resolution (e.g. 2048x1024) photorealistic video-to-video translation.

Sometimes ran into RuntimeError: Given groups=1, weight of size [64, 18, 7, 7]... when training. #163

Closed sheiun closed 4 years ago

sheiun commented 4 years ago

I trained the model with the following parameters:

python train.py --name pose \
--dataroot datasets/pose --dataset_mode pose \
--input_nc 6 --ngf 64 --num_D 2 \
--resize_or_crop scaleHeight_and_scaledCrop --loadSize 288 --fineSize 256 \
--niter 5 --niter_decay 5 \
--n_frames_total 20 --max_t_step 4 \
--max_frames_per_gpu 8

Logs

(epoch: 8, iters: 18006, time: 4.986) D_T_fake0: 0.064 D_T_fake1: 0.228 D_T_real0: 0.167 D_T_real1: 0.120 D_fake: 0.161 D_real: 0.498 G_GAN: 2.842 G_GAN_Feat: 5.241 G_T_GAN>
(epoch: 8, iters: 18106, time: 5.215) D_T_fake0: 0.038 D_T_fake1: 0.124 D_T_real0: 0.072 D_T_real1: 0.155 D_fake: 0.494 D_real: 0.542 G_GAN: 2.327 G_GAN_Feat: 5.192 G_T_GAN>
(epoch: 8, iters: 18206, time: 5.250) D_T_fake0: 0.050 D_T_fake1: 0.053 D_T_real0: 0.139 D_T_real1: 0.108 D_fake: 0.361 D_real: 0.640 G_GAN: 2.527 G_GAN_Feat: 4.626 G_T_GAN>
(epoch: 8, iters: 18306, time: 5.295) D_T_fake0: 0.019 D_T_fake1: 0.229 D_T_real0: 0.278 D_T_real1: 0.049 D_fake: 0.573 D_real: 0.695 G_GAN: 2.337 G_GAN_Feat: 4.622 G_T_GAN>

Traceback

Traceback (most recent call last):
  File "train.py", line 148, in <module>
    train()
  File "train.py", line 55, in train
    fake_B, fake_B_raw, flow, weight, real_A, real_Bp, fake_B_last = modelG(input_A, input_B, inst_A, fake_B_prev_last)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/vid2vid/models/models.py", line 37, in forward
    outputs = self.model(*inputs, **kwargs, dummy_bs=self.pad_bs)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/vid2vid/models/vid2vid_model_G.py", line 133, in forward
    fake_B, fake_B_raw, flow, weight = self.generate_frame_train(netG, real_A_all, fake_B_prev, start_gpu, is_first_frame)
  File "/vid2vid/models/vid2vid_model_G.py", line 178, in generate_frame_train
    fake_B_feat, flow_feat, fake_B_fg_feat, use_raw_only)
  File "/vid2vid/models/networks.py", line 204, in forward
    downsample = self.model_down_seg(input) + self.model_down_img(img_prev)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [64, 18, 7, 7], expected input[1, 12, 262, 198] to have 18 channels, but got 12 channels instead

I've tried resuming training, but the error still occurs after 10,000 to 100,000 iterations.

sheiun commented 4 years ago

Traceback (most recent call last):
  File "train.py", line 148, in <module>
    train()
  File "train.py", line 55, in train
    fake_B, fake_B_raw, flow, weight, real_A, real_Bp, fake_B_last = modelG(input_A, input_B, inst_A, fake_B_prev_last)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/vid2vid/models/models.py", line 37, in forward
    outputs = self.model(*inputs, **kwargs, dummy_bs=self.pad_bs)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/vid2vid/models/vid2vid_model_G.py", line 133, in forward
    fake_B, fake_B_raw, flow, weight = self.generate_frame_train(netG, real_A_all, fake_B_prev, start_gpu, is_first_frame)        
  File "/vid2vid/models/vid2vid_model_G.py", line 178, in generate_frame_train
    fake_B_feat, flow_feat, fake_B_fg_feat, use_raw_only)
  File "/vid2vid/models/networks.py", line 204, in forward
    downsample = self.model_down_seg(input) + self.model_down_img(img_prev)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [64, 9, 7, 7], expected input[1, 6, 262, 198] to have 9 channels, but got 6 channels instead

sheiun commented 4 years ago

I thought the problem was the expected input of (1): Conv2d(9, 64, kernel_size=(7, 7), stride=(1, 1)): the layer expects 9 channels, but my input had 6. It doesn't happen every time, though; most of the time my input has 9 channels.
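The channel counts in both error messages are consistent with several frames being concatenated along the channel axis. Below is a minimal sketch of that arithmetic. It assumes each frame contributes input_nc channels: 6 for the command above, and 3 for the second traceback, which is an assumption because the thread does not state it. frames_in_stack is a hypothetical helper, not part of vid2vid.

```python
def frames_in_stack(total_channels: int, channels_per_frame: int) -> int:
    """Return how many frames a channel-concatenated stack holds."""
    if total_channels % channels_per_frame != 0:
        raise ValueError(
            f"{total_channels} channels is not a multiple of {channels_per_frame}"
        )
    return total_channels // channels_per_frame


# First traceback (--input_nc 6): the conv weight wants 18 channels, the input has 12.
expected_first = frames_in_stack(18, 6)
got_first = frames_in_stack(12, 6)

# Second traceback (assumed 3 channels per frame): weight wants 9, input has 6.
expected_second = frames_in_stack(9, 3)
got_second = frames_in_stack(6, 3)

print(expected_first, got_first, expected_second, got_second)
```

Under these assumptions, both failures correspond to the layer expecting three frames and receiving two, which would explain why the error appears only on some batches.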

When printing self.model_down_seg:

Sequential(
  (0): ReflectionPad2d((3, 3, 3, 3))
  (1): Conv2d(9, 64, kernel_size=(7, 7), stride=(1, 1))
  (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (3): ReLU(inplace=True)
  (4): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (5): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (6): ReLU(inplace=True)
  (7): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (8): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (9): ReLU(inplace=True)
  (10): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (11): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (12): ReLU(inplace=True)
  (13): ResnetBlock(
    (conv_block): Sequential(
      (0): ReflectionPad2d((1, 1, 1, 1))
      (1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): ReLU(inplace=True)
      (4): ReflectionPad2d((1, 1, 1, 1))
      (5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (14): ResnetBlock(
    (conv_block): Sequential(
      (0): ReflectionPad2d((1, 1, 1, 1))
      (1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): ReLU(inplace=True)
      (4): ReflectionPad2d((1, 1, 1, 1))
      (5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (15): ResnetBlock(
    (conv_block): Sequential(
      (0): ReflectionPad2d((1, 1, 1, 1))
      (1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): ReLU(inplace=True)
      (4): ReflectionPad2d((1, 1, 1, 1))
      (5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (16): ResnetBlock(
    (conv_block): Sequential(
      (0): ReflectionPad2d((1, 1, 1, 1))
      (1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): ReLU(inplace=True)
      (4): ReflectionPad2d((1, 1, 1, 1))
      (5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (17): ResnetBlock(
    (conv_block): Sequential(
      (0): ReflectionPad2d((1, 1, 1, 1))
      (1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): ReLU(inplace=True)
      (4): ReflectionPad2d((1, 1, 1, 1))
      (5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
)

sheiun commented 4 years ago

I found the problem: n_frames_load didn't match the channel size of real_A_all.
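
A guard like the following could surface this kind of mismatch before it reaches the convolution. This is only a sketch: check_frame_stack is a hypothetical helper, and it assumes the tensor reaching the first convolution has shape (N, n_frames * input_nc, H, W), which is inferred from the error messages rather than confirmed from the vid2vid source.

```python
from typing import Sequence


def check_frame_stack(shape: Sequence[int], input_nc: int, n_frames: int) -> None:
    """Raise a descriptive error if an (N, C, H, W) shape does not hold n_frames frames."""
    if len(shape) != 4:
        raise ValueError(f"expected a 4-D (N, C, H, W) shape, got {tuple(shape)}")
    expected = input_nc * n_frames
    actual = shape[1]
    if actual != expected:
        raise ValueError(
            f"expected {expected} channels ({n_frames} frames x {input_nc} channels), "
            f"got {actual}; the number of loaded frames may be too small"
        )


# Shape from the first traceback, with the frame count the layer appears to expect.
check_frame_stack((1, 18, 262, 198), input_nc=6, n_frames=3)  # passes silently

try:
    # The shape that actually reached the layer in the first traceback.
    check_frame_stack((1, 12, 262, 198), input_nc=6, n_frames=3)
except ValueError as err:
    print(err)
```

With a real tensor, tuple(tensor.shape) can be passed as the first argument. Failing early this way names the frame count directly instead of leaving it to be inferred from the convolution's channel error.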