SayedNadim / Global-and-Local-Attention-Based-Free-Form-Image-Inpainting

Official implementation of "Global and local attention-based free-form image inpainting"

Dimension mismatch #2

Closed: syomantak closed this issue 4 years ago

syomantak commented 4 years ago

Hey, I am working on RGB-channel MNIST data. I am trying to see if I can get a good inpainting model for some secondary applications. I am getting the following error:

ERROR Given transposed=1, weight of size [16, 128, 4, 4], expected input[1, 9, 3, 3] to have 16 channels, but got 9 channels instead
Traceback (most recent call last):
  File "train.py", line 173, in <module>
    main()
  File "train.py", line 169, in main
    raise e
  File "train.py", line 108, in main
    losses, coarse_result, inpainted_result = trainer(x, mask, ground_truth)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/scripts/trainer.py", line 39, in forward
    x1, x2 = self.netG(x, masks)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/model/network.py", line 22, in forward
    x_stage2 = self.fine_generator(x, x_stage1, mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/model/network.py", line 187, in forward
    x = self.contextul_attention(x, x, mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/model/Attention.py", line 164, in forward
    yi = F.conv_transpose2d(yi, wi_center, stride=self.rate, padding=1) / 4.  # (B=1, C=128, H=64, W=64)
RuntimeError: Given transposed=1, weight of size [16, 128, 4, 4], expected input[1, 9, 3, 3] to have 16 channels, but got 9 channels instead

I am attaching a part of the config file as well

# data parameters
dataset_name: MNIST
data_with_subfolder: False
train_data_path: training_data/training
resume: False
checkpoint_dir: ckp
batch_size: 4
image_shape: [28, 28, 3]
mask_shape: [18, 18]
mask_batch_same: True
max_delta_shape: [16, 16]
margin: [0, 0]
discounted_mask: True
spatial_discounting_gamma: 0.9
random_crop: True
mask_type: hole     # hole | mosaic
mosaic_unit_size: 4
save_image: 500

# training parameters
expname: benchmark
cuda: True
gpu_ids: [0]  # set the GPU ids to use, e.g. [0] or [1, 2]
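
For completeness, a quick way to sanity-check these values after parsing (this assumes the repo loads the YAML with PyYAML; the path is the one I pass to train.py below):

import yaml

# Load the same YAML that train.py receives via --config and print the two
# shapes that matter for the dimension mismatch above.
with open('configs/config.yaml') as f:
    cfg = yaml.safe_load(f)
print(cfg['image_shape'], cfg['mask_shape'])  # expect [28, 28, 3] [18, 18]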

Do you know what's causing the error? The same error seems to be occurring in other similar inpainting networks as well! I confirmed that the images are fine by reading them the same way as the __getitem__ method of the dataset class, and I confirmed that the tensor has shape [3, 28, 28].

A couple more unrelated points: the train.py file should be in the main directory, right? Both of these give a module-not-found error: from the main directory, python scripts/train.py --config configs/config.yaml, and from the scripts directory, python train.py --config configs/config.yaml.

There is also a typo in model/network.py. In Conv2dBlock, the default argument should be pad_type='zeros', not pad_type='zero', and the corresponding if/else branches should be updated to match. The error is raised by the self.conv = nn.Conv2d(...) line below, because padding_mode needs the string 'zeros', not 'zero'.
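
For anyone hitting the same thing, a standalone minimal example (plain nn.Conv2d, not the repo's Conv2dBlock) showing the string PyTorch expects:

import torch
import torch.nn as nn

# padding_mode expects 'zeros'; passing 'zero' trips the error described above
# on recent PyTorch versions.
conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, padding_mode='zeros')
x = torch.randn(1, 3, 28, 28)
print(conv(x).shape)  # torch.Size([1, 16, 28, 28])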

SayedNadim commented 4 years ago

Hi, in line 35, change

raw_w = extract_image_patches(b, ksizes=[kernel, kernel], strides=[self.rate * self.stride, self.rate * self.stride], rates=[1, 1], padding='same')

to

raw_w = extract_image_patches(b, ksizes=[self.ksize, self.ksize], strides=[self.rate * self.stride, self.rate * self.stride], rates=[1, 1], padding='same')

Also, uncomment line 168 to match the padding.

Thanks for the typos!!

syomantak commented 4 years ago

Hello, Thanks for resolving that error, but now I am getting a different error!

Here
torch.Size([8, 256])
2020-06-30 08:33:32,022 ERROR size mismatch, m1: [8 x 256], m2: [16384 x 1] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:283
Traceback (most recent call last):
  File "train.py", line 173, in <module>
    main()
  File "train.py", line 169, in main
    raise e
  File "train.py", line 108, in main
    losses, coarse_result, inpainted_result = trainer(x, mask, ground_truth)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/scripts/trainer.py", line 44, in forward
    refine_real, refine_fake = self.dis_forward(self.globalD, ground_truth, x2_inpaint.detach())
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/scripts/trainer.py", line 68, in dis_forward
    batch_output = netD(batch_data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/model/network.py", line 225, in forward
    x = self.linear(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1610, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [8 x 256], m2: [16384 x 1] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:283

The top two lines are the output of a print statement I inserted in the network.py file. Please let me know if you know the fix for this.

SayedNadim commented 4 years ago

Hi, this is because the spatial size assumed by the discriminator's linear layer is not correct for your 28x28 inputs: the final feature map is 1x1 rather than 8x8, so the flattened size is 256 instead of 16384. In line 215 of network.py, change self.linear = nn.Linear(self.cnum * 4 * 8 * 8, 1) to self.linear = nn.Linear(self.cnum * 4 * 1 * 1, 1) and you should be good to go!
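
A more robust alternative is to infer the flattened size at runtime instead of hard-coding it. A rough sketch (the conv stack and cnum below are placeholders, not the repo's exact discriminator):

import torch
import torch.nn as nn

cnum = 64  # placeholder base channel width

# Toy stand-in for the discriminator's convolutional feature extractor;
# the real layers live in model/network.py.
features = nn.Sequential(
    nn.Conv2d(3, cnum, 5, 2, 2), nn.LeakyReLU(0.2),
    nn.Conv2d(cnum, cnum * 2, 5, 2, 2), nn.LeakyReLU(0.2),
    nn.Conv2d(cnum * 2, cnum * 4, 5, 2, 2), nn.LeakyReLU(0.2),
    nn.Conv2d(cnum * 4, cnum * 4, 5, 2, 2), nn.LeakyReLU(0.2),
)

# Push a dummy batch at the training resolution through the stack to get the
# flattened feature size, then build the linear layer from it.
with torch.no_grad():
    flat = features(torch.zeros(1, 3, 28, 28)).flatten(1).shape[1]
linear = nn.Linear(flat, 1)  # no hard-coded 8 * 8 or 1 * 1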

syomantak commented 4 years ago

Hey, I tried what you suggested, but that seems to throw a different, unrelated error 😅

ERROR one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Traceback (most recent call last):
  File "train.py", line 173, in <module>
    main()
  File "train.py", line 169, in main
    raise e
  File "train.py", line 124, in main
    losses['g'].backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

SayedNadim commented 4 years ago

Hi, can I know your environment settings? In my environment (PyTorch 1.4.0, CUDA 10) it works. Let me try to reproduce the error in your environment. Alternatively, you can set the inplace flag of all the activation functions (ReLU/ELU) to False, specifically in lines 302-312 of network.py. Cheers!
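
For reference, the change amounts to something like this (placeholder layers, not the repo's exact lines 302-312):

import torch.nn as nn

# In-place activations overwrite their input tensor; if autograd still needs that
# tensor for the backward pass, you get the "modified by an inplace operation" error.
act_inplace = nn.ELU(inplace=True)   # in-place version (can trigger the error above)
act_safe = nn.ELU(inplace=False)     # suggested replacement
relu_safe = nn.ReLU(inplace=False)   # same idea for the ReLU blocks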

syomantak commented 4 years ago

Hi, it was a problem with Colab (PyTorch 1.5, CUDA 10.1). I reset my runtime and a different error popped up.

ERROR shape '[4, 128, 7, 7]' is invalid for input of size 18432
Traceback (most recent call last):
  File "train.py", line 173, in <module>
    main()
  File "train.py", line 169, in main
    raise e
  File "train.py", line 108, in main
    losses, coarse_result, inpainted_result = trainer(x, mask, ground_truth)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/scripts/trainer.py", line 39, in forward
    x1, x2 = self.netG(x, masks)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/model/network.py", line 22, in forward
    x_stage2 = self.fine_generator(x, x_stage1, mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/model/network.py", line 187, in forward
    x = self.contextul_attention(x, x, mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Global-and-Local-Attention-Based-Free-Form-Image-Inpainting/model/Attention.py", line 167, in forward
    y.contiguous().view(raw_int_fs)
RuntimeError: shape '[4, 128, 7, 7]' is invalid for input of size 18432

Let me share the link to the Colab notebook. I have added the edits you suggested in one of my repos and am copying the code there directly so it can run smoothly for anyone else. Let me know if you find anything!

SayedNadim commented 4 years ago

Let me reproduce the error. Edit #1: please comment out line 168 and let me know if you face any error. Line 168 is there for any padding changes required; in your case, I think no padding is required.

syomantak commented 4 years ago

@SayedNadim I am getting the gradient error again after uncommenting line 168. In the notebook I shared, I had forgotten to uncomment that line, so I was getting the padding error. I have updated my repo, and now you will be able to see the gradient error.

SayedNadim commented 4 years ago

Yes, I can observe the error in the colab. Can you please try this with PyTorch 1.4.0 and let me know?

syomantak commented 4 years ago

Hey, it turns out it was a version problem with PyTorch. I am finally able to get the model to train. Thanks a lot for your help!

SayedNadim commented 4 years ago

No worries! I am closing this issue then. Cheers!

syomantak commented 4 years ago

@SayedNadim Hey, just wanted to let you know that equation (5) in your paper has a typo, at least in the version linked in the repo.