Open dearleiii opened 6 years ago
Ideas to Try:
Analysis: when calling output = model(inputs), the input is split across the GPUs by DataParallel. The call fails at:
File "/usr/project/xtmp/superresoluter/approximator/model1/apxm.py", line 60, in forward
output = self.regressor(x)
Mismatched size:
size mismatch, m1: [2 x 1036320], m2: [1048576 x 256] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:249
To figure out: do m1 and m2 correspond to the input size and the model size, respectively?
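A rough way to answer this, as a sketch: the traceback further down shows the failing call is torch.addmm(bias, input, weight.t()), so m1 should be the flattened input on one GPU and m2 the transposed weight of the first Linear layer. The helper matmul_shape below is hypothetical (it is not a PyTorch function) and only mimics the shape rule in plain Python.

```python
def matmul_shape(m1, m2):
    """Return the shape of m1 @ m2, or raise if the inner dimensions differ."""
    rows, inner1 = m1
    inner2, cols = m2
    if inner1 != inner2:
        raise ValueError(
            f"size mismatch, m1: [{rows} x {inner1}], m2: [{inner2} x {cols}]"
        )
    return (rows, cols)

# m2 is weight.t() for Linear(in_features=1048576, out_features=256).
# The weight is stored as (out_features, in_features), so its transpose
# has shape (in_features, out_features).
weight_t = (1048576, 256)

# m1 is the flattened activation: (samples on this GPU, 32 * 127 * 255).
flattened = (2, 32 * 127 * 255)

try:
    matmul_shape(flattened, weight_t)
except ValueError as err:
    message = str(err)
    print(message)
```

Read this way, m1 comes from the data (batch chunk and image size) and m2 comes from the model definition, so the mismatch is between the image size and the hard-coded in_features.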
Model 1 structure:
leichen@gpu-compute2$ python3 load_model_test.py
DataParallel(
  (module): APXM_conv3(
    (main): Sequential(
      (0): Conv2d(3, 8, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): LeakyReLU(negative_slope=0.2, inplace)
      (2): Conv2d(8, 16, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (3): LeakyReLU(negative_slope=0.2, inplace)
      (4): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
      (5): LeakyReLU(negative_slope=0.2, inplace)
    )
    (regressor): Sequential(
      (0): Linear(in_features=1048576, out_features=256, bias=True)
      (1): LeakyReLU(negative_slope=0.01)
      (2): Linear(in_features=256, out_features=1, bias=True)
    )
  )
)
DataParallel(
  (module): DataParallel(
    (module): APXM_conv3(
      (main): Sequential(
        (0): Conv2d(3, 8, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
        (1): LeakyReLU(negative_slope=0.2, inplace)
        (2): Conv2d(8, 16, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
        (3): LeakyReLU(negative_slope=0.2, inplace)
        (4): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (5): LeakyReLU(negative_slope=0.2, inplace)
      )
      (regressor): Sequential(
        (0): Linear(in_features=1048576, out_features=256, bias=True)
        (1): LeakyReLU(negative_slope=0.01)
        (2): Linear(in_features=256, out_features=1, bias=True)
      )
    )
  )
)
Expected by the regressor: 32 x 128 x 256 = 1048576
In apxm.py:
def forward(self, x):
    x = x.float()
    x = self.main(x)
    print('inputs after model main: ', x.size())
    # Flatten the conv features for the fully connected layers.
    # Size changes from (N, 32, H, W) to (N, 32 * H * W);
    # -1 infers that dimension from the remaining ones.
    x = x.view(x.size(0), -1)
    # regressor is a Sequential and has no .weight of its own,
    # so read the expected size from its first Linear layer.
    print(x.size(1), self.regressor[0].in_features)
    # Run the fully connected layers.
    # Size changes from (N, 32 * H * W) to (N, 1).
    output = self.regressor(x)
    return output
Running results:
intpus: torch.Size([50, 3, 1020, 2040]) scores: torch.Size([50, 1])
inputs after model main: torch.Size([2, 32, 127, 255])
inputs after model main: torch.Size([2, 32, 127, 255])
inputs after model main: torch.Size([2, 32, 127, 255])
inputs after model main: torch.Size([2, 32, 127, 255])
Actual: 32 x 127 x 255 = 1036320
intpus: torch.Size([50, 3, 1020, 2040]) scores: torch.Size([50, 1])
input shape: torch.Size([2, 3, 1020, 2040])
input shape: torch.Size([2, 3, 1020, 2040])
input shape: torch.Size([2, 3, 1020, 2040])
input shape: torch.Size([2, 3, 1020, 2040])
inputs after model main: torch.Size([2, 32, 127, 255])
inputs after model main: torch.Size([2, 32, 127, 255])
inputs after model main: torch.Size([2, 32, 127, 255])
inputs after model main: torch.Size([2, 32, 127, 255])
When using 8 GPUs:
input shape: torch.Size([1, 3, 1020, 2040])
input shape: torch.Size([1, 3, 1020, 2040])
input shape: torch.Size([1, 3, 1020, 2040])
inputs after model main: torch.Size([1, 32, 127, 255])
inputs after model main: torch.Size([1, 32, 127, 255])
inputs after model main: torch.Size([1, 32, 127, 255])
inputs after model main: torch.Size([1, 32, 127, 255])
inputs after model main: torch.Size([1, 32, 127, 255])
inputs after model main: torch.Size([1, 32, 127, 255])
As far as I know, a size mismatch always comes from the network architecture, for example a convolution layer whose output does not have the size the next layer requires. My suggestion is to go through your network code. From your code I see output = sae(input). Look at the sae class and check the size at every layer.
1st problem: the parallel outputs are not combined.
intpus: torch.Size([50, 3, 1020, 2040]) scores: torch.Size([50, 1])
input shape: torch.Size([13, 3, 1020, 2040])
input shape: torch.Size([12, 3, 1020, 2040])
inputs after model main: torch.Size([13, 32, 127, 255])
1036320
inputs after model main: torch.Size([12, 32, 127, 255])
1036320
File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/nn/functional.py", line 992, in linear
return torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [13 x 1036320], m2: [1048576 x 256] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:249
2nd problem: why do 127 and 255 appear instead of 128 and 256?
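A likely explanation, sketched in plain Python: with kernel_size=4, stride=2, padding=1 (and assuming the default dilation of 1), each Conv2d maps a spatial size n to floor((n + 2*1 - 4) / 2) + 1. The helper names conv_out and trace are made up for this illustration.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Conv2d output size along one dimension, assuming dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1

def trace(size, layers=3):
    """Sizes along one dimension before and after each of the conv layers."""
    sizes = [size]
    for _ in range(layers):
        sizes.append(conv_out(sizes[-1]))
    return sizes

print(trace(1020))  # [1020, 510, 255, 127]
print(trace(2040))  # [2040, 1020, 510, 255]
print(trace(1024))  # [1024, 512, 256, 128]
print(trace(2048))  # [2048, 1024, 512, 256]

# Flattened feature counts for the two input sizes (32 output channels).
print(32 * 127 * 255, 32 * 128 * 256)  # 1036320 1048576
```

Under these assumptions, in_features=1048576 matches a 1024 x 2048 input. The 1020 x 2040 images reach an odd height of 255 before the third convolution, where the floor division drops half a step and gives 127 rather than 128, while the width ends at 255 rather than 256.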