crash during training

binbinmeng commented 5 years ago

TRAINING - Epoch: [0][410/446] Time 0.602 (0.622) Data 0.000 (0.005) Loss 4.0999 (5.5282) Prec@1 2.344 (3.435) Prec@5 19.531 (14.536)
TRAINING - Epoch: [0][420/446] Time 0.602 (0.622) Data 0.000 (0.005) Loss 4.1251 (5.4952) Prec@1 3.906 (3.459) Prec@5 20.312 (14.664)
TRAINING - Epoch: [0][430/446] Time 0.611 (0.621) Data 0.000 (0.005) Loss 4.0770 (5.4635) Prec@1 3.125 (3.478) Prec@5 24.219 (14.813)
TRAINING - Epoch: [0][440/446] Time 0.600 (0.621) Data 0.000 (0.005) Loss 4.0965 (5.4331) Prec@1 7.031 (3.515) Prec@5 19.531 (14.948)
Traceback (most recent call last):
  File "main.py", line 305, in <module>
    main()
  File "main.py", line 187, in main
    train_loader, model, criterion, epoch, optimizer)
  File "main.py", line 293, in train
    training=True, optimizer=optimizer)
  File "main.py", line 249, in forward
    output = model(inputs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/pytorch-quantization/quantized.pytorch/models/resnet_quantized.py", line 148, in forward
    x = self.layer3(x)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/pytorch-quantization/quantized.pytorch/models/resnet_quantized.py", line 56, in forward
    out = self.bn1(out)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/pytorch-quantization/quantized.pytorch/models/modules/quantize.py", line 272, in forward
    y = y.view(C, self.num_chunks, B * H * W // self.num_chunks)
RuntimeError: invalid argument 2: size '[256 x 16 x 134]' is invalid for input with 551936 elements at ../src/TH/THStorage.cpp:40

amjltc295 commented 5 years ago

I encountered the same issue. I think it happens because the number of elements in the input is not a multiple of C * self.num_chunks. It does not show up until the last step of the epoch, where the batch size differs from the earlier steps.
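
To make the arithmetic concrete, here is a rough check in plain Python using only the numbers from the error message (C = 256, num_chunks = 16, 551936 elements). The helper name `chunked_view_fits` is made up for illustration and is not part of quantized.pytorch; it only mirrors the floor division in the `y.view(...)` line from the traceback.

```python
def chunked_view_fits(num_elements, C, num_chunks):
    """Return the shape that view() would be asked for, and whether it fits.

    Hypothetical helper: mimics C, num_chunks, B * H * W // num_chunks
    from the traceback, given only the total element count.
    """
    per_channel = num_elements // C        # plays the role of B * H * W
    chunk_len = per_channel // num_chunks  # floor division drops the remainder
    requested = C * num_chunks * chunk_len
    return (C, num_chunks, chunk_len), requested == num_elements


# Numbers taken from the RuntimeError in the first post.
shape, ok = chunked_view_fits(551936, C=256, num_chunks=16)
print(shape, ok)  # (256, 16, 134) False

# 551936 / 256 = 2156 values per channel, and 2156 / 16 = 134.75,
# so the floored shape covers only 548864 of the 551936 elements.
print(256 * 16 * 134)  # 548864
```

If this is indeed the cause, one common workaround (not confirmed anywhere in this thread) would be to pass `drop_last=True` to `torch.utils.data.DataLoader` so the odd-sized final batch is skipped, assuming the full-size batches already divide evenly.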

amjltc295 commented 5 years ago

Seems to be a bug

eladhoffer / quantized.pytorch

crash during training #6