andy-yun / pytorch-0.4-yolov3

Yet Another Implementation of PyTorch 0.4.1 and YOLOv3 on Python 3
MIT License

train error #65

Open sdustdk1427 opened 5 years ago

sdustdk1427 commented 5 years ago

When training reached 000035.weights, an error occurred and I don't know why. I have set the image size in the cfg to 416*416. The PyTorch version is 1.0.1. Please help me solve this issue. Thank you very much.

2019-05-09 17:08:44 [035] training with 49.642771 samples/s
2019-05-09 17:08:44 save weights to backup/000035.weights

2019-05-09 17:08:44 [036] processed 133992 samples, lr 1.000000e-03
Traceback (most recent call last):
  File "train.py", line 375, in <module>
    main()
  File "train.py", line 156, in main
    nsamples = train(epoch)
  File "train.py", line 219, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 480 in dimension 2 at /opt/conda/conda-bld/pytorch_1550780889552/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307

andy-yun commented 5 years ago

@sdustdk1427 This is the same error as #55. I have updated dataset.py and train.py; please try the new code. For background, see https://discuss.pytorch.org/t/runtimeerror-invalid-argument-0-sizes-of-tensors-must-match-except-in-dimension-0-got-3-and-2-in-dimension-1/23890/15

sdustdk1427 commented 5 years ago

Today I used your new dataset.py and train.py, but when training reached 000030.weights I hit the same problem again. I read https://discuss.pytorch.org/t/runtimeerror-invalid-argument-0-sizes-of-tensors-must-match-except-in-dimension-0-got-3-and-2-in-dimension-1/23890/15 but I couldn't understand it, sorry. What should I do? Thank you very much.

2019-05-10 07:59:33 [030] training with 48.296028 samples/s
2019-05-10 07:59:33 save weights to backup2/000030.weights

interim evaluating ...
2019-05-10 08:01:59 [030] correct: 1004, precision: 0.327783, recall: 0.657929, fscore: 0.437564
done evaluation.

2019-05-10 08:01:59 [031] processed 147839 samples, lr 1.000000e-03
Traceback (most recent call last):
  File "train.py", line 377, in <module>
    main()
  File "train.py", line 156, in main
    nsamples = train(epoch)
  File "train.py", line 221, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/public/home/G19850028/RWJ/pytorch-0.4-yolov3-master/dataset.py", line 14, in custom_collate
    data = torch.stack([item[0] for item in batch], 0)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 512 in dimension 2 at /opt/conda/conda-bld/pytorch_1550780889552/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307

andy-yun commented 5 years ago

@sdustdk1427 In that case, you can gather more information as follows. In dataset.py, wrap the line data = torch.stack([item[0] for item in batch],0) in a try/except block:

try:
    data = torch.stack([item[0] for item in batch], 0)
except RuntimeError:
    # Stacking failed, so print the type and shape of every image tensor
    # in this batch, then re-raise to keep the original traceback.
    for item in batch:
        print(type(item[0]), tuple(item[0].shape))
    raise

Maybe the images are not being resized to the same size in training mode.
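For readers wondering what this check is for: torch.stack only succeeds when every image tensor in the batch has the same shape, so the goal is to see which shapes actually ended up together. The following is a minimal sketch of that idea, not code from this repository. The helper names shape_report and describe_mismatch are hypothetical, and the sketch assumes each batch item is an (image, target) pair whose image exposes a .shape attribute, as torch tensors and NumPy arrays do.

```python
from collections import Counter


def shape_report(batch):
    """Count how often each image shape occurs in a batch.

    `batch` is a list of (image, target) pairs, which is what a collate
    function receives. Each image only needs a `.shape` attribute.
    """
    return Counter(tuple(image.shape) for image, _target in batch)


def describe_mismatch(batch):
    """Return a readable message if the batch mixes shapes, else None."""
    shapes = shape_report(batch)
    if len(shapes) <= 1:
        return None  # all images agree, so stacking would succeed
    parts = [f"{count} x {shape}" for shape, count in sorted(shapes.items())]
    return "mixed shapes in batch: " + ", ".join(parts)
```

Calling describe_mismatch(batch) from the except branch of a custom collate function and printing the result would show, for example, how many 416-pixel and 480-pixel images were mixed in the failing batch.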

sdustdk1427 commented 5 years ago

I'd like to ask what the above code does. When I comment out def custom_collate(batch), training reaches 000050.weights, but I still run into the same problem as before:

258900: Layer(106) nGT 80, nRC 64, nRC75 25, nPP 107, loss: box 2.187, conf 3.256, class 2.181, total 7.624

2019-05-11 13:07:08 [050] training with 29.621098 samples/s
2019-05-11 13:07:08 save weights to backup5/000050.weights

interim evaluating ...
2019-05-11 13:10:04 [050] correct: 919, precision: 0.369373, recall: 0.526346, fscore: 0.434100
done evaluation.

2019-05-11 13:10:04 [051] processed 264078 samples, lr 1.000000e-03
258964: Layer(082) nGT 105, nRC 78, nRC75 31, nPP 114, loss: box 2.332, conf 2.150, class 1.424, total 5.906
258964: Layer(094) nGT 105, nRC 68, nRC75 17, nPP 0, loss: box 2.786, conf 5.809, class 6.787, total 15.382
258964: Layer(106) nGT 105, nRC 80, nRC75 28, nPP 97, loss: box 2.600, conf 4.034, class 3.364, total 9.998
Traceback (most recent call last):
  File "train.py", line 377, in <module>
    main()
  File "train.py", line 156, in main
    nsamples = train(epoch)
  File "train.py", line 221, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/public/home/G19850028/zheng/Anacoda3/public/home/G19850028/anacoda35/envs/pytorch1.0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 448 in dimension 2 at /opt/conda/conda-bld/pytorch_1550780889552/work/aten/src/TH/generic/THTensorMoreMath.cpp:1307

What should I do?

andy-yun commented 5 years ago

@sdustdk1427 If you comment out "def custom_collate", the default collate_fn is used, which is exactly the first situation (no custom collate_fn). The custom_collate function exists to check for differing image sizes or image types. I don't know the exact conditions of your environment, so I can't tell whether your experimental setup is broken or there is a bug in my code. If the problem persists, I recommend trying another repo published on GitHub. Thanks.
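One observation that may help narrow this down: the mismatched sizes reported in the three error messages (416 against 480, 512 and 448) are all exact multiples of 32. The small check below makes that arithmetic explicit. The stride of 32 is an assumption about the network's total downsampling factor, and grid_cells is a hypothetical helper name; neither is taken from this repository's code.

```python
STRIDE = 32  # assumption: total downsampling factor of the network


def grid_cells(size, stride=STRIDE):
    """Return size // stride if size is an exact multiple of stride, else None."""
    cells, remainder = divmod(size, stride)
    return cells if remainder == 0 else None


# Sizes copied from the error messages in this thread.
reported_sizes = [416, 448, 480, 512]
cells = {size: grid_cells(size) for size in reported_sizes}
print(cells)
```

Every reported size maps to a whole number of grid cells (13 to 16), which is what one would expect if the input size is changed during training rather than the images being resized arbitrarily. If that is what happens here, which this thread does not confirm, images loaded before and after a size change could end up in the same batch and make torch.stack fail.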