Tessellate-Imaging / Monk_Object_Detection

A one-stop repository for low-code easily-installable object detection pipelines.
Apache License 2.0

Error in Resume training module of 4_efficientdet, getting after completing 5 epoch. #56

Open waghts95 opened 4 years ago

waghts95 commented 4 years ago

I am using torch 1.6.0, efficientnet-pytorch 0.6.3, and tensorboardX 2.1.

This is my code

    from train_detector import Detector
    gtf = Detector()

    # directs the model towards file structure
    root_dir = "./"
    coco_dir = "cellphone"
    img_dir = "./"
    set_dir = "Images"

    # smells like some free compute from Colab, nice
    gtf.Train_Dataset(root_dir, coco_dir, img_dir, set_dir, batch_size=8, image_size=32, use_gpu=True)
    gtf.Model(model_name="efficientnet-b0", load_pretrained_model_from="/content/trained/signatrix_efficientdet_coco.pth")

    gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
    gtf.Train(num_epochs=50, model_output_dir="trained/")

My error is

Epoch: 1/50. Iteration: 910/910. Cls loss: 0.12021. Reg loss: 0.26245. Batch loss: 0.38265 Total loss: 0.50293 100% 910/910 [24:24<00:00, 1.58s/it]

/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:251: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if len(inputs) == 2:
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:84: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  image_shape = np.array(image_shape)
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:96: TracerWarning: torch.from_numpy results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  anchors = torch.from_numpy(all_anchors.astype(np.float32))
/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:282: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if scores_over_thresh.sum() == 0:

Epoch: 2/50. Iteration: 910/910. Cls loss: 0.17044. Reg loss: 0.19580. Batch loss: 0.36624 Total loss: 0.48137 100% 910/910 [24:31<00:00, 1.57s/it]

Epoch: 3/50. Iteration: 910/910. Cls loss: 0.22575. Reg loss: 0.32424. Batch loss: 0.54999 Total loss: 0.46841 100% 910/910 [24:36<00:00, 1.60s/it]

Epoch: 4/50. Iteration: 910/910. Cls loss: 0.13469. Reg loss: 0.25157. Batch loss: 0.38626 Total loss: 0.45206 100% 910/910 [24:40<00:00, 1.57s/it]

Epoch: 5/50. Iteration: 910/910. Cls loss: 0.24624. Reg loss: 0.34335. Batch loss: 0.58959 Total loss: 0.44057 100% 910/910 [23:59<00:00, 1.54s/it]

Epoch: 6/50. Iteration: 910/910. Cls loss: 0.20909. Reg loss: 0.26789. Batch loss: 0.47698 Total loss: 0.42917 100% 910/910 [23:53<00:00, 1.52s/it]

/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py:253: UserWarning: You are trying to export the model with onnx:Upsample for ONNX opset version 9. This operator might cause results to not match the expected results by PyTorch. ONNX's Upsample/Resize operator did not match Pytorch's Interpolation until opset 11. Attributes to determine how to transform the input were added in onnx:Resize in opset 11 to support Pytorch's behavior (like coordinate_transformation_mode and nearest_mode). We recommend using opset 11 and above for models using this operator.
  "" + str(_export_onnx_opset_version) + ". "

RuntimeError Traceback (most recent call last)
in ()
      1 gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
----> 2 gtf.Train(num_epochs=50, model_output_dir="trained/");

9 frames
/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py in _onnx_opset_unsupported(op_name, current_opset, supported_opset)
    184 def _onnx_opset_unsupported(op_name, current_opset, supported_opset):
    185     raise RuntimeError('Unsupported: ONNX export of {} in '
--> 186                        'opset {}. Please try opset version {}.'.format(op_name, current_opset, supported_opset))
    187
    188

RuntimeError: Unsupported: ONNX export of index_put in opset 9. Please try opset version 11.

abhi-kumar commented 4 years ago

Thank you for pointing out the issue. We will try to resolve it as soon as possible. In the meantime, please try downgrading PyTorch to version 1.4 on your end and check whether the error persists.

waghts95 commented 4 years ago

Okay


abhi-kumar commented 4 years ago

Did a version downgrade help your case?

waghts95 commented 4 years ago

Not tried yet.


abhi-kumar commented 4 years ago

We are unable to reproduce that error with PyTorch v1.4. Please check and let us know.

waghts95 commented 4 years ago

Okay. Sometimes I get the error at epoch 5 and sometimes at epoch 12.


abhi-kumar commented 4 years ago

The error occurs because the ONNX export is still incompatible with torch 1.6, so downgrading to torch 1.4 and torchvision 0.5 will resolve the errors. The requirements files have been updated accordingly.
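A quick way to confirm that an environment actually matches these pins before starting a long run is sketched below. The helper names `major_minor` and `find_mismatches` are hypothetical and not part of Monk; the pins in `REQUIRED` are the torch 1.4 and torchvision 0.5 versions named in this thread.

```python
import re

# Pins taken from this thread; adjust them if the requirements files differ.
REQUIRED = {"torch": (1, 4), "torchvision": (0, 5)}


def major_minor(version):
    """Extract (major, minor) from a version string such as '1.6.0+cu101'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def find_mismatches(installed, required=REQUIRED):
    """Return {package: (found, expected)} for every package that is off its pin."""
    mismatches = {}
    for name, expected in required.items():
        found = major_minor(installed[name])
        if found != expected:
            mismatches[name] = (found, expected)
    return mismatches
```

In a notebook this could be called as `find_mismatches({"torch": torch.__version__, "torchvision": torchvision.__version__})`; an empty dictionary means both packages are on the pinned versions.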

waghts95 commented 4 years ago

Thanks.


waghts95 commented 4 years ago

When I use torch 1.4 and torchvision 0.5, I am getting

loading annotations into memory...
Done (t=0.13s)
creating index...
index created!

RuntimeError Traceback (most recent call last)
in ()
      8 #smells like some free compute from Colab, nice
      9 gtf.Train_Dataset(root_dir, coco_dir, img_dir, set_dir, batch_size=8, image_size=32, use_gpu=True)
---> 10 gtf.Model(model_name="efficientnet-b0",load_pretrained_model_from="/content/trained/signatrix_efficientdet_coco.pth")

2 frames
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in __init__(self, name_or_buffer)
    222 class _open_zipfile_reader(_opener):
    223     def __init__(self, name_or_buffer):
--> 224         super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
    225
    226

RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f5933aff193 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7f5936c879eb in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7f5936c88c04 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x6c53a6 (0x7f597ebb83a6 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: + 0x2961c4 (0x7f597e7891c4 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallDict + 0x35c (0x566ddc in /usr/bin/python3)
frame #6: /usr/bin/python3() [0x594b71]
frame #7: /usr/bin/python3() [0x54a325]
frame #8: /usr/bin/python3() [0x5517c1]
frame #9: _PyObject_FastCallKeywords + 0x19c (0x5a9eec in /usr/bin/python3)
frame #10: /usr/bin/python3() [0x50a783]
frame #11: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #12: /usr/bin/python3() [0x507f24]
frame #13: _PyFunction_FastCallDict + 0x2e2 (0x509202 in /usr/bin/python3)
frame #14: /usr/bin/python3() [0x594b01]
frame #15: /usr/bin/python3() [0x54a17f]
frame #16: /usr/bin/python3() [0x5517c1]
frame #17: _PyObject_FastCallKeywords + 0x19c (0x5a9eec in /usr/bin/python3)
frame #18: /usr/bin/python3() [0x50a783]
frame #19: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #20: /usr/bin/python3() [0x507f24]
frame #21: /usr/bin/python3() [0x509c50]
frame #22: /usr/bin/python3() [0x50a64d]
frame #23: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #24: /usr/bin/python3() [0x507f24]
frame #25: /usr/bin/python3() [0x509c50]
frame #26: /usr/bin/python3() [0x50a64d]
frame #27: _PyEval_EvalFrameDefault + 0x1226 (0x50cfd6 in /usr/bin/python3)
frame #28: /usr/bin/python3() [0x507f24]
frame #29: /usr/bin/python3() [0x5165a5]
frame #30: /usr/bin/python3() [0x50a47f]
frame #31: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #32: /usr/bin/python3() [0x507f24]
frame #33: /usr/bin/python3() [0x509c50]
frame #34: /usr/bin/python3() [0x50a64d]
frame #35: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #36: /usr/bin/python3() [0x507f24]
frame #37: /usr/bin/python3() [0x509c50]
frame #38: /usr/bin/python3() [0x50a64d]
frame #39: _PyEval_EvalFrameDefault + 0x1226 (0x50cfd6 in /usr/bin/python3)
frame #40: /usr/bin/python3() [0x507f24]
frame #41: _PyFunction_FastCallDict + 0x2e2 (0x509202 in /usr/bin/python3)
frame #42: /usr/bin/python3() [0x594b01]
frame #43: PyObject_Call + 0x3e (0x59fe1e in /usr/bin/python3)
frame #44: _PyEval_EvalFrameDefault + 0x17e6 (0x50d596 in /usr/bin/python3)
frame #45: /usr/bin/python3() [0x507f24]
frame #46: /usr/bin/python3() [0x509c50]
frame #47: /usr/bin/python3() [0x50a64d]
frame #48: _PyEval_EvalFrameDefault + 0x1226 (0x50cfd6 in /usr/bin/python3)
frame #49: /usr/bin/python3() [0x507f24]
frame #50: /usr/bin/python3() [0x509c50]
frame #51: /usr/bin/python3() [0x50a64d]
frame #52: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #53: /usr/bin/python3() [0x509918]
frame #54: /usr/bin/python3() [0x50a64d]
frame #55: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #56: /usr/bin/python3() [0x509918]
frame #57: /usr/bin/python3() [0x50a64d]
frame #58: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #59: /usr/bin/python3() [0x507f24]
frame #60: /usr/bin/python3() [0x588e91]
frame #61: PyObject_Call + 0x3e (0x59fe1e in /usr/bin/python3)
frame #62: _PyEval_EvalFrameDefault + 0x17e6 (0x50d596 in /usr/bin/python3)
frame #63: /usr/bin/python3() [0x507f24]

waghts95 commented 4 years ago

Earlier I was able to reach epoch 5, or sometimes epoch 13. Now training starts, but after a minute I get the output below. (I am not using torch==1.4 and torchvision==0.5 here, because with those versions training does not start at all and immediately gives the error above.)

100% 910/910 [01:55<00:00, 7.89it/s]
The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 3
[the same message is printed many more times]

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:297: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if len(inputs) == 2:
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:84: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  image_shape = np.array(image_shape)
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:96: TracerWarning: torch.from_numpy results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  anchors = torch.from_numpy(all_anchors.astype(np.float32))
/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:328: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if scores_over_thresh.sum() == 0:
/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py:253: UserWarning: You are trying to export the model with onnx:Upsample for ONNX opset version 9. This operator might cause results to not match the expected results by PyTorch. ONNX's Upsample/Resize operator did not match Pytorch's Interpolation until opset 11. Attributes to determine how to transform the input were added in onnx:Resize in opset 11 to support Pytorch's behavior (like coordinate_transformation_mode and nearest_mode). We recommend using opset 11 and above for models using this operator.
  "" + str(_export_onnx_opset_version) + ". "

RuntimeError Traceback (most recent call last)
in ()
      1 gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
----> 2 gtf.Train(num_epochs=50, model_output_dir="trained1/")

9 frames
/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py in _onnx_opset_unsupported(op_name, current_opset, supported_opset)
    184 def _onnx_opset_unsupported(op_name, current_opset, supported_opset):
    185     raise RuntimeError('Unsupported: ONNX export of {} in '
--> 186                        'opset {}. Please try opset version {}.'.format(op_name, current_opset, supported_opset))
    187
    188

RuntimeError: Unsupported: ONNX export of index_put in opset 9. Please try opset version 11.

abhi-kumar commented 4 years ago


Don't mix versions when resuming training. Keep every training run on PyTorch 1.4 and torchvision 0.5, starting from the very first run. A checkpoint saved with version 1.5 or 1.6 may not be loadable in version 1.4.
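One way to tell in advance whether a checkpoint is likely to hit this problem is to check its on-disk format. The sketch below rests on the documented fact that `torch.save` in PyTorch 1.6 writes a zipfile-based format by default; the helper name `uses_zip_format` is hypothetical, and the check only inspects the container, not the tensors inside it.

```python
import zipfile


def uses_zip_format(checkpoint_path):
    """Return True if the checkpoint is a zip archive.

    torch.save in PyTorch 1.6 produces zip archives by default, which is the
    kind of file an older torch.load may refuse with a
    'maximum supported version for reading' error.
    """
    return zipfile.is_zipfile(checkpoint_path)
```

If a checkpoint has to move from a 1.6 environment to an older one, the PyTorch 1.6 documentation describes passing `_use_new_zipfile_serialization=False` to `torch.save` to write the older format instead. Whether that is enough for this particular model has not been verified in this thread.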

waghts95 commented 4 years ago


Please let me know how I can deal with this error.

abhi-kumar commented 4 years ago

WAY 1:

a) Switch to torch==1.4, torchvision==0.5 and efficientnet_pytorch==0.6.3
b) Train your first detector
c) Then resume or reload training from this checkpoint.

WAY 2:

After cloning the library, comment out lines 393-396 and 400-403 in the file Monk_Object_Detection/4_efficientdet/lib/train_detector.py.

These are the lines:

    torch.onnx.export(self.system_dict["local"]["model"].module, dummy_input,
                      os.path.join(self.system_dict["output"]["saved_path"], "signatrix_efficientdet_coco.onnx"),
                      verbose=False)

and

    torch.onnx.export(self.system_dict["local"]["model"], dummy_input,
                      os.path.join(self.system_dict["output"]["saved_path"], "signatrix_efficientdet_coco.onnx"),
                      verbose=False)

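As an alternative to commenting the calls out, the export could be wrapped so that a failed ONNX export does not abort training. This is only a sketch under assumptions: `export_or_skip` is a hypothetical helper, not something Monk provides, and it treats any `RuntimeError` from the export as non-fatal.

```python
def export_or_skip(export_fn, *args, **kwargs):
    """Run an export callable; report failure instead of raising.

    Returns True when export_fn finishes, False when it raises RuntimeError
    (the exception type shown in the traceback in this thread).
    """
    try:
        export_fn(*args, **kwargs)
        return True
    except RuntimeError as err:
        print(f"ONNX export skipped: {err}")
        return False
```

With PyTorch installed, a call might look like `export_or_skip(torch.onnx.export, model, dummy_input, onnx_path, verbose=False, opset_version=11)`. `opset_version` is an existing `torch.onnx.export` parameter and 11 is the value the error message suggests, but whether this model exports cleanly at opset 11 has not been tested here.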
waghts95 commented 4 years ago

WAY 2 did not work. For WAY 1:
a) Switch to torch==1.4, torchvision==0.5 and efficientnet_pytorch==0.6.3 ====> Done.
b) Train your first detector =====> Training runs, but it keeps printing 'The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 3' and does not show any training status such as epoch or loss details.

abhi-kumar commented 4 years ago

Please share your code.

waghts95 commented 4 years ago

Shared.

abhi-kumar commented 4 years ago

The image size is 32? For efficientnet-b0 the image size should be 512. See this example: https://github.com/Tessellate-Imaging/Monk_Object_Detection/blob/master/example_notebooks/4_efficientdet/train%20-%20with%20validation%20dataset.ipynb

waghts95 commented 4 years ago

How was it working earlier?

On Thu, Aug 27, 2020, 13:18 Abhishek Kumar Annamraju < notifications@github.com> wrote:

The image size is 32? For EfficientNet - b0 image size should be 512


abhi-kumar commented 4 years ago

Earlier, if the image shapes were inconsistent, the library automatically switched to default shapes. Since the latest efficientnet_pytorch upgrade requires the shape to be passed in manually, we have made the argument required, and it can no longer tolerate inconsistent values.

abhi-kumar commented 4 years ago

Keep image shape as 512 with B0 version and the training engine will scale annotations accordingly.

waghts95 commented 4 years ago

'The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 3' and not showing training status like epoch details, loss details, etc.

This error is gone. Thank you very much.

waghts95 commented 4 years ago

I used Way 1 and could successfully train the model, and resuming training also worked fine. Today, when I tried to resume training again, I got the error in the attached text file: resume_training_error.txt

abhi-kumar commented 4 years ago

Since you are using Colab, make sure the installed package versions are correct.

Also comment out the two torch.onnx.export calls mentioned in Way 2.
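One way to confirm the versions in a running session is to compare what is installed against the pins from Way 1. The following is only a sketch: the helper names (`installed_version`, `matches_pin`, `check_pins`) are made up for illustration and are not part of Monk_Object_Detection, and `importlib.metadata` needs Python 3.8 or newer (on an older runtime such as the Python 3.6 shown in the traceback, `pkg_resources.get_distribution(name).version` could be substituted).

```python
from importlib import metadata

# Pins taken from Way 1 in this thread.
PINS = {"torch": "1.4", "torchvision": "0.5", "efficientnet_pytorch": "0.6.3"}


def installed_version(name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pin):
    """True if `installed` equals `pin` or is a patch release of it.

    A local build tag such as "+cu101" is ignored, so "1.4.0+cu101"
    matches the pin "1.4".
    """
    if installed is None:
        return False
    base = installed.split("+")[0]
    return base == pin or base.startswith(pin + ".")


def check_pins(pins, lookup=installed_version):
    """Return {package: installed_version} for every package off its pin."""
    mismatches = {}
    for name, pin in pins.items():
        found = lookup(name)
        if not matches_pin(found, pin):
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    # An empty dict means every pinned package matches.
    print(check_pins(PINS))
```

Running this in the notebook before calling gtf.Train shows at a glance whether the runtime was reset to its default packages, for example after a Colab restart.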

waghts95 commented 4 years ago

The versions are as per your colab_requirement.txt, and commenting out the lines did not help either.

alsheabi commented 3 years ago

Try adding opset_version=11 to the calls in Way 2. After the change, the call looks like this:

torch.onnx.export(self.system_dict["local"]["model"].module, dummy_input,
                  os.path.join(self.system_dict["output"]["saved_path"], "signatrix_efficientdet_coco.onnx"),
                  verbose=False, opset_version=11)
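If the export is only a convenience and training should not stop when it fails, the call could also be wrapped so that a RuntimeError like the one above is reported instead of raised. This is a sketch under assumptions: `export_onnx_best_effort` is a hypothetical helper, not something the library provides, and the export function is passed in as an argument so the wrapper itself does not depend on torch.

```python
def export_onnx_best_effort(export_fn, model, dummy_input, path,
                            opset_version=11, **kwargs):
    """Call export_fn(model, dummy_input, path, opset_version=..., **kwargs).

    Returns True if the export succeeded and False if it raised a
    RuntimeError, so a failed export does not abort the training loop.
    """
    try:
        export_fn(model, dummy_input, path,
                  opset_version=opset_version, **kwargs)
    except RuntimeError as err:
        print("ONNX export skipped: {}".format(err))
        return False
    return True


# Intended use inside train_detector.py (assumption, untested there):
#   export_onnx_best_effort(torch.onnx.export,
#                           self.system_dict["local"]["model"].module,
#                           dummy_input, onnx_path, verbose=False)
```

The PyTorch checkpoint is what resume training loads, so skipping a failed ONNX export should not affect resuming, although that is an inference from this thread rather than something confirmed by the maintainers.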

alsheabi commented 3 years ago

Keep image shape as 512 with B0 version and the training engine will scale annotations accordingly.

Hello @abhi-kumar, I used 786 for B2 but I got the same error. Any suggestions? The size of tensor a (49) must match the size of tensor b (48) at non-singleton dimension 3.

aritzLizoain commented 3 years ago

Keep image shape as 512 with B0 version and the training engine will scale annotations accordingly.

Hello @abhi-kumar, I used 786 for B2 but I got the same error. Any suggestions? The size of tensor a (49) must match the size of tensor b (48) at non-singleton dimension 3.

I obtain the same error. It only disappears when I use image_size = 512, regardless of the chosen model version. E.g. image_size = 786 and model version B2 fails, while image_size = 512 and model version B2 works.

I tried modifying dummy_input from torch.rand(1, 3, 512, 512) to torch.rand(1, 3, image_size, image_size) in lines 387 and 452 of train_detector.py, but nothing changed.
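A possible explanation for the 49 vs 48 numbers, offered as a guess and not confirmed by the maintainers: if the network builds feature maps at strides 8 to 128 (levels P3 to P7), rounds sizes down, and upsamples each coarser map by 2 before adding it to the next finer one, then an image size that is not divisible by 128 produces mismatched shapes. The sketch below only does that arithmetic; `pyramid_sizes` and `first_upsample_mismatch` are illustrative names, and the level range and rounding rule are assumptions about the model.

```python
def pyramid_sizes(image_size, min_level=3, max_level=7):
    """Side length of the feature map at each level, assuming
    size = image_size // 2**level (halving with rounding down)."""
    return {level: image_size // (2 ** level)
            for level in range(min_level, max_level + 1)}


def first_upsample_mismatch(image_size, min_level=3, max_level=7):
    """Walk from the coarsest level towards the finest.

    Returns (level, lateral_size, upsampled_size) for the first level whose
    size differs from twice the size of the level above it, or None if
    every level lines up.
    """
    sizes = pyramid_sizes(image_size, min_level, max_level)
    for level in range(max_level - 1, min_level - 1, -1):
        upsampled = sizes[level + 1] * 2
        if sizes[level] != upsampled:
            return level, sizes[level], upsampled
    return None


print(first_upsample_mismatch(786))  # → (4, 49, 48)
print(first_upsample_mismatch(512))  # → None
print(first_upsample_mismatch(768))  # → None
```

Under these assumptions 786 gives exactly the 49 and 48 from the error message, while 512 and 768 line up at every level. Whether the library actually accepts sizes other than 512 is not established in this thread, so this check only rules out sizes that cannot work arithmetically.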

abhi-kumar commented 3 years ago

Thank you for mentioning the issue.

The issue will be taken into consideration very soon (most probably post Christmas).

srihari12345 commented 3 years ago

@abhi-kumar I have finished 200 epochs using '7_yolov3' with train_detector.py. Now I need to train for 200 more epochs starting from the saved weights. How can I resume training?

alsheabi commented 3 years ago

@abhi-kumar Any update on this issue?