JialeCao001 / D2Det

D2Det: Towards High Quality Object Detection and Instance Segmentation (CVPR2020)
https://openaccess.thecvf.com/content_CVPR_2020/papers/Cao_D2Det_Towards_High_Quality_Object_Detection_and_Instance_Segmentation_CVPR_2020_paper.pdf
MIT License

I got a problem when I use the KITTI dataset to train your model #38

Open Machine97 opened 3 years ago

Machine97 commented 3 years ago

Thanks for sharing your research and code with us. Much appreciated. I tried to train your model on the KITTI dataset, but the following error occurs every time:

```
2021-03-01 09:19:43,961 - mmdet - INFO - Epoch [1][440/1856] lr: 5.067e-06, eta: 8:32:39, time: 0.400, data_time: 0.100, memory: 4264, loss_rpn_cls: 0.3128, loss_rpn_bbox: 0.2060, loss_cls: 0.2960, acc: 96.9043, loss_reg: 0.2682, loss_mask: 0.6795, loss: 1.7624
2021-03-01 09:19:47,923 - mmdet - INFO - Epoch [1][450/1856] lr: 5.177e-06, eta: 8:32:01, time: 0.396, data_time: 0.095, memory: 4264, loss_rpn_cls: 0.3051, loss_rpn_bbox: 0.1585, loss_cls: 0.2783, acc: 96.7188, loss_reg: 0.2841, loss_mask: 0.6796, loss: 1.7056
2021-03-01 09:19:51,800 - mmdet - INFO - Epoch [1][460/1856] lr: 5.286e-06, eta: 8:31:11, time: 0.388, data_time: 0.087, memory: 4264, loss_rpn_cls: 0.2838, loss_rpn_bbox: 0.1638, loss_cls: 0.2632, acc: 96.6406, loss_reg: 0.2799, loss_mask: 0.6799, loss: 1.6707
Traceback (most recent call last):
  File "train.py", line 161, in <module>
    main()
  File "train.py", line 157, in main
    meta=meta)
  File "/work_dirs/D2Det_mmdet2.1/mmdet/apis/train.py", line 179, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 43, in train
    self.call_hook('after_train_iter')
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 282, in call_hook
    getattr(hook, fn_name)(self)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 21, in after_train_iter
    runner.outputs['loss'].backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: shape mismatch: value tensor of shape [8, 256, 7, 7] cannot be broadcast to indexing result of shape [9, 256, 7, 7] (make_index_put_iterator at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/ATen/native/TensorAdvancedIndexing.cpp:215)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f43abb90b5e in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::native::_index_put_impl_(at::Tensor&, c10::ArrayRef, at::Tensor const&, bool, bool) + 0x712 (0x7f43d38d0b82 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0xee23de (0x7f43d3c543de in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::native::index_put_(at::Tensor&, c10::ArrayRef, at::Tensor const&, bool) + 0x135 (0x7f43d38c0255 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: + 0xee210e (0x7f43d3c5410e in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: + 0x288fa88 (0x7f43d5601a88 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::generated::IndexPutBackward::apply(std::vector<at::Tensor, std::allocator >&&) + 0x251 (0x7f43d53cc201 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0x2ae8215 (0x7f43d585a215 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node, torch::autograd::InputBuffer&) + 0x16f3 (0x7f43d5857513 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::Engine::thread_main(std::shared_ptr const&, bool) + 0x3d2 (0x7f43d58582f2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f43d5850969 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f43d8b97558 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0xc819d (0x7f43db5ff19d in /opt/conda/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #13: + 0x76db (0x7f43fbfdf6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #14: clone + 0x3f (0x7f43fbd0888f in /lib/x86_64-linux-gnu/libc.so.6)
```

This error occurs randomly at different iterations. In addition, every time the error occurs, the first dimension of the value tensor (8 in the [8, 256, 7, 7] above) is different. Do you know the possible reasons for this error?
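
For reference, the RuntimeError text is the one PyTorch raises when an advanced-indexing assignment (index_put) receives a value tensor with fewer rows than the index selects. A minimal standalone snippet that triggers the same message (purely illustrative; this is not the D2Det code path, only a reproduction of the error text):

```python
import torch

feats = torch.zeros(9, 256, 7, 7)
idx = torch.arange(9)                # the index selects 9 rows
values = torch.randn(8, 256, 7, 7)   # but only 8 rows of values are provided
feats[idx] = values                  # RuntimeError: shape mismatch: value tensor of shape
                                     # [8, 256, 7, 7] cannot be broadcast to indexing result
                                     # of shape [9, 256, 7, 7]
```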

Machine97 commented 3 years ago

When the above error occurs, the batch size is 2. Once I set the batch size to 1, the following error also occurs in addition to the one above:

```
Traceback (most recent call last):
  File "/work_dirs/D2Det_mmdet2.1/tools/train.py", line 161, in <module>
    main()
  File "/work_dirs/D2Det_mmdet2.1/tools/train.py", line 157, in main
    meta=meta)
  File "/work_dirs/D2Det_mmdet2.1/mmdet/apis/train.py", line 179, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 43, in train
    self.call_hook('after_train_iter')
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 282, in call_hook
    getattr(hook, fn_name)(self)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 21, in after_train_iter
    runner.outputs['loss'].backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function IndexPutBackward returned an invalid gradient at index 1 - got [353, 256, 7, 7] but expected shape compatible with [355, 256, 7, 7] (validate_outputs at /opt/conda/conda-bld/pytorch_1587428398394/work/torch/csrc/autograd/engine.cpp:472)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fc24d54fb5e in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x2ae3134 (0x7fc277214134 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node, torch::autograd::InputBuffer&) + 0x548 (0x7fc277215368 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::autograd::Engine::thread_main(std::shared_ptr const&, bool) + 0x3d2 (0x7fc2772172f2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fc27720f969 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fc27a556558 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0xc819d (0x7fc28fae019d in /opt/conda/bin/../lib/libstdc++.so.6)
frame #7: + 0x76db (0x7fc29e1746db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x3f (0x7fc29de9d88f in /lib/x86_64-linux-gnu/libc.so.6)

Process finished with exit code 1
```

Machine97 commented 3 years ago

@JialeCao001 The problem has been solved. The cause is that, during preprocessing of the KITTI dataset, all classes other than Car, Pedestrian, Cyclist and DontCare were marked with -1. In mmdetection 2.1 there is no error reminder such as "Assertion `cur_target >= 0 && cur_target < n_classes` failed", so the invalid labels only show up later as the shape mismatch during backward().
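
For anyone converting KITTI labels themselves, here is a minimal sketch of the kind of filtering that avoids the problem (the helper name and class mapping are illustrative, not part of D2Det; it assumes the plain KITTI devkit .txt label format and a three-class Car/Pedestrian/Cyclist setup). Instead of assigning -1 to unused classes, the unused boxes are skipped entirely so every remaining label is a valid class index:

```python
# Illustrative KITTI label parsing (not part of the D2Det repo).
# mmdetection 2.x expects 0-based labels in the range [0, num_classes).
CLASS_TO_LABEL = {'Car': 0, 'Pedestrian': 1, 'Cyclist': 2}

def parse_kitti_label_file(path):
    """Return (bboxes, labels), dropping DontCare/Van/Truck/Misc/... boxes.

    DontCare regions could instead be collected separately as ignore regions;
    the point here is simply that no label outside [0, num_classes) is emitted.
    """
    bboxes, labels = [], []
    with open(path) as f:
        for line in f:
            fields = line.split()
            name = fields[0]
            if name not in CLASS_TO_LABEL:
                continue  # skip instead of marking with -1
            # KITTI bbox fields: left, top, right, bottom
            x1, y1, x2, y2 = map(float, fields[4:8])
            bboxes.append([x1, y1, x2, y2])
            labels.append(CLASS_TO_LABEL[name])
    return bboxes, labels
```

With the labels guaranteed to lie in [0, num_classes), the out-of-range -1 labels described above never reach the loss computation.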