garrickbrazil / M3D-RPN

MIT License

Error while training #32

Closed 111surajmaurya closed 4 years ago

111surajmaurya commented 4 years ago

Hi, I am getting the error below while training. I followed the exact steps given for training; I am able to run inference, but training fails.

Command: python scripts/train_rpn_3d.py --config=kitti_3d_multi_warmup

  File "scripts/train_rpn_3d.py", line 198, in <module>
    main(sys.argv[1:])
  File "scripts/train_rpn_3d.py", line 124, in main
    cls, prob, bbox_2d, bbox_3d, feat_size = rpn_net(images)
  File "/root/utils/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/utils/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/root/utils/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/root/utils/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/root/utils/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/root/utils/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/root/utils/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

Here is my system and package configuration:

Ubuntu: 18.04.3 LTS
CUDA: 10.2
cuDNN: 7.6.5
torch: 1.4.0
python: 3.7.3

If I add os.environ["CUDA_VISIBLE_DEVICES"]="0" to the training file (train_rpn_3d.py), the error above goes away, but a new error appears on the next line:

Traceback (most recent call last):
  File "scripts/train_rpn_3d.py", line 198, in <module>
    main(sys.argv[1:])
  File "scripts/train_rpn_3d.py", line 127, in main
    det_loss, det_stats = criterion_det(cls, prob, bbox_2d, bbox_3d, imobjs, feat_size)
  File "/root/utils/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/M3D-RPN/lib/loss/rpn_3d.py", line 125, in forward
    src_anchors = self.anchors[rois[:, 4].type(torch.cuda.LongTensor), :]
  File "/root/utils/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 486, in __array__
    return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
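For reference, here is a minimal sketch of the CUDA_VISIBLE_DEVICES workaround mentioned above. The helper name restrict_to_gpu is hypothetical and not part of M3D-RPN. The one firm requirement is that the variable is set before CUDA is initialized, so the safest place is the very top of the script, before torch is imported. With only one visible GPU the multi-GPU gather step is presumably never reached, which would explain why the first error disappears.

```python
import os


def restrict_to_gpu(device_ids="0"):
    """Hypothetical helper: expose only the listed GPU ids to this process.

    device_ids is a comma-separated string such as "0" or "0,1".
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Call this at the top of the training script, before CUDA is initialized.
restrict_to_gpu("0")

# import torch  # import torch only after the variable has been set
```

This only hides the other GPUs from the process; it does not address the second error, which comes from the loss code itself.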

Please let me know if this is because of some version issue or code error.

Thanks in advance.

111surajmaurya commented 4 years ago

It looks like the errors above are caused by a PyTorch version mismatch. The code was written for PyTorch 0.4.1 and I was using 1.4.0. After installing PyTorch 0.4.1 with CUDA 9 (and adding os.environ["CUDA_VISIBLE_DEVICES"]="0"), I was able to train the model.
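A version mismatch like this can be caught early with a small guard at the start of the training script. The sketch below is only an illustration: parse_version, matches_expected and EXPECTED_TORCH are hypothetical names, and the expected version (0.4.1) is taken from the comment above rather than from any check in the repository.

```python
EXPECTED_TORCH = (0, 4, 1)  # assumption: the version reported to work above


def parse_version(version):
    """Turn a string such as '1.4.0' or '1.4.0+cu100' into a tuple of ints."""
    core = version.split("+")[0]  # drop a local build tag such as '+cu100'
    parts = []
    for piece in core.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def matches_expected(version, expected=EXPECTED_TORCH):
    """Return True if the installed version equals the expected one."""
    return parse_version(version) == expected


# Possible usage inside the training script (needs torch installed):
# import torch
# if not matches_expected(torch.__version__):
#     print("Warning: this code was reported to work with PyTorch 0.4.1")
```

Printing a warning rather than raising an exception keeps the script usable for anyone who has ported the code to a newer PyTorch release.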