Closed Morizb closed 1 year ago
How did you install the mmcv library? If you compiled it locally, please check whether CUDA/nvcc was enabled during compilation.
Thanks for your reply. I found the problem: when I run python mmdet3d/utils/collect_env.py, it shows
TorchVision: 0.10.0+cu111
OpenCV: 4.8.0
MMCV: 1.2.7
MMCV Compiler: GCC 8.4
MMCV CUDA Compiler: not available
MMDetection: 2.10.0
MMDetection3D: 0.11.0+
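The "MMCV CUDA Compiler: not available" line above is the symptom of an mmcv build whose ops were compiled without CUDA. A minimal sketch for checking this directly is given here; the function name mmcv_cuda_report is made up for illustration, while get_compiler_version and get_compiling_cuda_version are the helpers exposed by mmcv.ops in mmcv-full 1.x. Every import is wrapped so the script still runs when torch or mmcv is missing.

```python
def mmcv_cuda_report():
    """Collect the facts that decide whether mmcv's CUDA ops can run.

    `mmcv_cuda_report` is a hypothetical helper name, not part of mmcv.
    """
    report = {
        "torch_cuda_available": None,
        "torch_cuda_version": None,
        "mmcv_compiler": None,
        "mmcv_cuda_compiler": None,
    }
    try:
        import torch

        report["torch_cuda_available"] = torch.cuda.is_available()
        report["torch_cuda_version"] = torch.version.cuda
    except Exception as exc:  # torch missing or broken
        report["torch_error"] = repr(exc)
    try:
        # These two helpers live in mmcv.ops and need the compiled extension.
        from mmcv.ops import get_compiler_version, get_compiling_cuda_version

        report["mmcv_compiler"] = get_compiler_version()
        report["mmcv_cuda_compiler"] = get_compiling_cuda_version()
    except Exception as exc:  # mmcv missing, or installed without compiled ops
        report["mmcv_error"] = repr(exc)
    return report


if __name__ == "__main__":
    for key, value in mmcv_cuda_report().items():
        print(f"{key}: {value}")
```

If mmcv_cuda_compiler comes back as "not available" while torch_cuda_available is True, the ops were most likely built in a shell where nvcc or the CUDA toolkit was not visible. In that case it is probably necessary to rebuild mmcv-full with CUDA visible, or to install a prebuilt wheel that matches the local torch and CUDA versions.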
Hi, I fixed the previous bug, but when I go on to run sh ./tools/dist_train.sh ./configs/MSMDFusion_nusc_voxel_LC.py 2, it reports the following error:
The environment for installation is as follows:
(msmd) xzluo@d037a065fa35:~/zc/MSMDFusion-main$ conda list
#
_libgcc_mutex 0.1 main https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
_sysroot_linux-64_curr_repodata_hack 3 haa98f57_10 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
absl-py 1.4.0
Do you know what the problem is, please?
The error "numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject" indicates that your numpy version is not binary-compatible with another library that was compiled against a different numpy. To solve this problem, you can refer to this site. However, since numpy is a foundation for other libraries such as torch and scipy, changing the numpy version may cause further version conflicts. Therefore, I suggest you either find the library that is incompatible with the current numpy version, or set up a new environment by referring to my environment details.
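One way to find the incompatible library is to import the candidates one at a time and see which import raises the message quoted above. The following is a rough sketch using only the standard library; the function name and the candidate list are illustrative assumptions, not taken from this thread, and should be replaced with the compiled packages from your own conda list output.

```python
import importlib


def find_numpy_abi_mismatches(module_names):
    """Import each module and collect those that fail with numpy's
    "binary incompatibility" message.

    Modules that are missing or fail for unrelated reasons are ignored.
    """
    mismatched = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:
            if "binary incompatibility" in str(exc):
                mismatched[name] = str(exc)
    return mismatched


if __name__ == "__main__":
    # Hypothetical candidates: packages that ship extensions built against numpy.
    candidates = ["scipy", "pycocotools", "numba", "skimage", "mmcv", "mmdet"]
    for name, message in find_numpy_abi_mismatches(candidates).items():
        print(f"{name}: {message}")
```

Any package this reports was built against a different numpy than the one installed, so reinstalling or rebuilding only that package is usually less disruptive than changing numpy itself.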
What is your graphics card model and how much memory does it have? I can only get two cards, both GeForce RTX 2080 Ti with 11 GB of video memory. When I set samples_per_gpu=2, workers_per_gpu=2, running the code reports an error:
Do you know how to solve this issue?
We use RTX 3090 GPUs with 24 GB of memory. To save GPU memory, you can try techniques such as fp16 training or PyTorch checkpointing.
Hello, I downloaded the fusion_voxel0075_R50.pth you provided and ran sh ./tools/dist_train.sh ./configs/MSMDFusion_nusc_voxel_LC.py 2 for the second-stage training. The error below is reported. I tried some solutions from the Internet but none of them worked. I hope you can point out the cause, thank you!
2023-09-14 10:43:15,801 - mmdet - INFO - Start running, host: xzluo@b5163d5d11c9, work_dir: /public/home/xzluo/zc/MSMDFusion-main/work_dirs/MSMDFusion_nusc_voxel_LC
2023-09-14 10:43:15,801 - mmdet - INFO - workflow: [('train', 1)], max: 6 epochs
Traceback (most recent call last):
  File "./tools/train.py", line 283, in <module>
    main()
  File "./tools/train.py", line 272, in main
    train_detector(
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 46, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 247, in train_step
    losses = self(**data)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/base.py", line 58, in forward
    return self.forward_train(**kwargs)
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/MSMDFusion.py", line 534, in forward_train
    losses_pts = self.forward_pts_train(pts_feats, img_feats, gt_bboxes_3d,
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/MSMDFusion.py", line 574, in forward_pts_train
    losses = self.pts_bbox_head.loss(*loss_inputs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
    return old_func(*args, **kwargs)
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/dense_heads/transfusion_head.py", line 1260, in loss
    layer_loss_cls = self.loss_cls(layer_cls_score, layer_labels, layer_label_weights, avg_factor=max(num_pos, 1))
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/losses/focal_loss.py", line 170, in forward
    loss_cls = self.loss_weight * calculate_loss_func(
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/losses/focal_loss.py", line 85, in sigmoid_focal_loss
    loss = _sigmoid_focal_loss(pred.contiguous(), target, gamma, alpha, None,
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/ops/focal_loss.py", line 54, in forward
    ext_module.sigmoid_focal_loss_forward(
RuntimeError: SigmoidFocalLoss is not compiled with GPU support
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 29983) of binary: /public/home/xzluo/anaconda3/envs/zc/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group