Morizb commented 1 year ago

Hello, after downloading the fusion_voxel0075_R50.pth checkpoint you provided, I ran sh ./tools/dist_train.sh ./configs/MSMDFusion_nusc_voxel_LC.py 2 for the second-stage training and got the error below. I tried several solutions from the Internet, but none of them worked. I hope you can point out the problem. Thank you!

2023-09-14 10:43:15,801 - mmdet - INFO - Start running, host: xzluo@b5163d5d11c9, work_dir: /public/home/xzluo/zc/MSMDFusion-main/work_dirs/MSMDFusion_nusc_voxel_LC
2023-09-14 10:43:15,801 - mmdet - INFO - workflow: [('train', 1)], max: 6 epochs
Traceback (most recent call last):
  File "./tools/train.py", line 283, in <module>
    main()
  File "./tools/train.py", line 272, in main
    train_detector(
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 46, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 247, in train_step
    losses = self(**data)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/base.py", line 58, in forward
    return self.forward_train(**kwargs)
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/MSMDFusion.py", line 534, in forward_train
    losses_pts = self.forward_pts_train(pts_feats, img_feats, gt_bboxes_3d,
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/MSMDFusion.py", line 574, in forward_pts_train
    losses = self.pts_bbox_head.loss(*loss_inputs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
    return old_func(*args, **kwargs)
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/dense_heads/transfusion_head.py", line 1260, in loss
    layer_loss_cls = self.loss_cls(layer_cls_score, layer_labels, layer_label_weights, avg_factor=max(num_pos, 1))
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/losses/focal_loss.py", line 170, in forward
    loss_cls = self.loss_weight * calculate_loss_func(
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/losses/focal_loss.py", line 85, in sigmoid_focal_loss
    loss = _sigmoid_focal_loss(pred.contiguous(), target, gamma, alpha, None,
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/ops/focal_loss.py", line 54, in forward
    ext_module.sigmoid_focal_loss_forward(
RuntimeError: SigmoidFocalLoss is not compiled with GPU support
Traceback (most recent call last):
  File "./tools/train.py", line 283, in <module>
    main()
  File "./tools/train.py", line 272, in main
    train_detector(
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 46, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 247, in train_step
    losses = self(**data)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/base.py", line 58, in forward
    return self.forward_train(**kwargs)
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/MSMDFusion.py", line 534, in forward_train
    losses_pts = self.forward_pts_train(pts_feats, img_feats, gt_bboxes_3d,
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/MSMDFusion.py", line 574, in forward_pts_train
    losses = self.pts_bbox_head.loss(*loss_inputs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
    return old_func(*args, **kwargs)
  File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/dense_heads/transfusion_head.py", line 1260, in loss
    layer_loss_cls = self.loss_cls(layer_cls_score, layer_labels, layer_label_weights, avg_factor=max(num_pos, 1))
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/losses/focal_loss.py", line 170, in forward
    loss_cls = self.loss_weight * calculate_loss_func(
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/losses/focal_loss.py", line 85, in sigmoid_focal_loss
    loss = _sigmoid_focal_loss(pred.contiguous(), target, gamma, alpha, None,
  File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/ops/focal_loss.py", line 54, in forward
    ext_module.sigmoid_focal_loss_forward(
RuntimeError: SigmoidFocalLoss is not compiled with GPU support
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 29983) of binary: /public/home/xzluo/anaconda3/envs/zc/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group

SxJyJay commented 1 year ago

How did you install the mmcv library? If you compiled it locally, please check whether CUDA/nvcc was available during compilation.
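One way to run the check suggested above is to ask mmcv which CUDA compiler its ops were built with. This is a minimal sketch, assuming mmcv-full is installed in the active environment; `get_compiling_cuda_version` and `get_compiler_version` are helpers exported by `mmcv.ops`, while `cuda_build_report` is a hypothetical name chosen for illustration. Entries stay `None` for packages that cannot be imported.

```python
def cuda_build_report():
    """Collect CUDA-related build facts for torch and mmcv (None if unavailable)."""
    report = {
        "torch_cuda": None,          # CUDA version torch was built against
        "cuda_available": None,      # whether torch can see a GPU at runtime
        "mmcv_cuda_compiler": None,  # CUDA compiler used to build the mmcv ops
        "mmcv_compiler": None,       # host compiler used to build the mmcv ops
    }
    try:
        import torch
        report["torch_cuda"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        pass  # torch is not importable from this interpreter
    try:
        from mmcv.ops import get_compiling_cuda_version, get_compiler_version
        report["mmcv_cuda_compiler"] = get_compiling_cuda_version()
        report["mmcv_compiler"] = get_compiler_version()
    except ImportError:
        pass  # mmcv-full (with compiled ops) is not importable
    return report


if __name__ == "__main__":
    for key, value in cuda_build_report().items():
        print(f"{key}: {value}")
```

If `mmcv_cuda_compiler` comes back as `not available` while `torch_cuda` shows a version, the mmcv ops were most likely built without nvcc, which would be consistent with the "not compiled with GPU support" error.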

Morizb commented 1 year ago

Thanks for your reply. I found the problem: when I run python mmdet3d/utils/collect_env.py, it shows:

TorchVision: 0.10.0+cu111
OpenCV: 4.8.0
MMCV: 1.2.7
MMCV Compiler: GCC 8.4
MMCV CUDA Compiler: not available
MMDetection: 2.10.0
MMDetection3D: 0.11.0+

Morizb commented 1 year ago

Hi, I fixed the previous problem: [image: 725c7603607ff52e1ece8d5c519f7ac]

But when I run sh ./tools/dist_train.sh ./configs/MSMDFusion_nusc_voxel_LC.py 2 again, it reports the following error: [image: 6acddd9bce1281e0d09aa8b179f2c52]

The installed environment is as follows:

(msmd) xzluo@d037a065fa35:~/zc/MSMDFusion-main$ conda list
# packages in environment at /public/home/xzluo/anaconda3/envs/msmd:
#
# Name  Version  Build  Channel
_libgcc_mutex  0.1  main  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
_sysroot_linux-64_curr_repodata_hack  3  haa98f57_10  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
absl-py  1.4.0
addict  2.4.0
aiofiles  22.1.0
aiosqlite  0.19.0
anyio  3.7.1
argon2-cffi  23.1.0
argon2-cffi-bindings  21.2.0
arrow  1.2.3
astor  0.8.1
attrs  23.1.0
Babel  2.12.1
backcall  0.2.0
beautifulsoup4  4.12.2
binutils_impl_linux-64  2.35.1  h27ae35d_9  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
binutils_linux-64  2.35.1  h454624a_30  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
black  23.3.0
blas  1.0  mkl  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
bleach  6.0.0
ca-certificates  2019.11.28  hecc5488_0  moussi
cached-property  1.5.2
cachetools  4.2.4
ccimport  0.4.2
certifi  2019.11.28  py37_0  moussi
cffi  1.15.1
charset-normalizer  3.2.0
click  8.1.7
comm  0.1.4
cumm-cu117  0.4.11
cycler  0.11.0
Cython  3.0.2
dataclasses  0.6
debugpy  1.7.0
decorator  5.1.1
defusedxml  0.7.1
deprecation  2.1.0
descartes  1.1.0
entrypoints  0.4
exceptiongroup  1.1.3
fastjsonschema  2.18.0
fire  0.5.0
flake8  5.0.4
fonttools  4.38.0
fqdn  1.5.1
future  0.18.3
gast  0.2.2
gcc_impl_linux-64  8.4.0  he7ac559_17  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
gcc_linux-64  8.4.0  he201b7d_30  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
google-auth  1.35.0
google-auth-oauthlib  0.4.6
google-pasta  0.2.0
grpcio  1.58.0
gxx_impl_linux-64  8.4.0  h9ce2e92_17  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
gxx_linux-64  8.4.0  h85ed34b_30  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
h5py  3.8.0
idna  3.4
imageio  2.27.0
importlib-metadata  4.2.0
importlib-resources  5.12.0
iniconfig  2.0.0
intel-openmp  2022.0.1  h06a4308_3633  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
ipdb  0.13.13
ipykernel  6.16.2
ipython  7.34.0
ipython-genutils  0.2.0
ipywidgets  8.1.1
isoduration  20.11.0
jedi  0.19.0
Jinja2  3.1.2
joblib  1.3.2
json5  0.9.14
jsonpointer  2.4
jsonschema  4.17.3
jupyter  1.0.0
jupyter-console  6.6.3
jupyter-events  0.6.3
jupyter-server  1.24.0
jupyter-ydoc  0.2.5
jupyter_client  7.4.9
jupyter_core  4.12.0
jupyter_packaging  0.12.3
jupyter_server_fileid  0.9.0
jupyter_server_ydoc  0.8.0
jupyterlab  3.6.5
jupyterlab-pygments  0.2.2
jupyterlab-widgets  3.0.9
jupyterlab_server  2.24.0
Keras-Applications  1.0.8
Keras-Preprocessing  1.1.2
kernel-headers_linux-64  3.10.0  h57e8cba_10  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
kiwisolver  1.4.5
lark  1.1.7
ld_impl_linux-64  2.35.1  h7274673_9  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libffi  3.2.1  he1b5a44_1007  moussi
libgcc-devel_linux-64  8.4.0  hd257e2f_17  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-ng  9.1.0  hdf63c60_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgfortran-ng  7.3.0  hdf63c60_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgomp  11.2.0  h1234567_1  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libstdcxx-devel_linux-64  8.4.0  hf0c5c8d_17  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libstdcxx-ng  9.1.0  hdf63c60_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
llvmlite  0.31.0
lyft-dataset-sdk  0.0.8
Markdown  3.3.4
MarkupSafe  2.1.3
matplotlib  3.5.2
matplotlib-inline  0.1.6
mccabe  0.7.0
mistune  3.0.1
mkl  2019.4  243  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
mkl-service  2.3.0  py37he8ac12f_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
mkl_fft  1.0.14  py37hd81dba3_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
mkl_random  1.0.4  py37hd81dba3_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
mmcv-full  1.2.7
mmdet  2.10.0
mmdet3d  0.11.0
mmpycocotools  12.0.3
mypy-extensions  1.0.0
nbclassic  1.0.0
nbclient  0.7.4
nbconvert  7.6.0
nbformat  5.8.0
ncurses  6.3  h7f8727e_2  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
nest-asyncio  1.5.7
networkx  2.2
ninja  1.11.1
notebook  6.5.5
notebook_shim  0.2.3
numba  0.48.0
numpy  1.17.0  py37h7e9f1db_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
numpy  1.19.5
numpy-base  1.17.0  py37hde5b4d6_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
nuscenes-devkit  1.1.10
oauthlib  3.2.2
open3d  0.13.0
opencv-python  4.5.5.64
openssl  1.1.1e  h516909a_0  moussi
opt-einsum  3.3.0
packaging  23.1
pandas  1.3.5
pandocfilters  1.5.0
parso  0.8.3
pathspec  0.11.2
pccm  0.4.8
pexpect  4.8.0
pickleshare  0.7.5
Pillow  9.5.0
pip  22.3.1  py37h06a4308_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
pkgutil_resolve_name  1.3.10
platformdirs  3.10.0
plotly  5.16.1
pluggy  1.2.0
plyfile  0.8.1
portalocker  2.7.0
prometheus-client  0.17.1
prompt-toolkit  3.0.39
protobuf  4.24.3
psutil  5.9.5
ptyprocess  0.7.0
pyasn1  0.5.0
pyasn1-modules  0.3.0
pybind11  2.11.1
pycodestyle  2.9.1
pycparser  2.21
pyflakes  2.5.0
Pygments  2.16.1
pyparsing  3.1.1
pyquaternion  0.9.9
pyrsistent  0.19.3
pytest  7.4.2
python  3.7.7  hcf32534_0_cpython  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil  2.8.2
python-json-logger  2.0.7
pytz  2023.3.post1
PyWavelets  1.3.0
PyYAML  6.0.1
pyzmq  24.0.1
qtconsole  5.4.4
QtPy  2.4.0
readline  8.1.2  h7f8727e_1  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
requests  2.31.0
requests-oauthlib  1.3.1
rfc3339-validator  0.1.4
rfc3986-validator  0.1.1
rsa  4.9
scikit-image  0.19.3
scikit-learn  1.0.2
scipy  1.4.1
Send2Trash  1.8.2
setuptools  65.6.3  py37h06a4308_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
Shapely  1.8.5
six  1.16.0  pyhd3eb1b0_1  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
sniffio  1.3.0
soupsieve  2.4.1
spconv-cu117  2.3.6
sqlite  3.38.5  hc218d9a_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
sysroot_linux-64  2.17  h57e8cba_10  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tenacity  8.2.3
tensorboard  2.1.1
tensorflow-estimator  2.1.0
tensorflow-gpu  2.1.0
termcolor  2.3.0
terminado  0.17.1
terminaltables  3.1.10
threadpoolctl  3.1.0
tifffile  2021.11.2
tinycss2  1.2.1
tk  8.6.12  h1ccaba5_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tomli  2.0.1
tomlkit  0.12.1
torch  1.7.0+cu110
torch-scatter  2.0.7
torchaudio  0.7.0
torchvision  0.8.1+cu110
tornado  6.2
tqdm  4.66.1
traitlets  5.9.0
trimesh  2.35.39
typed-ast  1.5.5
typing_extensions  4.7.1
uri-template  1.3.0
urllib3  2.0.4
waymo-open-dataset-tf-2-1-0  1.2.0
wcwidth  0.2.6
webcolors  1.13
webencodings  0.5.1
websocket-client  1.6.1
Werkzeug  2.2.3
wheel  0.38.4  py37h06a4308_0  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
widgetsnbextension  4.0.9
wrapt  1.15.0
xz  5.2.5  h7f8727e_1  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
y-py  0.6.0
yapf  0.40.1
ypy-websocket  0.8.4
zipp  3.15.0
zlib  1.2.12  h7f8727e_2  https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

Do you know what the problem is, please?

SxJyJay commented 1 year ago

The error "numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject" indicates that your numpy version is not compatible with another library. To solve this problem, you can refer to this site. However, since numpy is a dependency of many other libraries such as torch and scipy, changing the numpy version may cause further version conflicts. Therefore, I suggest you either find the library that is incompatible with the current numpy version, or set up a new environment by referring to my environment details.
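One possible place to start looking, offered as a sketch rather than a diagnosis: the conda list output posted earlier in this thread contains numpy twice (1.17.0 from a conda channel and 1.19.5 with no build string). Whether that duplicate is what triggers the error is an assumption. The hypothetical helper `find_duplicate_packages` below scans saved `conda list` text, one package per line, for names that appear with more than one version.

```python
from collections import defaultdict


def find_duplicate_packages(conda_list_text):
    """Return {name: [versions]} for packages listed with more than one version."""
    versions = defaultdict(list)
    for line in conda_list_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#' header lines
        fields = line.split()
        if len(fields) < 2:
            continue  # not a "name version ..." row
        name, version = fields[0].lower(), fields[1]
        if version not in versions[name]:
            versions[name].append(version)
    return {name: found for name, found in versions.items() if len(found) > 1}


# A few rows taken from the listing in this thread.
sample = """\
# Name                    Version                   Build  Channel
mmcv-full                 1.2.7
numpy                     1.17.0           py37h7e9f1db_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
numpy                     1.19.5
scipy                     1.4.1
"""

print(find_duplicate_packages(sample))  # prints {'numpy': ['1.17.0', '1.19.5']}
```

If a duplicate shows up, running `python -c "import numpy; print(numpy.__version__, numpy.__file__)"` shows which copy the interpreter actually imports.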

Morizb commented 1 year ago

What is your graphics card model and memory? I can only get two cards, both GeForce RTX 2080 Ti with 11 GB of video memory. When I set samples_per_gpu=2 and workers_per_gpu=2, running the code reports an error: [image: cf80622e18c5ee28f841b1dd2e57e70]

Do you know how to solve this issue?

SxJyJay commented 1 year ago

We use RTX 3090 GPUs with 24 GB of memory. You can try techniques such as fp16 training or PyTorch checkpointing to save GPU memory.
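As a rough sketch of what these suggestions could look like for this setup, the overrides below lower the per-GPU batch size, turn on mixed precision, and enable gradient checkpointing in the image backbone. All values are assumptions: an `fp16` entry with `loss_scale` is the mmdet 2.x convention for mixed-precision training, `with_cp` is the checkpointing flag of mmdet's ResNet, and it is not confirmed here that the config's image backbone lives under `model.img_backbone` or that every MSMDFusion module runs correctly in half precision. `MEMORY_SAVING_OVERRIDES` and `load_with_overrides` are hypothetical names.

```python
# Dotted keys follow the format accepted by mmcv's Config.merge_from_dict.
MEMORY_SAVING_OVERRIDES = {
    "data.samples_per_gpu": 1,           # down from the 2 tried in this thread
    "fp16.loss_scale": 512.0,            # assumed value; requests mixed-precision training
    "model.img_backbone.with_cp": True,  # assumed key; trades extra compute for memory
}


def load_with_overrides(config_path, overrides=MEMORY_SAVING_OVERRIDES):
    """Load an mmcv config file and merge the overrides into it."""
    from mmcv import Config  # imported lazily so the dict above is usable without mmcv

    cfg = Config.fromfile(config_path)
    cfg.merge_from_dict(dict(overrides))
    return cfg
```

A smaller per-GPU batch also changes the effective batch size, so the learning rate may need to be adjusted accordingly.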

SxJyJay / MSMDFusion

RuntimeError: SigmoidFocalLoss is not compiled with GPU support #21
