SamsungLabs / fcaf3d

[ECCV2022] FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection
MIT License

How to solve the error reported during the training trial #52

Open nss6planner opened 1 year ago

nss6planner commented 1 year ago

on3d$ bash tools/dist_train.sh configs/fcaf3d/fcaf3d_8x2_sunrgbd-3d-10class.py 2
*****
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/mmdet/utils/setup_env.py:49: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  f'Setting MKL_NUM_THREADS environment variable for each process '
/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/mmdet/utils/setup_env.py:49: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  f'Setting MKL_NUM_THREADS environment variable for each process '

filaPro commented 1 year ago

Hi @nss6planner,

I don't see any errors here, just warnings. Can you provide the full log?

nss6planner commented 1 year ago

Sorry, the complete error report is as follows:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/mmdet/utils/setup_env.py:49: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  f'Setting MKL_NUM_THREADS environment variable for each process '
/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/mmdet/utils/setup_env.py:49: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  f'Setting MKL_NUM_THREADS environment variable for each process '
Traceback (most recent call last):
Traceback (most recent call last):
  File "tools/train.py", line 263, in <module>
  File "tools/train.py", line 263, in <module>
    main()
  File "tools/train.py", line 171, in main
    main()
  File "tools/train.py", line 171, in main
    init_dist(args.launcher, **cfg.dist_params)
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 40, in init_dist
    init_dist(args.launcher, **cfg.dist_params)
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 40, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 63, in _init_dist_pytorch
    _init_dist_pytorch(backend, **kwargs)
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 63, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    dist.init_process_group(backend=backend, **kwargs)
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    barrier()
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/air/anaconda3/envs/mmde/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/air/anaconda3/envs/mmde/bin/python', '-u', 'tools/train.py', '--local_rank=1', 'configs/fcaf3d/fcaf3d_8x2_sunrgbd-3d-10class.py', '--seed', '0', '--launcher', 'pytorch']' returned non-zero exit status 1.

filaPro commented 1 year ago

Looks like the problem is not with fcaf3d but with your PyTorch installation. Can you check that PyTorch and, for example, mmcv work fine on your machine?
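One way to run that check is a small script like the one below. It is only a sketch, not part of fcaf3d: the function name `check_environment` and its `requested_gpus` parameter are made up for illustration, and the default of 2 is taken from the last argument of the `dist_train.sh` command above. It reports whether `torch` and `mmcv` can be found, and, if PyTorch is installed, whether CUDA and NCCL are usable and how many GPUs are visible. Comparing the visible GPU count with the number of processes launched is an assumption about a possible cause worth ruling out, since NCCL can report "invalid usage" when several ranks end up on the same device; the log alone does not confirm that this is what happened here.

```python
import importlib.util


def check_environment(requested_gpus=2):
    """Collect basic facts about the PyTorch / mmcv installation.

    `requested_gpus` is a hypothetical parameter: the number of processes
    passed to the launch script (2 in the command from this thread).
    """
    report = {"requested_gpus": requested_gpus}

    # find_spec only locates a top-level package; it does not import it.
    for name in ("torch", "mmcv"):
        report[name + "_installed"] = importlib.util.find_spec(name) is not None

    if report["torch_installed"]:
        import torch
        import torch.distributed as dist

        report["torch_version"] = torch.__version__
        report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
        report["visible_gpus"] = torch.cuda.device_count()
        report["nccl_available"] = dist.is_available() and dist.is_nccl_available()
        # Assumption: each launched process needs its own visible GPU.
        report["enough_gpus"] = report["visible_gpus"] >= requested_gpus

    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `enough_gpus` comes out as `False`, launching with a process count that matches `visible_gpus` would be a reasonable next thing to try. If `cuda_available` or `nccl_available` is `False`, the PyTorch build itself is the more likely culprit.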