cskkxjk / MonoNeRD

(ICCV2023) MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection
MIT License
81 stars 3 forks

Segmentation fault: child process hit a segmentation fault at runtime #11

Open saltedfisssh opened 10 months ago

saltedfisssh commented 10 months ago
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/mono3d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/envs/mono3d/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/envs/mono3d/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/user/anaconda3/envs/mono3d/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/user/anaconda3/envs/mono3d/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/anaconda3/envs/mono3d/bin/python', '-u', 'tools/train.py', '--local_rank=3', '--launcher', 'pytorch', '--fix_random_seed', '--sync_bn', '--save_to_file', '--cfg_file', './configs/stereo/kitti_models/mononerd.3d-and-bev.yaml', '--exp_name', 'exp_name']' died with <Signals.SIGSEGV: 11>.

/home/user/anaconda3/envs/mono3d/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 29 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/user/anaconda3/envs/mono3d/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 29 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

May I ask how to solve this problem?

Or could you provide a Docker image?

cskkxjk commented 10 months ago

Hello, could you share your machine configuration and gcc version?

saltedfisssh commented 10 months ago

Machine configuration: 4 × A100-SXM4-40GB, Driver Version: 530.30.02, CUDA Driver Version: 12.1, gcc: 9.5.0, torch: 1.8.1+cu111

The network builds successfully and the run reaches the start of training, then the program exits:

epochs:   0%|                                                                                                                                                                                            | 0/60 [00:00<?, ?it/s]{'NAME': 'filter_truncated', 'AREA_RATIO_THRESH': None, 'AREA_2D_RATIO_THRESH': None, 'GT_TRUNCATED_THRESH': 0.98}
filter truncated ratio: null 3d boxes [[ 2.99       -3.87       -0.66499996  4.43        1.84        1.75
  -0.2907964 ]] flipped False image idx 890 frame_id 001773 

{'NAME': 'filter_truncated', 'AREA_RATIO_THRESH': None, 'AREA_2D_RATIO_THRESH': None, 'GT_TRUNCATED_THRESH': 0.98}
filter truncated ratio: null 3d boxes [[ 2.93      -4.66      -0.73       4.18       1.86       1.48
  -1.6307963]] flipped False image idx 1040 frame_id 002080 

It is accompanied by this warning:

/home/user/anaconda3/envs/mono3d/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

cskkxjk commented 10 months ago

/home/user/anaconda3/envs/mono3d/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 29 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '

For this warning, monitor CPU and memory utilization at runtime and reduce the dataloader's num_workers. Alternatively, try suppressing it with:

export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'
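One caveat on the filter string, offered as a hedged sketch rather than a verified fix: the message field of a warning filter is matched against the start of the warning text, and the warning quoted above (Python 3.8) begins with `resource_tracker:` rather than `semaphore_tracker:`. The helper `count_emitted` below is hypothetical; it uses only the standard `warnings` module to compare the two prefixes against the message copied from the log.

```python
import warnings

# Message text copied from the log above (Python 3.8, resource_tracker.py).
LEAK_MESSAGE = (
    "resource_tracker: There appear to be 29 leaked semaphore "
    "objects to clean up at shutdown"
)


def count_emitted(prefix):
    """Return how many warnings get past an 'ignore' filter on `prefix`."""
    with warnings.catch_warnings(record=True) as caught:
        # Record everything by default...
        warnings.simplefilter("always")
        # ...except UserWarnings whose message starts with `prefix`.
        warnings.filterwarnings("ignore", message=prefix, category=UserWarning)
        warnings.warn(LEAK_MESSAGE, UserWarning)
    return len(caught)


if __name__ == "__main__":
    print("semaphore_tracker filter lets through:", count_emitted("semaphore_tracker"))
    print("resource_tracker filter lets through:", count_emitted("resource_tracker"))
```

If the second prefix is the one that matches in your environment, the corresponding setting would be `export PYTHONWARNINGS='ignore:resource_tracker:UserWarning'`. Either way this only hides the message; it does not address the segmentation fault itself.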

/home/user/anaconda3/envs/mono3d/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "

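This warning does not affect normal training on our machines.

Also, did you not run into any problems installing spconv-1.2.1? We previously tried several times to set up the environment on bare-metal A100 machines and never succeeded; this is the first time we have seen it get as far as actually running. Please send me the detailed log from when the program exits, along with an email address, and I will send you the Dockerfile I use.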
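For gathering a more detailed log at the moment of the crash, one possible approach (a sketch, not something confirmed in this thread) is Python's standard `faulthandler` module, which dumps the Python traceback of every thread when the process receives SIGSEGV. The helper name `enable_crash_log`, the log directory, and the per-rank file naming are assumptions made for illustration; the call would have to be placed near the top of the training entry point.

```python
import faulthandler
import os


def enable_crash_log(log_dir, rank):
    """Write Python tracebacks to a per-rank file on fatal signals such as SIGSEGV.

    Returns the open file object; the caller must keep it open for the
    lifetime of the process, because faulthandler writes to its descriptor.
    """
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, f"crash_rank{rank}.log")
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Hypothetical usage: rank 0, logs under ./crash_logs
    handle = enable_crash_log("crash_logs", rank=0)
    print("faulthandler enabled:", faulthandler.is_enabled())
```

Running the script with `python -X faulthandler` or setting `PYTHONFAULTHANDLER=1` enables the same mechanism without code changes, with output going to stderr. The traceback shows which Python call was active when the native code crashed, which may help narrow down whether a compiled extension is involved.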

saltedfisssh commented 10 months ago

Email sent, thanks.