astra-vision / MonoScene

[CVPR 2022] "MonoScene: Monocular 3D Semantic Scene Completion": 3D Semantic Occupancy Prediction from a single image
https://astra-vision.github.io/MonoScene/
Apache License 2.0

Unexpected segmentation fault encountered in worker. #67

Closed: Harsh188 closed this issue 1 year ago

Harsh188 commented 1 year ago

Description

I'm unable to run train_monoscene.py or eval_monoscene.py: both crash with DataLoader worker segmentation faults. I can't debug the issue further because no logging/print output reaches my console before the crash.
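One way to get more information out of the silently crashing workers (I have not applied this to MonoScene itself; this is a generic PyTorch sketch with a dummy dataset standing in for the real one) is to enable faulthandler inside each worker so a segfault prints a traceback instead of killing the worker silently:

```python
import faulthandler
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    """Placeholder; swap in the real KITTI dataset to reproduce the crash."""
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        return torch.zeros(3, 370, 1220)  # dummy image-sized tensor

def worker_init_fn(worker_id):
    # Dump a Python traceback to stderr if this worker receives a fatal
    # signal (e.g. SIGSEGV), instead of the worker dying silently.
    faulthandler.enable()

loader = DataLoader(DummyDataset(), batch_size=1, num_workers=1,
                    worker_init_fn=worker_init_fn)
for batch in loader:
    pass
```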

Terminal Log:

The following log is for train_monoscene.py:

python3 monoscene/scripts/train_monoscene.py     dataset=kitti     enable_log=true     kitti_root=$KITTI_ROOT     kitti_preprocess_root=$KITTI_PREPROCESS    kitti_logdir=$KITTI_LOG     n_gpus=1 batch_size=1    
INFO:root:Calling main
DEBUG:hydra.core.utils:Setting JobRuntime:name=UNKNOWN_NAME
DEBUG:hydra.core.utils:Setting JobRuntime:name=train_monoscene
[2023-06-23 14:07:31,189][root][INFO] - Dataset Selected: KITTI
exp_kitti_1_FrusSize_8_nRelations4_WD0.0001_lr0.0001_CEssc_geoScalLoss_semScalLoss_fpLoss_CERel_3DCRP_Proj_2_4_8
n_relations 4
Using cache found in /root/.cache/torch/hub/rwightman_gen-efficientnet-pytorch_master
Loading base model ()...Done.
Removing last two layers (global_pool & classifier).
Building Encoder-Decoder model..Done.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
[2023-06-23 14:07:32,727][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-06-23 14:07:32,727][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type       | Params
----------------------------------------------
0 | projects       | ModuleDict | 0     
1 | net_3d_decoder | UNet3D     | 16.9 M
2 | net_rgb        | UNet2D     | 132 M 
----------------------------------------------
149 M     Trainable params
0         Non-trainable params
149 M     Total params
598.222   Total estimated model params size (MB)
Validation sanity check:   0%|                                                                            | 0/2 [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/usr/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1677) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "monoscene/scripts/train_monoscene.py", line 178, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/hydra/main.py", line 32, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 346, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 347, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 107, in run
    return run_job(
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 128, in run_job
    ret.return_value = task_function(task_cfg)
  File "monoscene/scripts/train_monoscene.py", line 173, in main
    trainer.fit(model, data_module)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
    self._dispatch()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 990, in _dispatch
    self.accelerator.start_training(self)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1000, in run_stage
    return self._run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1122, in _run_sanity_check
    self._evaluation_loop.run()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 94, in advance
    batch_idx, batch = next(dataloader_iter)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1677) exited unexpectedly

The following log is for eval_monoscene.py:


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
n_relations 4
Using cache found in /root/.cache/torch/hub/rwightman_gen-efficientnet-pytorch_master
Loading base model ()...Done.
Removing last two layers (global_pool & classifier).
Building Encoder-Decoder model..Done.
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py:678: LightningDeprecationWarning: `trainer.test(test_dataloaders)` is deprecated in v1.4 and will be removed in v1.6. Use `trainer.test(dataloaders)` instead.
  rank_zero_deprecation(
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
[2023-06-23 13:56:12,695][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-06-23 13:56:12,695][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/usr/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1467) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "monoscene/scripts/eval_monoscene.py", line 71, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/hydra/main.py", line 32, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 346, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 347, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 107, in run
    return run_job(
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 128, in run_job
    ret.return_value = task_function(task_cfg)
  File "monoscene/scripts/eval_monoscene.py", line 67, in main
    trainer.test(model, test_dataloaders=val_dataloader)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 705, in test
    results = self._run(model)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
    self._dispatch()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_evaluating(self)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 95, in start_evaluating
    self.training_type_plugin.start_evaluating(trainer)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 165, in start_evaluating
    self._results = trainer.run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 997, in run_stage
    return self._run_evaluate()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1083, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 94, in advance
    batch_idx, batch = next(dataloader_iter)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1467) exited unexpectedly
Exception ignored in: <function tqdm.__del__ at 0x7f4172202af0>
anhquancao commented 1 year ago

Hi @Harsh188, the problem comes from memory overflow in the DataLoader workers. I think you should reduce the number of workers.
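A quick way to confirm this is to bypass the workers entirely with num_workers=0, so the data is loaded in the main process and any failure raises a normal traceback. A minimal, generic sketch (the TensorDataset is only a placeholder for the dataset MonoScene builds internally):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; the point is only the num_workers=0 setting.
train_dataset = TensorDataset(torch.zeros(4, 3, 370, 1220))

# num_workers=0 loads batches in the main process: no per-worker RAM copies,
# and any error in __getitem__ surfaces as an ordinary traceback instead of
# "DataLoader worker ... exited unexpectedly".
loader = DataLoader(train_dataset, batch_size=1, num_workers=0)
for i, (img,) in enumerate(loader):
    print(f"batch {i}: {tuple(img.shape)}")
```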

Harsh188 commented 1 year ago

Hi @anhquancao, thanks for the quick response.

I tried setting num_workers_per_gpu to 1 and I'm still facing the issue. I'm running an RTX 3080 Founders Edition (10 GB VRAM):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2B:00.0  On |                  N/A |
| 30%   47C    P5    77W / 320W |    641MiB / 10240MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Here's the output of python -c 'import torch;print(torch.__version__);print(torch.version.cuda)':

1.7.1
10.2

How much memory is required to run the pretrained model?

anhquancao commented 1 year ago

Hi, I mean the RAM, not the GPU memory. Did you also set batch_size to 1? The model needs 32 GB of GPU memory to train with a batch_size of 1. Inference is probably much cheaper, but I don't have an exact number.
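To check which of the two is actually filling up, you can log host RAM and GPU memory separately while the sanity check runs; a small sketch, assuming psutil is installed:

```python
import psutil
import torch

# Host (CPU) RAM, which is what the DataLoader workers consume.
ram = psutil.virtual_memory()
print(f"host RAM used: {ram.used / 1e9:.1f} / {ram.total / 1e9:.1f} GB")

# GPU memory allocated by this process, which is what the 32 GB figure refers to.
if torch.cuda.is_available():
    print(f"GPU allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
    print(f"GPU peak:      {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```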

Steven0928 commented 1 year ago

Hi, what if I only have a single 3090? How can I run the training sequence?

anhquancao commented 1 year ago

I think you can try the following:

For the RAM problem, you might need to optimize the data types of the variables; see the sketch below.
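As an illustration of the kind of dtype optimization I mean (made-up shapes, not the actual MonoScene preprocessing): keep large arrays in the smallest dtype that still holds the values, so each DataLoader worker holds far less RAM.

```python
import numpy as np

# A semantic-label voxel grid with < 256 classes does not need float64.
labels_f64 = np.random.randint(0, 20, size=(256, 256, 32)).astype(np.float64)
labels_u8 = labels_f64.astype(np.uint8)    # 8x smaller
print(f"labels: {labels_f64.nbytes / 1e6:.1f} MB -> {labels_u8.nbytes / 1e6:.1f} MB")

# Float features default to float64 in NumPy; float32 halves the footprint.
feats_f64 = np.random.rand(64, 128, 128)
feats_f32 = feats_f64.astype(np.float32)   # 2x smaller
print(f"feats:  {feats_f64.nbytes / 1e6:.1f} MB -> {feats_f32.nbytes / 1e6:.1f} MB")
```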

Harsh188 commented 1 year ago

Thanks @anhquancao. Closing this issue as it's mostly a hardware limitation on my end.