PRBonn / lidar-bonnetal

Semantic and Instance Segmentation of LiDAR point clouds for autonomous driving
http://semantic-kitti.org
MIT License

Why does training in Docker show "Training in device: cpu" and "Bus error"? #30

Closed. Xl-wj closed this issue 4 years ago.

Xl-wj commented 4 years ago

Hi, thank you for open-sourcing this work. When I run training in Docker, I run into two problems; the training output is as follows.


INTERFACE:
  dataset /bonnet/KITTI/
  arch_cfg config/arch/squeezeseg.yaml
  data_cfg config/labels/semantic-kitti.yaml
  log /bonnet/lidar-bonnetal/logs/
  pretrained None

Commit hash (training version): b'4233111'

Opening arch config file config/arch/squeezeseg.yaml
Opening data config file config/labels/semantic-kitti.yaml
No pretrained directory found.
Copying files to /bonnet/lidar-bonnetal/logs/ for further reference.
Sequences folder exists! Using sequences from /bonnet/KITTI/sequences
parsing seq 00
parsing seq 01
parsing seq 02
parsing seq 03
parsing seq 04
parsing seq 05
parsing seq 06
parsing seq 07
parsing seq 09
parsing seq 10
Using 2761 scans from sequences [0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
Sequences folder exists! Using sequences from /bonnet/KITTI/sequences
parsing seq 05
Using 2761 scans from sequences [5]
Loss weights from content: tensor([ 0.0000, 22.9317, 857.5627, 715.1100, 315.9618, 356.2452, 747.6170, 887.2239, 963.8915, 5.0051, 63.6247, 6.9002, 203.8796, 7.4802, 13.6315, 3.7339, 142.1462, 12.6355, 259.3699, 618.9667])
Using SqueezeNet Backbone
Depth of backbone input = 5
Original OS: 16
New OS: 16
Strides: [2, 2, 2, 2]
Decoder original OS: 16
Decoder new OS: 16
Decoder strides: [2, 2, 2, 2]
Total number of parameters: 915540
Total number of parameters requires_grad: 915540
Param encoder 724032
Param decoder 179968
Param head 11540
No path to pretrained, using random init.
Training in device: cpu
Ignoring class 0 in IoU evaluation
[IOU EVAL] IGNORE: tensor([0])
[IOU EVAL] INCLUDE: tensor([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "./train.py", line 115, in <module>
    trainer.train()
  File "../../tasks/semantic/modules/trainer.py", line 236, in train
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    show_scans=self.ARCH["train"]["show_scans"])
  File "../../tasks/semantic/modules/trainer.py", line 307, in train_epoch
    for i, (in_vol, proj_mask, proj_labels, _, path_seq, path_name, _, _, _, _, _, _, _, _, _) in enumerate(train_loader):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 576, in __next__
    idx, batch = self._get_batch()
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 553, in _get_batch
    success, data = self._try_get_batch()
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 104, in get
    if timeout < 0 or not self._poll(timeout):
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.5/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 212) is killed by signal: Bus error.

Looking forward to your reply.
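For reference, both symptoms in the log above can be checked from inside the container before launching a long run. The sketch below is a standalone diagnostic, not part of lidar-bonnetal; it only assumes that the trainer falls back to CPU when torch.cuda.is_available() is False, and it reads the size of /dev/shm, which Docker caps at 64 MB by default.

# Minimal diagnostic sketch (my addition, not part of lidar-bonnetal).
# Run it inside the training container before starting train.py.
import shutil
import torch

# 1) "Training in device: cpu" usually means CUDA is not visible inside the
#    container (e.g. the NVIDIA runtime / --gpus flag was not used).
if torch.cuda.is_available():
    print("CUDA available:", torch.cuda.get_device_name(0))
else:
    print("CUDA NOT available -> training will fall back to CPU")

# 2) DataLoader workers pass batches through /dev/shm; Docker's default of
#    64 MB is easily exhausted, which is what the "Bus error" points at.
total, used, free = shutil.disk_usage("/dev/shm")
print("/dev/shm total: %.1f MB, free: %.1f MB" % (total / 2**20, free / 2**20))

If /dev/shm turns out to be only 64 MB, restarting the container with a larger shared-memory segment (for example docker run --shm-size=8g ..., or --ipc=host) is the usual alternative to reducing the number of DataLoader workers.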

tano297 commented 4 years ago

At first sight it looks like you are running out of memory. Can you try with workers=0?
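For context (my addition, not the maintainer's): the workers value in the arch config (config/arch/squeezeseg.yaml, under the train section, if I read the config right) appears to be passed straight to the DataLoader's num_workers. With workers=0 the main process loads the batches itself, so no worker subprocesses are spawned and nothing goes through the shared-memory queue that was failing above. A toy sketch with made-up tensor shapes:

# Toy illustration of the workers=0 behaviour (shapes are invented, this is
# not the project's real parser).
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a few projected scans: 5 input channels, 64 x 2048 range image.
scans = torch.randn(8, 5, 64, 2048)
labels = torch.randint(0, 20, (8, 64, 2048))
dataset = TensorDataset(scans, labels)

# num_workers=0: batches are assembled in the main process, so /dev/shm is
# never used to hand tensors between processes.
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)

for in_vol, proj_labels in loader:
    print(in_vol.shape, proj_labels.shape)  # torch.Size([2, 5, 64, 2048]) torch.Size([2, 64, 2048])
    break

The trade-off is throughput: with workers=0 data loading runs serially in the training process, so once the container has enough shared memory (see the --shm-size note above), keeping several workers is usually preferable.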

Xl-wj commented 4 years ago

Thanks, it works! @tano297
