facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

PointRend training error "ERROR: Unexpected bus error encountered in worker." #812

Closed: sunnyheart008205 closed this issue 4 years ago

sunnyheart008205 commented 4 years ago
  1. I used the following command to train PointRend on the COCO 2017 dataset, and got errors:

    python ./projects/PointRend/train_net.py --config-file ./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml --num-gpus 4
  2. The log/error is as follows (a note on the tracebacks follows the log):

    Command Line Args: Namespace(config_file='./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=[], resume=False)
    [02/06 07:12:25 detectron2]: Rank of current process: 0. World size: 4
    ''''''''''''''''''''
    [02/06 07:12:26 detectron2]: Command line arguments: Namespace(config_file='./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=[], resume=False)
    [02/06 07:12:26 detectron2]: Contents of args.config_file=./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml:
    _BASE_: Base-PointRend-RCNN-FPN.yaml
    MODEL:
      WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-50.pkl
      MASK_ON: true
      RESNETS:
        DEPTH: 50
    ................
    [02/06 07:12:46 d2.data.datasets.coco]: Loading datasets/coco/annotations/instances_train2017.json takes 18.61 seconds.
    [02/06 07:12:47 d2.data.datasets.coco]: Loaded 118287 images in COCO format from datasets/coco/annotations/instances_train2017.json
    [02/06 07:12:55 d2.data.build]: Removed 1021 images with no usable annotations. 117266 images left.
    [02/06 07:12:59 d2.data.build]: Distribution of instances among all 80 categories:
    category #instances   category #instances   category #instances
    person 257253 bicycle 7056 car 43533
    motorcycle 8654 airplane 5129 bus 6061
    train 4570 truck 9970 boat 10576
    traffic light 12842 fire hydrant 1865 stop sign 1983
    parking meter 1283 bench 9820 bird 10542
    cat 4766 dog 5500 horse 6567
    sheep 9223 cow 8014 elephant 5484
    bear 1294 zebra 5269 giraffe 5128
    backpack 8714 umbrella 11265 handbag 12342
    tie 6448 suitcase 6112 frisbee 2681
    skis 6623 snowboard 2681 sports ball 6299
    kite 8802 baseball bat 3273 baseball gl.. 3747
    skateboard 5536 surfboard 6095 tennis racket 4807
    bottle 24070 wine glass 7839 cup 20574
    fork 5474 knife 7760 spoon 6159
    bowl 14323 banana 9195 apple 5776
    sandwich 4356 orange 6302 broccoli 7261
    carrot 7758 hot dog 2884 pizza 5807
    donut 7005 cake 6296 chair 38073
    couch 5779 potted plant 8631 bed 4192
    dining table 15695 toilet 4149 tv 5803
    laptop 4960 mouse 2261 remote 5700
    keyboard 2854 cell phone 6422 microwave 1672
    oven 3334 toaster 225 sink 5609
    refrigerator 2634 book 24077 clock 6320
    vase 6577 scissors 1464 teddy bear 4729
    hair drier 198 toothbrush 1945
    total 849949

    [02/06 07:12:59 d2.data.detection_utils]: TransformGens used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
    [02/06 07:12:59 d2.data.build]: Using training sampler TrainingSampler
    [02/06 07:13:01 fvcore.common.checkpoint]: Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl
    [02/06 07:13:01 fvcore.common.file_io]: URL https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/MSRA/R-50.pkl cached in /root/.torch/fvcore_cache/detectron2/ImageNetPretrained/MSRA/R-50.pkl
    ''''''''''''''''
    [02/06 07:13:01 d2.checkpoint.c2_model_loading]: Some model parameters are not in the checkpoint:
      backbone.fpn_lateral2.{bias, weight}
      backbone.fpn_lateral3.{bias, weight}
      backbone.fpn_lateral4.{bias, weight}
      backbone.fpn_lateral5.{bias, weight}
      backbone.fpn_output2.{bias, weight}
      backbone.fpn_output3.{bias, weight}
      backbone.fpn_output4.{bias, weight}
      backbone.fpn_output5.{bias, weight}
      proposal_generator.anchor_generator.cell_anchors.{0, 1, 2, 3, 4}
      proposal_generator.rpn_head.anchor_deltas.{bias, weight}
      proposal_generator.rpn_head.conv.{bias, weight}
      proposal_generator.rpn_head.objectness_logits.{bias, weight}
      roi_heads.box_head.fc1.{bias, weight}
      roi_heads.box_head.fc2.{bias, weight}
      roi_heads.box_predictor.bbox_pred.{bias, weight}
      roi_heads.box_predictor.cls_score.{bias, weight}
      roi_heads.mask_coarse_head.coarse_mask_fc1.{bias, weight}
      roi_heads.mask_coarse_head.coarse_mask_fc2.{bias, weight}
      roi_heads.mask_coarse_head.prediction.{bias, weight}
      roi_heads.mask_coarse_head.reduce_spatial_dim_conv.{bias, weight}
      roi_heads.mask_point_head.fc1.{bias, weight}
      roi_heads.mask_point_head.fc2.{bias, weight}
      roi_heads.mask_point_head.fc3.{bias, weight}
      roi_heads.mask_point_head.predictor.{bias, weight}
    [02/06 07:13:01 d2.checkpoint.c2_model_loading]: The checkpoint contains parameters not used by the model:
      fc1000_b
      fc1000_w
      conv1_b
    [02/06 07:13:02 d2.engine.train_loop]: Starting training from iteration 0
    ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
        obj = _ForkingPickler.dumps(obj)
      File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
      File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
        fd, size = storage._share_fd_()
    RuntimeError: unable to write to file
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
        obj = _ForkingPickler.dumps(obj)
      File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
      File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
        fd, size = storage._share_fd_()
    RuntimeError: unable to write to file
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
        obj = _ForkingPickler.dumps(obj)
      File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
      File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
        fd, size = storage._share_fd_()
    RuntimeError: unable to write to file
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
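Note on the tracebacks: the identical errors come from the feeder threads of separate DataLoader worker processes. Each one fails while pickling a batch's tensor storage into a shared-memory file descriptor (reduce_storage in torch/multiprocessing/reductions.py). For orientation, a minimal sketch of how to inspect the inter-process sharing strategy PyTorch is using (these are real torch.multiprocessing functions):

    # Orientation sketch: on Linux, PyTorch defaults to the 'file_descriptor'
    # sharing strategy, which backs each shared tensor with a /dev/shm file.
    import torch.multiprocessing as mp

    print(mp.get_sharing_strategy())        # 'file_descriptor' by default on Linux
    print(mp.get_all_sharing_strategies())  # {'file_descriptor', 'file_system'}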

  3. The GPU memory seems to be sufficient (but see the note below the table):

    Thu Feb  6 07:41:13 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-PCIE...  On   | 00000000:5A:00.0 Off |                    0 |
    | N/A   62C    P0   101W / 250W |  10823MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-PCIE...  On   | 00000000:5E:00.0 Off |                    0 |
    | N/A   55C    P0    96W / 250W |  10298MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-PCIE...  On   | 00000000:62:00.0 Off |                    0 |
    | N/A   58C    P0   103W / 250W |  10298MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-PCIE...  On   | 00000000:66:00.0 Off |                    0 |
    | N/A   59C    P0   103W / 250W |  10298MiB / 32510MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-PCIE...  On   | 00000000:B5:00.0 Off |                    0 |
    | N/A   56C    P0   102W / 250W |  10290MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-PCIE...  On   | 00000000:B9:00.0 Off |                    0 |
    | N/A   62C    P0   110W / 250W |  10296MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   6  Tesla V100-PCIE...  On   | 00000000:BD:00.0 Off |                    0 |
    | N/A   58C    P0    62W / 250W |  10296MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-PCIE...  On   | 00000000:C1:00.0 Off |                    0 |
    | N/A   57C    P0    57W / 250W |  10296MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
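Note that healthy nvidia-smi readings do not rule this error out: the bus error concerns POSIX shared memory on the host (/dev/shm), which nvidia-smi does not report. A minimal check, assuming a Linux host:

    # The bus error concerns host shared memory (/dev/shm), not GPU memory,
    # so nvidia-smi can look healthy while /dev/shm is exhausted by workers.
    import shutil

    total, used, free = shutil.disk_usage("/dev/shm")
    print(f"/dev/shm: {total / 2**30:.1f} GiB total, {free / 2**30:.1f} GiB free")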

Environment:

Run python -m detectron2.utils.collect_env in the environment where you observed the issue, and paste the output.

[02/06 07:12:26 detectron2]: Environment info:


sys.platform              linux
Python                    3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
numpy                     1.17.2
detectron2                0.1 @/data/xxx/InstanceSegmentation/detectron2/detectron2-master/detectron2
detectron2 compiler       GCC 5.4
detectron2 CUDA compiler  10.1
detectron2 arch flags     sm_70
DETECTRON2_ENV_MODULE     <not set>
PyTorch                   1.3.0 @/opt/conda/lib/python3.6/site-packages/torch
PyTorch debug build       False
CUDA available            True
GPU 0,1,2,3,4,5,6,7       Tesla V100-PCIE-32GB
CUDA_HOME                 /usr/local/cuda
NVCC                      Cuda compilation tools, release 10.1, V10.1.243
Pillow                    6.2.2
torchvision               0.4.1a0+d94043a @/opt/conda/lib/python3.6/site-packages/torchvision
torchvision arch flags    sm_35, sm_50, sm_60, sm_70, sm_75
cv2                       4.2.0

PyTorch built with:

What can I do about this? Thanks.

ppwwyyxx commented 4 years ago

You can find similar issues in https://github.com/pytorch/pytorch/issues/, such as https://github.com/pytorch/pytorch/issues/2926. Please follow the discussions there, as this is not related to detectron2.
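For later readers, a minimal sketch of the workarounds those threads typically converge on, assuming the root cause is an undersized /dev/shm (for example, Docker containers default to 64 MB):

    # Workaround sketch: switch PyTorch to the 'file_system' sharing strategy,
    # which passes tensors between workers through temporary files instead of
    # /dev/shm file descriptors. Call this before any DataLoader is created,
    # e.g. near the top of train_net.py's main().
    import torch.multiprocessing

    torch.multiprocessing.set_sharing_strategy("file_system")

Alternatively, enlarge shared memory when training inside a container (e.g. docker run --shm-size=8g ...), or reduce the number of loader processes via the DATALOADER.NUM_WORKERS config option.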