The log/error is as follows:
Command Line Args: Namespace(config_file='./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=[], resume=False)
[02/06 07:12:25 detectron2]: Rank of current process: 0. World size: 4
[... log truncated ...]
[02/06 07:12:26 detectron2]: Command line arguments: Namespace(config_file='./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=[], resume=False)
[02/06 07:12:26 detectron2]: Contents of args.config_file=./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml:
_BASE_: Base-PointRend-RCNN-FPN.yaml
MODEL:
  WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-50.pkl
  MASK_ON: true
  RESNETS:
    DEPTH: 50
[... log truncated ...]
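(The exact launch command was not included in the post. For reference, a minimal sketch of how a multi-GPU run like this is typically started with detectron2's `launch` helper, matching the Namespace printed above:)

```python
# A minimal launch sketch (an assumption: the standard detectron2 entry-point
# pattern; the exact command used for this run was not included in the post).
from detectron2.engine import default_argument_parser, launch


def main(args):
    # Build the cfg from args.config_file and run the trainer here.
    ...


if __name__ == "__main__":
    # Parses the same flags as the Namespace printed above, e.g.
    # --num-gpus 4 --config-file ./projects/PointRend/configs/...
    args = default_argument_parser().parse_args()
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
```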
[02/06 07:12:46 d2.data.datasets.coco]: Loading datasets/coco/annotations/instances_train2017.json takes 18.61 seconds.
[02/06 07:12:47 d2.data.datasets.coco]: Loaded 118287 images in COCO format from datasets/coco/annotations/instances_train2017.json
[02/06 07:12:55 d2.data.build]: Removed 1021 images with no usable annotations. 117266 images left.
[02/06 07:12:59 d2.data.build]: Distribution of instances among all 80 categories:
| category      | #instances | category     | #instances | category      | #instances |
|:-------------:|:----------:|:------------:|:----------:|:-------------:|:----------:|
| person        | 257253     | bicycle      | 7056       | car           | 43533      |
| motorcycle    | 8654       | airplane     | 5129       | bus           | 6061       |
| train         | 4570       | truck        | 9970       | boat          | 10576      |
| traffic light | 12842      | fire hydrant | 1865       | stop sign     | 1983       |
| parking meter | 1283       | bench        | 9820       | bird          | 10542      |
| cat           | 4766       | dog          | 5500       | horse         | 6567       |
| sheep         | 9223       | cow          | 8014       | elephant      | 5484       |
| bear          | 1294       | zebra        | 5269       | giraffe       | 5128       |
| backpack      | 8714       | umbrella     | 11265      | handbag       | 12342      |
| tie           | 6448       | suitcase     | 6112       | frisbee       | 2681       |
| skis          | 6623       | snowboard    | 2681       | sports ball   | 6299       |
| kite          | 8802       | baseball bat | 3273       | baseball gl.. | 3747       |
| skateboard    | 5536       | surfboard    | 6095       | tennis racket | 4807       |
| bottle        | 24070      | wine glass   | 7839       | cup           | 20574      |
| fork          | 5474       | knife        | 7760       | spoon         | 6159       |
| bowl          | 14323      | banana       | 9195       | apple         | 5776       |
| sandwich      | 4356       | orange       | 6302       | broccoli      | 7261       |
| carrot        | 7758       | hot dog      | 2884       | pizza         | 5807       |
| donut         | 7005       | cake         | 6296       | chair         | 38073      |
| couch         | 5779       | potted plant | 8631       | bed           | 4192       |
| dining table  | 15695      | toilet       | 4149       | tv            | 5803       |
| laptop        | 4960       | mouse        | 2261       | remote        | 5700       |
| keyboard      | 2854       | cell phone   | 6422       | microwave     | 1672       |
| oven          | 3334       | toaster      | 225        | sink          | 5609       |
| refrigerator  | 2634       | book         | 24077      | clock         | 6320       |
| vase          | 6577       | scissors     | 1464       | teddy bear    | 4729      |
| hair drier    | 198        | toothbrush   | 1945       | total         | 849949     |
[02/06 07:12:59 d2.data.detection_utils]: TransformGens used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[02/06 07:12:59 d2.data.build]: Using training sampler TrainingSampler
[02/06 07:13:01 fvcore.common.checkpoint]: Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl
[02/06 07:13:01 fvcore.common.file_io]: URL https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/MSRA/R-50.pkl cached in /root/.torch/fvcore_cache/detectron2/ImageNetPretrained/MSRA/R-50.pkl
[... log truncated ...]
[02/06 07:13:01 d2.checkpoint.c2_model_loading]: Some model parameters are not in the checkpoint:
backbone.fpn_lateral2.{bias, weight}
backbone.fpn_lateral3.{bias, weight}
backbone.fpn_lateral4.{bias, weight}
backbone.fpn_lateral5.{bias, weight}
backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
proposal_generator.anchor_generator.cell_anchors.{0, 1, 2, 3, 4}
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.mask_coarse_head.coarse_mask_fc1.{bias, weight}
roi_heads.mask_coarse_head.coarse_mask_fc2.{bias, weight}
roi_heads.mask_coarse_head.prediction.{bias, weight}
roi_heads.mask_coarse_head.reduce_spatial_dim_conv.{bias, weight}
roi_heads.mask_point_head.fc1.{bias, weight}
roi_heads.mask_point_head.fc2.{bias, weight}
roi_heads.mask_point_head.fc3.{bias, weight}
roi_heads.mask_point_head.predictor.{bias, weight}
[02/06 07:13:01 d2.checkpoint.c2_model_loading]: The checkpoint contains parameters not used by the model:
fc1000_b
fc1000_w
conv1_b
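(These two checkpoint messages are expected: R-50.pkl contains only ImageNet-pretrained backbone weights, so the FPN/RPN/ROI-head parameters are freshly initialized and the ImageNet classifier blobs fc1000_* are discarded. To sanity-check the cached file, a minimal sketch, assuming -- as the blob names above suggest -- a flat Caffe2-style pickle of blob name to numpy array:)

```python
# A sketch to peek at the cached checkpoint (assumption: a flat pickle of
# Caffe2-style blob names, as the fc1000_*/conv1_b entries above suggest).
import pickle

path = "/root/.torch/fvcore_cache/detectron2/ImageNetPretrained/MSRA/R-50.pkl"
with open(path, "rb") as f:
    blobs = pickle.load(f, encoding="latin1")  # latin1 for py2-era pickles
print(len(blobs), "blobs; has fc1000_w:", "fc1000_w" in blobs)
```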
[02/06 07:13:02 d2.engine.train_loop]: Starting training from iteration 0
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
obj = _ForkingPickler.dumps(obj)
File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
fd, size = storage._share_fd_()
RuntimeError: unable to write to file
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
obj = _ForkingPickler.dumps(obj)
File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
fd, size = storage._share_fd_()
RuntimeError: unable to write to file
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
obj = _ForkingPickler.dumps(obj)
File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
fd, size = storage._share_fd_()
RuntimeError: unable to write to file
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
The GPU memory seems to be sufficient:
Thu Feb 6 07:41:13 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:5A:00.0 Off | 0 |
| N/A 62C P0 101W / 250W | 10823MiB / 32510MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:5E:00.0 Off | 0 |
| N/A 55C P0 96W / 250W | 10298MiB / 32510MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... On | 00000000:62:00.0 Off | 0 |
| N/A 58C P0 103W / 250W | 10298MiB / 32510MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... On | 00000000:66:00.0 Off | 0 |
| N/A 59C P0 103W / 250W | 10298MiB / 32510MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-PCIE... On | 00000000:B5:00.0 Off | 0 |
| N/A 56C P0 102W / 250W | 10290MiB / 32510MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-PCIE... On | 00000000:B9:00.0 Off | 0 |
| N/A 62C P0 110W / 250W | 10296MiB / 32510MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-PCIE... On | 00000000:BD:00.0 Off | 0 |
| N/A 58C P0 62W / 250W | 10296MiB / 32510MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-PCIE... On | 00000000:C1:00.0 Off | 0 |
| N/A 57C P0 57W / 250W | 10296MiB / 32510MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
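(Note that the bus error above concerns host shared memory, /dev/shm, which PyTorch DataLoader workers use to pass tensors between processes, not GPU memory. The /opt/conda paths suggest a Docker container, whose default shm size is only 64 MiB. A minimal diagnostic sketch, an assumption and not from the original post:)

```python
# Compare the size of /dev/shm -- used by DataLoader worker IPC -- against
# the training load; the bus error above blames this, not GPU memory.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**20:.0f} MiB, free: {free / 2**20:.0f} MiB")

# If this is small (Docker's default is 64 MiB), common workarounds are:
#   * relaunch the container with more shared memory, e.g.
#       docker run --shm-size=8g ...
#   * or avoid worker IPC entirely by appending
#       DATALOADER.NUM_WORKERS 0
#     to the training command's opts (slower data loading, but no shm use).
```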
Environment:
Run `python -m detectron2.utils.collect_env` in the environment where you observed the issue, and paste the output.
[02/06 07:12:26 detectron2]: Environment info:
I used the command line shown at the top to train PointRend on the COCO 2017 dataset, and got the errors above.
How can I fix this? Thanks!