It looks like the same problem as: https://github.com/caffe2/caffe2/issues/616.
I'm running in a Docker container where I've taken the current NVIDIA GPU Cloud image and rebuilt it with Redis and ibverbs support. This is Caffe2 0.8.1.
I'm using network=host. The ib_write_bw benchmark runs, as does the Caffe2 workload when using IPoIB.
The script is derived from resnet50_trainer.py, modified for a different model.
AKJBJ:/var/lib/docker/overlay2/l/X7OZ4RLBWAZ2BDHOL5LIN6UIIB:/var/lib/docker/overlay2/l/DESPBGJMW33FNCE2Y3Z634CQLV:/var/lib/docker/overlay2/l/JKVUIFZ6TDPIIJWML5PMKS4RHU:/var/lib/docker/overlay2/l/F4R3RRCEWYKPFEQIHPZCWUTMZY:/var/lib/docker/overlay2/l/MKTJ4'
INFO:inception_resnet_v2_trainer:Starting epoch 0/10
mlx5: zeusn041: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 02005104 090007c1 004273d2
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/caffe2/third_party/gloo/gloo/transport/ibverbs/pair.cc:417] wc->status == IBV_WC_SUCCESS. 4 vs 0. Send for slot 34: local protection error
Aborted at 1521688471 (unix time) try "date -d @1521688471" if you are using GNU date
PC: @ 0x7f2796b9d428 gsignal
SIGABRT (@0x45) received by PID 69 (TID 0x7f24b3b82700) from PID 69; stack trace:
@ 0x7f2796f43390 (unknown)
@ 0x7f2796b9d428 gsignal
@ 0x7f2796b9f02a abort
@ 0x7f279165584d __gnu_cxx::__verbose_terminate_handler()
@ 0x7f27916536b6 (unknown)
@ 0x7f2791653701 std::terminate()
@ 0x7f279167ed38 (unknown)
@ 0x7f2796f396ba start_thread
@ 0x7f2796c6f41d clone
@ 0x0 (unknown)
./runiirv2.sh: line 9: 69 Aborted (core dumped) python /lvol/sfleisch/nettest/inception_resnet_v2_trainer.py --train_data /lvol/sfleisch/ilsvrc12_data_300x300/ilsvrc12_train_lmdb/ --test_data /lvol/sfleisch/ilsvrc12_data_300x300/ilsvrc12_val_lmdb/ --batch_size $(($num_gpus*$bs)) --run_id 1 --epoch_size $(($num_gpus*1000)) --num_epochs 10 --image_size 299 --num_gpus $num_gpus --cudnn_workspace_limit_mb 1024 --num_shards $num_shards --shard_id $shard_id --dtype float16 --float16_compute --enable-tensor-core --redis_host $REDIS_HOST --redis_port $REDIS_PORT --distributed_transport ibverbs --distributed_interfaces mlx5_0
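
For reference, the "4 vs 0" in the enforce message above is the numeric work-completion status compared against IBV_WC_SUCCESS. A minimal sketch for decoding it, assuming the leading values of `enum ibv_wc_status` in libibverbs' `<infiniband/verbs.h>` are ordered as listed below; the helper name `decode_enforce_status` is hypothetical and is not part of gloo or Caffe2:

```python
import re

# Leading values of enum ibv_wc_status, in declaration order
# (assumption: matches <infiniband/verbs.h>; list index == numeric value).
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",         # 0
    "IBV_WC_LOC_LEN_ERR",     # 1
    "IBV_WC_LOC_QP_OP_ERR",   # 2
    "IBV_WC_LOC_EEC_OP_ERR",  # 3
    "IBV_WC_LOC_PROT_ERR",    # 4
    "IBV_WC_WR_FLUSH_ERR",    # 5
]


def decode_enforce_status(message):
    """Hypothetical helper: extract the observed status from a gloo
    'wc->status == IBV_WC_SUCCESS. X vs Y.' enforce message and name it."""
    match = re.search(r"wc->status == IBV_WC_SUCCESS\. (\d+) vs (\d+)", message)
    if match is None:
        return None
    code = int(match.group(1))
    if code < len(IBV_WC_STATUS):
        return IBV_WC_STATUS[code]
    return "unknown(%d)" % code


line = ("[enforce fail at /opt/caffe2/third_party/gloo/gloo/transport/ibverbs/"
        "pair.cc:417] wc->status == IBV_WC_SUCCESS. 4 vs 0. "
        "Send for slot 34: local protection error")
print(decode_enforce_status(line))
```

Status 4 corresponds to IBV_WC_LOC_PROT_ERR, which agrees with the "local protection error" text in the log.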