facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0

Caffe2 with ibverbs errors #2365

Open sfleisch opened 6 years ago

sfleisch commented 6 years ago

It looks like the same problem as https://github.com/caffe2/caffe2/issues/616. I'm running in a Docker container: I took the current NVIDIA GPU Cloud image and rebuilt it with Redis and ibverbs support. This is Caffe2 0.8.1.

I'm using network=host. The ib_write_bw benchmark runs, as does the Caffe2 workload when using IPoIB. The script is derived from resnet50_trainer.py, modified for a different model.

AKJBJ:/var/lib/docker/overlay2/l/X7OZ4RLBWAZ2BDHOL5LIN6UIIB:/var/lib/docker/overlay2/l/DESPBGJMW33FNCE2Y3Z634CQLV:/var/lib/docker/overlay2/l/JKVUIFZ6TDPIIJWML5PMKS4RHU:/var/lib/docker/overlay2/l/F4R3RRCEWYKPFEQIHPZCWUTMZY:/var/lib/docker/overlay2/l/MKTJ4'
INFO:inception_resnet_v2_trainer:Starting epoch 0/10
mlx5: zeusn041: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 02005104 090007c1 004273d2
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/caffe2/third_party/gloo/gloo/transport/ibverbs/pair.cc:417] wc->status == IBV_WC_SUCCESS. 4 vs 0. Send for slot 34: local protection error
Aborted at 1521688471 (unix time) try "date -d @1521688471" if you are using GNU date
PC: @ 0x7f2796b9d428 gsignal
SIGABRT (@0x45) received by PID 69 (TID 0x7f24b3b82700) from PID 69; stack trace:
@ 0x7f2796f43390 (unknown)
@ 0x7f2796b9d428 gsignal
@ 0x7f2796b9f02a abort
@ 0x7f279165584d __gnu_cxx::__verbose_terminate_handler()
@ 0x7f27916536b6 (unknown)
@ 0x7f2791653701 std::terminate()
@ 0x7f279167ed38 (unknown)
@ 0x7f2796f396ba start_thread
@ 0x7f2796c6f41d clone
@ 0x0 (unknown)
./runiirv2.sh: line 9: 69 Aborted (core dumped) python /lvol/sfleisch/nettest/inception_resnet_v2_trainer.py --train_data /lvol/sfleisch/ilsvrc12_data_300x300/ilsvrc12_train_lmdb/ --test_data /lvol/sfleisch/ilsvrc12_data_300x300/ilsvrc12_val_lmdb/ --batch_size $(($num_gpus*$bs)) --run_id 1 --epoch_size $(($num_gpus*1000)) --num_epochs 10 --image_size 299 --num_gpus $num_gpus --cudnn_workspace_limit_mb 1024 --num_shards $num_shards --shard_id $shard_id --dtype float16 --float16_compute --enable-tensor-core --redis_host $REDIS_HOST --redis_port $REDIS_PORT --distributed_transport ibverbs --distributed_interfaces mlx5_0

boriskovalev commented 6 years ago

@pietern Can you help?