facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

multi-GPU training throws an illegal memory access #32

Closed zdwong closed 6 years ago

zdwong commented 6 years ago

When I use one GPU to train, there is no problem. But when I use two or four GPUs, the problem appears. The log output:

terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what(): [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
Aborted at 1516866180 (unix time) try "date -d @1516866180" if you are using GNU date
terminate called recursively
terminate called recursively
terminate called recursively
PC: @ 0x7ff67559f428 gsignal
terminate called recursively
terminate called recursively
E0125 07:43:00.745853 55683 pybind_state.h:422] Exception encountered running PythonOp function: RuntimeError: [enforce fail at context_gpu.h:307] error == cudaSuccess. 77 vs 0. Error at: /mnt/hzhida/project/caffe2/caffe2/core/context_gpu.h:307: an illegal memory access was encountered

At: /mnt/hzhida/facebook/detectron/lib/ops/generate_proposals.py(101): forward
SIGABRT (@0x3e80000d84f) received by PID 55375 (TID 0x7ff453fff700) from PID 55375; stack trace:
terminate called recursively
    @     0x7ff675945390 (unknown)
    @     0x7ff67559f428 gsignal
    @     0x7ff6755a102a abort
    @     0x7ff66f37e84d __gnu_cxx::__verbose_terminate_handler()
    @     0x7ff66f37c6b6 (unknown)
    @     0x7ff66f37c701 std::terminate()
    @     0x7ff66f3a7d38 (unknown)
    @     0x7ff67593b6ba start_thread
    @     0x7ff67567141d clone
    @                0x0 (unknown)
Aborted (core dumped)

yousongzhu commented 6 years ago

I got the same error. The difference is that when I use one GPU or two GPUs, there is no problem. But when using 4 GPUs to train Mask RCNN (mask_rcnn_R-101-FPN) or RetinaNet (retinanet_R-101-FPN), the same problem occurs.

lwher commented 6 years ago

I have the same problem when I train the tutorial_Res50 network with two or more GPUs.

jwnsu commented 6 years ago

Encountered the same issue when specifying GPU ids other than the lowest ones (e.g. '1,3,5,7' for 4 GPUs). If the lowest GPU ids are specified, training goes on fine.

rbgirshick commented 6 years ago

@jwnsu: we're working on a fix so that when CUDA_VISIBLE_DEVICES does not use the lowest ids training still works. Thanks for reporting and diagnosing.

rbgirshick commented 6 years ago

Hi @jwnsu, @coolbrain, @tshizys, @lwher: we are unable to reproduce this issue on our side.

Can you each provide some more information that might reveal a common pattern?

In particular: operating system, compiler version, CUDA version, cuDNN version, NVIDIA driver version, and GPU models (plus nvidia-smi output).

Here's what we see when training, for example, with GPU ids 1,3,5,7:

CUDA_VISIBLE_DEVICES=1,3,5,7 python2 tools/train_net.py --cfg configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_1x.yaml OUTPUT_DIR /tmp/dbg-cvd-train TRAIN.DATASETS "('coco_2014_minival',)" NUM_GPUS 4

Every 0.1s: nvidia-smi    Fri Jan 26 09:09:26 2018

Fri Jan 26 09:09:26 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 0000:07:00.0     Off |                  Off |
|  0%   42C    P8    17W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           On   | 0000:08:00.0     Off |                  Off |
|  0%   51C    P0   144W / 250W |   7214MiB / 12209MiB |     46%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40           On   | 0000:09:00.0     Off |                  Off |
|  0%   38C    P8    19W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40           On   | 0000:0A:00.0     Off |                  Off |
|  0%   52C    P0   220W / 250W |   7502MiB / 12209MiB |     38%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M40           On   | 0000:0B:00.0     Off |                  Off |
|  0%   40C    P8    17W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M40           On   | 0000:0C:00.0     Off |                  Off |
|  0%   60C    P0    85W / 250W |   7081MiB / 12209MiB |     48%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M40           On   | 0000:0D:00.0     Off |                  Off |
|  0%   40C    P8    20W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla M40           On   | 0000:0E:00.0     Off |                  Off |
|  0%   56C    P0    81W / 250W |   7494MiB / 12209MiB |     40%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7210MiB |
|    3   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7498MiB |
|    5   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7077MiB |
|    7   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7490MiB |
+-----------------------------------------------------------------------------+
zdwong commented 6 years ago

Operating system: Ubuntu 16.04
Compiler version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0
CUDA version: 8.0
cuDNN version: v5.1
NVIDIA driver version: 384.111

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00001543:00:00.0 Off |                  Off |
| N/A   42C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00003134:00:00.0 Off |                  Off |
| N/A   42C    P0    39W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00004975:00:00.0 Off |                  Off |
| N/A   38C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000F3E6:00:00.0 Off |                  Off |
| N/A   38C    P0    40W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

yousongzhu commented 6 years ago

Operating system: CentOS Linux release 7.1.1503
Compiler version: gcc version 4.8.2
CUDA version: CUDA 8.0
cuDNN version: cuDNN 6.0.21
NVIDIA driver version: 375.26
GPU models: 4x GeForce GTX TITAN X (12G)

nvidia-smi: image

When using 4 GPUs (0,1,2,3) to train Mask RCNN (e2e_mask_rcnn_R-101-FPN), RetinaNet (retinanet_R-101-FPN) or Faster RCNN (e2e_faster_rcnn_R-50-FPN), the error “context_gpu.h:307: an illegal memory access was encountered” or “context_gpu.h:170. Encountered CUDA error: an illegal memory access was encountered Error from operator: input: "gpu_0/retnet_cls_pred_fpn3_b_grad" input: "gpu_2/retnet_cls_pred_fpn3_b_grad" output: "gpu_0/retnet_cls_pred_fpn3_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }” occurs.

But with one GPU or two GPUs (0,1 or 2,3), it trains normally. Thanks.

rbgirshick commented 6 years ago

@jwnsu: looking at your error more closely ("invalid device ordinal"), it looks like you're trying to train with a config set up for 8 GPUs but restricting the process to have only access to 4 (via CUDA_VISIBLE_DEVICES). The "invalid device ordinal" error is because it's trying to create ops on devices that the process does not have access to.

rbgirshick commented 6 years ago

@coolbrain, @tshizys: thanks for the details. What happens if you use two GPUs using ids {0,2}, {0,3}, {1,2}, or {1,3}?

jwnsu commented 6 years ago

@rbgirshick you are right, I picked the wrong config file (with the 8-GPU setting) to try yesterday. Just tried again with the right config file (4 GPUs; the error occurs with GPU ids "1,2,4,5", while "0,1,2,3" works fine), and the error is now similar to what others are seeing:

I0127 09:06:48.220716 10872 context_gpu.cu:325] Total: 20748 MB
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/retnet_bbox_pred_fpn3_b_grad" input: "gpu_2/retnet_bbox_pred_fpn3_b_grad" output: "gpu_0/retnet_bbox_pred_fpn3_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
  what():  [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_2/retnet_cls_conv_n3_fpn3" input: "gpu_2/__m13_shared" output: "gpu_2/__m13_shared" name: "" type: "ReluGradient" arg { name: "cudnn_exhaustive_search" i: 0 } arg { name: "order" s: "NCHW" } device_option { device_type: 1 cuda_gpu_id: 2 } engine: "CUDNN" is_gradient_op: true
*** Aborted at 1517072808 (unix time) try "date -d @1517072808" if you are using GNU date ***
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
PC: @     0x7fd71f6bd428 gsignal
*** SIGABRT (@0x3e900002a18) received by PID 10776 (TID 0x7fd548e3d700) from PID 10776; stack trace: ***
    @     0x7fd71fa63390 (unknown)
    @     0x7fd71f6bd428 gsignal
    @     0x7fd71f6bf02a abort
    @     0x7fd71b51c84d __gnu_cxx::__verbose_terminate_handler()
    @     0x7fd71b51a6b6 (unknown)
    @     0x7fd71b51a701 std::terminate()
    @     0x7fd71b545d38 (unknown)
    @     0x7fd71fa596ba start_thread
    @     0x7fd71f78f41d clone
    @                0x0 (unknown)
./itrain4.sh: line 9: 10776 Aborted                 (core dumped) python2 tools/train_net.py --multi-gpu-testing --cfg configs/iret-rn50-fpn-voc.yaml OUTPUT_DIR ./output
rbgirshick commented 6 years ago

@coolbrain, @tshizys: one shot in the dark is to switch the all-reduce implementation to nccl by passing USE_NCCL True to train_net.py, as in:

python2 tools/train_net.py --multi-gpu-testing \
  --cfg configs/getting_started/tutorial_2gpu_e2e_faster_rcnn_R-50-FPN.yaml \
  OUTPUT_DIR /tmp/output USE_NCCL True

This will require Caffe2 to have been built with nccl ops -- I'm not sure if this is done by default or will require some work to rebuild Caffe2 with nccl support.
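(If you are unsure whether your build has the NCCL ops, a quick check — assuming workspace.RegisteredOperators() is available in your Caffe2 build — is:)

from caffe2.python import workspace
# 'NCCLAllreduce' only shows up here when Caffe2 was built with NCCL support;
# if it is missing, rebuild Caffe2 with NCCL before passing USE_NCCL True.
print('NCCLAllreduce' in workspace.RegisteredOperators())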

yousongzhu commented 6 years ago

@rbgirshick, when using two GPUs, i.e. {0,2}, {0,3}, {1,2} or {1,3}, the error still exists. Here are the details, using {0,3} and training RetinaNet (retinanet_R-101-FPN) as an example:

F0128 12:09:08.461153  4938 context_gpu.cu:387] Error at: /home/yszhu/local/caffe2/caffe2/core/context_gpu.cu:387: an illegal memory access was encountered
Check failure stack trace:
terminate called recursively
terminate called recursively
Aborted at 1517112548 (unix time) try "date -d @1517112548" if you are using GNU date
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what(): [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/fpn_6_relu" input: "gpu_0/fpn_7_w" input: "gpu_0/m23_shared" output: "gpu_0/fpn_7_w_grad" output: "gpu_0/fpn_7_b_grad" output: "gpu_0/m22_shared" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 2 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
    @     0x7f2bdf712772 google::LogMessage::Fail()
PC: @ 0x0 (unknown)
SIGABRT (@0x3e8000012b7) received by PID 4791 (TID 0x7f2a6effd700) from PID 4791; stack trace:
    @     0x7f2bdf7126ce google::LogMessage::SendToLog()
    @     0x7f2c2670e130 (unknown)
    @     0x7f2bdf71204c google::LogMessage::Flush()
    @     0x7f2c25c6a5d7 GI_raise
    @     0x7f2bdf71556d google::LogMessageFatal::~LogMessageFatal()
    @     0x7f2c25c6bcc8 __GI_abort
    @     0x7f2c1b1b1965 gnu_cxx::verbose_terminate_handler()
    @     0x7f2bdfdd1180 caffe2::CUDAContext::Delete()
    @     0x7f2c1b1af946 (unknown)
    @     0x7f2be27f42d9 std::_Sp_counted_base<>::_M_release()
    @     0x7f2c1b1af973 std::terminate()
    @     0x7f2c1b2062c5 (unknown)
    @     0x7f2bdfd377d1 caffe2::Tensor<>::ResizeLike<>()
    @     0x7f2c26706df5 start_thread
    @     0x7f2bdfd6e3e2 _ZN6caffe210CuDNNState7executeIRZNS_19CudnnConvGradientOp13DoRunWithTypeIffffffffEEbvEUlPS0_E1_EEvP11CUstreamstOT
    @     0x7f2c25d2b1ad clone
    @     0x7f2bdfd707e1 caffe2::CudnnConvGradientOp::DoRunWithType<>()
    @     0x0 (unknown)

image

The exact form of the error differs each time, but it is always "Encountered CUDA error: an illegal memory access was encountered".

yousongzhu commented 6 years ago

I also rebuilt caffe2 with nccl-1.3.5 (following https://caffe2.ai/docs/getting-started.html?platform=centos&configuration=cloud#null__troubleshooting):

image

and switched the all-reduce implementation to nccl by passing USE_NCCL True to train_net.py, as in:

python2 tools/train_net.py --multi-gpu-testing \
  --cfg configs/12_2017_baselines/retinanet_R-101-FPN_1x_4gpus.yaml \
  OUTPUT_DIR results_retinanet_R-101-FPN_1x_4gpus_model USE_NCCL True

The error disappeared ^--^ both when using four GPUs {0,1,2,3} and when using any two GPUs {0,2}, {0,3}, {1,2}, {1,3}. @rbgirshick, thanks very much.

lwher commented 6 years ago

Hi, I enabled the nccl op to train the tutorial network and the error above disappeared. However, the program hangs after loading data and occupies 100% CPU all the time.

.......
I0129 03:25:13.106998 118074 context_gpu.cu:321] GPU 0: 2175 MB
I0129 03:25:13.107028 118074 context_gpu.cu:321] GPU 1: 2078 MB
I0129 03:25:13.107045 118074 context_gpu.cu:321] GPU 2: 2266 MB
I0129 03:25:13.107059 118074 context_gpu.cu:321] GPU 3: 1860 MB
I0129 03:25:13.107072 118074 context_gpu.cu:325] Total: 8381 MB
I0129 03:25:13.122316 118079 context_gpu.cu:321] GPU 0: 2195 MB
I0129 03:25:13.122344 118079 context_gpu.cu:321] GPU 1: 2145 MB
I0129 03:25:13.122361 118079 context_gpu.cu:321] GPU 2: 2267 MB
I0129 03:25:13.122378 118079 context_gpu.cu:321] GPU 3: 1924 MB
I0129 03:25:13.122395 118079 context_gpu.cu:325] Total: 8532 MB
I0129 03:25:13.151623 118079 context_gpu.cu:321] GPU 0: 2245 MB
I0129 03:25:13.151650 118079 context_gpu.cu:321] GPU 1: 2159 MB
I0129 03:25:13.152823 118079 context_gpu.cu:321] GPU 2: 2269 MB
I0129 03:25:13.153623 118079 context_gpu.cu:321] GPU 3: 2020 MB
I0129 03:25:13.154454 118079 context_gpu.cu:325] Total: 8694 MB
I0129 03:25:13.186017 118079 context_gpu.cu:321] GPU 0: 2260 MB
I0129 03:25:13.186053 118079 context_gpu.cu:321] GPU 1: 2214 MB
I0129 03:25:13.186067 118079 context_gpu.cu:321] GPU 2: 2279 MB
I0129 03:25:13.186077 118079 context_gpu.cu:321] GPU 3: 2080 MB
I0129 03:25:13.186089 118079 context_gpu.cu:325] Total: 8835 MB
I0129 03:25:13.215306 118076 context_gpu.cu:321] GPU 0: 2310 MB
I0129 03:25:13.215342 118076 context_gpu.cu:321] GPU 1: 2269 MB
I0129 03:25:13.215351 118076 context_gpu.cu:321] GPU 2: 2308 MB
I0129 03:25:13.215368 118076 context_gpu.cu:321] GPU 3: 2081 MB
I0129 03:25:13.215384 118076 context_gpu.cu:325] Total: 8970 MB
I0129 03:25:13.307595 118084 context_gpu.cu:321] GPU 0: 2310 MB
I0129 03:25:13.307623 118084 context_gpu.cu:321] GPU 1: 2301 MB
I0129 03:25:13.307641 118084 context_gpu.cu:321] GPU 2: 2391 MB
I0129 03:25:13.307652 118084 context_gpu.cu:321] GPU 3: 2104 MB
I0129 03:25:13.307665 118084 context_gpu.cu:325] Total: 9108 MB
I0129 03:25:13.324935 118077 context_gpu.cu:321] GPU 0: 2312 MB
I0129 03:25:13.324965 118077 context_gpu.cu:321] GPU 1: 2313 MB
I0129 03:25:13.324982 118077 context_gpu.cu:321] GPU 2: 2452 MB
I0129 03:25:13.324993 118077 context_gpu.cu:321] GPU 3: 2171 MB
I0129 03:25:13.325011 118077 context_gpu.cu:325] Total: 9250 MB
I0129 03:25:13.343673 118080 context_gpu.cu:321] GPU 0: 2336 MB
I0129 03:25:13.343698 118080 context_gpu.cu:321] GPU 1: 2380 MB
I0129 03:25:13.343715 118080 context_gpu.cu:321] GPU 2: 2468 MB
I0129 03:25:13.343731 118080 context_gpu.cu:321] GPU 3: 2233 MB
I0129 03:25:13.343747 118080 context_gpu.cu:325] Total: 9417 MB
I0129 03:25:13.369802 118085 cuda_nccl_gpu.cc:110] Creating NCCLContext for key: 0:0,1,2,3,
I0129 03:25:13.381914 118076 context_gpu.cu:321] GPU 0: 2361 MB
I0129 03:25:13.381942 118076 context_gpu.cu:321] GPU 1: 2453 MB
I0129 03:25:13.381961 118076 context_gpu.cu:321] GPU 2: 2524 MB
I0129 03:25:13.381978 118076 context_gpu.cu:321] GPU 3: 2247 MB
I0129 03:25:13.381995 118076 context_gpu.cu:325] Total: 9587 MB
I0129 03:25:13.613253 118083 context_gpu.cu:321] GPU 0: 2388 MB
I0129 03:25:13.613292 118083 context_gpu.cu:321] GPU 1: 2525 MB
I0129 03:25:13.613301 118083 context_gpu.cu:321] GPU 2: 2524 MB
I0129 03:25:13.613308 118083 context_gpu.cu:321] GPU 3: 2310 MB
I0129 03:25:13.613315 118083 context_gpu.cu:325] Total: 9748 MB

the program hangs......

my environment:
Operating system: Ubuntu 16.04
Compiler version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0
CUDA version: 8.0
cuDNN version: v5.1
NVIDIA driver version: 384.111

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00001543:00:00.0 Off |                  Off |
| N/A   42C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00003134:00:00.0 Off |                  Off |
| N/A   42C    P0    39W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00004975:00:00.0 Off |                  Off |
| N/A   38C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000F3E6:00:00.0 Off |                  Off |
| N/A   38C    P0    40W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

rbgirshick commented 6 years ago

@lwher: that's unfortunate. The reason we don't use NCCL by default is that it's prone to causing deadlocks, which is what I think you're seeing.

zdwong commented 6 years ago

After rebuilding caffe2 with NCCL, I reran the program with this script:

python tools/train_net.py \
  --multi-gpu-testing \
  --cfg configs/getting_started/tutorial_4gpu_e2e_faster_rcnn_R-50-FPN.yaml \
  OUTPUT_DIR ./output USE_NCCL True

It throws this error:

Creating NCCLContext for key: 0:0,1,2,3,
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in GDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what(): [enforce fail at cuda_nccl_gpu.cc:40] status == ncclSuccess. 2 vs 0. Error at: /mnt/hzhida/project/caffe2/caffe2/contrib/nccl/cuda_nccl_gpu.cc40: system error Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" input: "gpu_2/rpn_cls_logits_fpn2_w_grad" input: "gpu_3/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" output: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_2/rpn_cls_logits_fpn2_w_grad" output: "gpu_3/rpn_cls_logits_fpn2_w_grad" name: "" type: "NCCLAllreduce" device_option { device_type: 1 cuda_gpu_id: 0 }
Aborted at 1517210588 (unix time) try "date -d @1517210588" if you are using GNU date
PC: @ 0x7ff1e0383428 gsignal
SIGABRT (@0x3e800007a46) received by PID 31302 (TID 0x7fefb5ffb700) from PID 31302; stack trace:
I0129 07:23:08.187249 31591 cuda_nccl_gpu.cc:110] Creating NCCLContext for key: 0:0,1,2,3,
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in GDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called recursively
    @     0x7ff1e0729390 (unknown)
I0129 07:23:08.188051 31592 context_gpu.cu:321] GPU 0: 2466 MB
I0129 07:23:08.188074 31592 context_gpu.cu:321] GPU 1: 2387 MB
I0129 07:23:08.188091 31592 context_gpu.cu:321] GPU 2: 2311 MB
I0129 07:23:08.188099 31592 context_gpu.cu:321] GPU 3: 2382 MB
I0129 07:23:08.188107 31592 context_gpu.cu:325] Total: 9548 MB
    @     0x7ff1e0383428 gsignal
    @     0x7ff1e038502a abort
    @     0x7ff1da16284d __gnu_cxx::__verbose_terminate_handler()
    @     0x7ff1da1606b6 (unknown)
    @     0x7ff1da160701 std::terminate()
    @     0x7ff1da18bd38 (unknown)
    @     0x7ff1e071f6ba start_thread
    @     0x7ff1e045541d clone
    @     0x0 (unknown)
Aborted (core dumped)

Running Environment:
Operating system: Ubuntu 16.04
Compiler version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0
CUDA version: 8.0
cuDNN version: v5.1
NVIDIA driver version: 384.111

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00001543:00:00.0 Off |                  Off |
| N/A   42C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00003134:00:00.0 Off |                  Off |
| N/A   42C    P0    39W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00004975:00:00.0 Off |                  Off |
| N/A   38C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000F3E6:00:00.0 Off |                  Off |
| N/A   38C    P0    40W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

ir413 commented 6 years ago

One additional note about NCCL: Caffe2 builds with NCCL by default so there is no need to rebuild it.

Yangqing commented 6 years ago

Jumping onto this: since the illegal memory access is from the Add operator, you might want to check whether direct peer access is available between the GPUs that you are using. The current Add op relies on that, and if it is not available we might indeed want to fix the code. Basically, to check, in Python, do:

from caffe2.python import workspace
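# Prints an N x N boolean matrix; entry [i][j] says whether GPU i can directly access GPU j's memory.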
print(workspace.GetCudaPeerAccessPattern())

Could you paste the output of that for debugging? (Especially, if you are using CUDA_VISIBLE_DEVICES, make sure you invoke python with that too)
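For example, a minimal way to check the pattern for a restricted set of devices (the ids below are only placeholders) is to set CUDA_VISIBLE_DEVICES before Caffe2 touches CUDA. Launching python with the variable already exported, as in the training commands above, is the safest; setting it at the top of a script should also work:

import os
# Must be set before Caffe2/CUDA is initialized; "1,3,5,7" is just an example.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3,5,7"

from caffe2.python import workspace
print(workspace.GetCudaPeerAccessPattern())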

jwnsu commented 6 years ago

@Yangqing output from your two debug lines:

[[ True  True False False]
 [ True  True False False]
 [False False  True  True]
 [False False  True  True]]

thx for looking into this issue (and ... caffe/caffe2 frameworks!)

Yangqing commented 6 years ago

@jwnsu thanks! Just to confirm, so the Add operator is adding tensors across gpu {0,1} and {2,3} right? (I assume it is adding stuff together from the 4 gpus).

jwnsu commented 6 years ago

It's a 4-GPU config, with GPU ids specified as "0,1,2,4" (via CUDA_VISIBLE_DEVICES). If GPU ids are configured as "0,1,2,3" (the lowest GPU ids), it works fine without any error.

Liang-Sen commented 6 years ago

@Yangqing My Linux server has 4 M60 GPUs. This is my workspace.GetCudaPeerAccessPattern() output:

[[ True False False False]
 [False  True False False]
 [False False  True False]
 [False False False  True]]

I can train the net with 1 GPU fine, but when I train with 2 or 4 GPUs, I run into the same problems as above, even if I set NCCL = True.

Yangqing commented 6 years ago

Thanks guys. This verifies my assumption that the illegal memory access comes from the Add op not properly handling cross-device communications when peer access is not enabled. Will issue a fix.

JohnnyGambler commented 6 years ago

Same problem with cross-device communication... This machine can use all 4 GPUs [0,1,2,3]: image. This machine can only use [0,1] and [2,3]: image

BTW, I have used 12 CPUs and 4 Titan X GPUs to train a 3D Faster RCNN in the PyTorch framework. Why doesn't PyTorch have this problem?

zdwong commented 6 years ago

@Yangqing Since I can't train Detectron with multiple GPUs, I would like to know how soon the cross-GPU communication problem will be fixed. Thanks.

blateyang commented 6 years ago

@Yangqing I ran into similar problems as above. My Linux workstation has 2 GTX 1080 Ti GPUs. The error info is as follows:

[enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_b_grad" input: "gpu_1/rpn_cls_logits_fpn2_b_grad" output: "gpu_0/rpn_cls_logits_fpn2_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }

and my workspace.GetCudaPeerAccessPattern() output is:

[[ True False]
 [False  True]]

Is it a cross-GPU communication problem too? If not, can anyone help me fix it? Thanks.

zdwong commented 6 years ago

Yes, it is the same problem. The cross-GPU gradients can't be added together because the GPUs can't communicate with each other. If you want to solve the problem, maybe you could copy the gradients from GPU to CPU, sum and average them there, and finally copy the averaged gradient from CPU back to the GPUs. @blateyang
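For reference, a rough sketch of that CPU-staged all-reduce for a single gradient blob, using standard Caffe2 ops; the blob naming and the add_cpu_allreduce helper are only illustrative, not how Detectron actually builds its graph:

from caffe2.python import core
from caffe2.proto import caffe2_pb2

def add_cpu_allreduce(net, grad_name, num_gpus):
    """Average one gradient blob across GPUs by staging the sum on the CPU."""
    cpu_grads = []
    # Copy each GPU's gradient down to host memory.
    for gpu_id in range(num_gpus):
        with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, gpu_id)):
            cpu_grads.append(net.CopyGPUToCPU(
                'gpu_{}/{}'.format(gpu_id, grad_name),
                'cpu_{}/{}'.format(gpu_id, grad_name)))
    # Sum and average on the CPU, where no peer access is needed.
    with core.DeviceScope(core.DeviceOption(caffe2_pb2.CPU)):
        summed = net.Sum(cpu_grads, 'cpu/{}_sum'.format(grad_name))
        averaged = net.Scale(summed, 'cpu/{}_avg'.format(grad_name),
                             scale=1.0 / num_gpus)
    # Broadcast the averaged gradient back to every GPU.
    for gpu_id in range(num_gpus):
        with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, gpu_id)):
            net.CopyCPUToGPU(averaged, 'gpu_{}/{}'.format(gpu_id, grad_name))

This is slower than a peer-to-peer or NCCL all-reduce, but it avoids any direct GPU-to-GPU copies.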

blateyang commented 6 years ago

Thanks for your advice! @coolbrain But I can't understand why some people can successfully train models with two or more GPUs. Haven't they run into the same cross-GPU communication problem?

jwnsu commented 6 years ago

Training on 4 GPUs with either the lowest GPU ids (0,1,2,3) or the highest GPU ids (4,5,6,7) works here without any error (8 GPUs might work too, but I have not tried it yet). It only has issues with a mix of particular ids, e.g. "0,1,2,4" or "1,3,5,7".

I suspect the caffe2 cross-GPU communication issue may behave differently on individual hardware builds (rbgirshick mentioned earlier that Facebook's M40 server works with a mix of ids too).

Tangshitao commented 6 years ago

I came across the same problem. Is this fixed?

yuzcccc commented 6 years ago

I met the same problem on a workstation with 4 GTX 1080 Ti GPUs. Multi-GPU training works well on other platforms, such as Caffe and TensorFlow. This is my workspace.GetCudaPeerAccessPattern() output:

[[ True  True False False]
 [ True  True False False]
 [False False  True  True]
 [False False  True  True]]

The two-GPU configs (with {0,1} or {2,3}) work well. Three or four GPUs hit the aforementioned problem. However, my error is not on the Add operator; I remember the type is Copy.

fliman commented 6 years ago

Has the issue been fixed?

II-Matto commented 6 years ago

@rbgirshick Hi, I met the same problem as @lwher. The program gets stuck about 50% of the time with NCCL on my machine with Ubuntu 14.04 and 4 GPUs. Is there a solution to avoid this behavior of NCCL? Many thanks!

xieshuqin commented 6 years ago

@Yangqing Hi, I met the same issue in the Copy operator. When I don't add the USE_NCCL True flag, the errors are as follows:

E0325 02:26:02.258566  8284 operator_schema.cc:73] Input index 0 and output idx 0 (gpu_0/res3_0_branch2a_w_grad) are set to be in-place but this is actually not supported by op Copy
Original python traceback for operator 2817 in network `generalized_rcnn` in exception above (most recent call last):
  File "tools/train_net.py", line 358, in <module>
  File "tools/train_net.py", line 196, in main
  File "tools/train_net.py", line 205, in train_model
  File "tools/train_net.py", line 283, in create_model
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 120, in create
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 92, in generalized_rcnn
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 254, in build_generic_detection_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 42, in build_data_parallel_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 84, in _add_allreduce_graph
  File "/home/shuqin/git/caffe2/build/caffe2/python/muji.py", line 64, in Allreduce
  File "/home/shuqin/git/caffe2/build/caffe2/python/muji.py", line 204, in AllreduceFallback
Traceback (most recent call last):
  File "tools/train_net.py", line 358, in <module>
    main()
  File "tools/train_net.py", line 196, in main
    checkpoints = train_model()
  File "tools/train_net.py", line 210, in train_model
    setup_model_for_training(model, output_dir)
  File "tools/train_net.py", line 316, in setup_model_for_training
    workspace.CreateNet(model.net)
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 166, in CreateNet
    StringifyProto(net), overwrite,
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 192, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.cc:125] schema->Verify(operator_def). Operator def did not pass schema checking: input: "gpu_0/res3_0_branch2a_w_grad" output: "gpu_0/res3_0_branch2a_w_grad" name: "" type: "Copy" device_option { device_type: 1 cuda_gpu_id: 0 }

If I add the USE_NCCL True flag, the errors become:

Original python traceback for operator 2928 in network `generalized_rcnn` in exception above (most recent call last):
  File "tools/train_net.py", line 358, in <module>
  File "tools/train_net.py", line 196, in main
  File "tools/train_net.py", line 205, in train_model
  File "tools/train_net.py", line 283, in create_model
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 120, in create
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 92, in generalized_rcnn
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 254, in build_generic_detection_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 42, in build_data_parallel_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 82, in _add_allreduce_graph
Traceback (most recent call last):
  File "tools/train_net.py", line 358, in <module>
    main()
  File "tools/train_net.py", line 196, in main
    checkpoints = train_model()
  File "tools/train_net.py", line 217, in train_model
    workspace.RunNet(model.net.Proto().name)
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 230, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 192, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at cuda_nccl_gpu.cc:40] status == ncclSuccess. 2 vs 0.  Error at: /home/shuqin/git/caffe2/caffe2/contrib/nccl/cuda_nccl_gpu.cc40: system error Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_b_grad" input: "gpu_1/rpn_cls_logits_fpn2_b_grad" input: "gpu_2/rpn_cls_logits_fpn2_b_grad" output: "gpu_0/rpn_cls_logits_fpn2_b_grad" output: "gpu_1/rpn_cls_logits_fpn2_b_grad" output: "gpu_2/rpn_cls_logits_fpn2_b_grad" name: "" type: "NCCLAllreduce" device_option { device_type: 1 cuda_gpu_id: 0 }

My system is Ubuntu 14.04, with CUDA 8.0 and cuDNN 5.1. My machine has 8 GPUs, but I tested the code only on the last 4, so communication between the GPUs should not be the problem. I use NCCL 2.1.15 for CUDA 8.0.

Hope this issue can be fixed soon. It's pretty annoying.

melody-rain commented 6 years ago

This problem still exists, right?

blateyang commented 6 years ago

By adding 'USE_NCCL True' when running multi-GPU training, I successfully got my training started. Although a deadlock may sometimes happen, you can try modifying some training params such as the learning rate to work around it.

pkuxwguan commented 6 years ago

The problem still exists.

pkuxwguan commented 6 years ago

@xieshuqin I met the same problem 'status == ncclSuccess. 2 vs 0.' as you when using 'USE_NCCL True'. How did you solve this problem? Thanks.

xieshuqin commented 6 years ago

@pkuxwguan My issue has been fixed, but I forgot how I fixed it. Sorry about that. I do remember the problem should be related to an incorrect installation of NCCL.

daquexian commented 6 years ago

Hi all, I also suffered from this issue, so I finally fixed it by myself. https://github.com/pytorch/pytorch/pull/6896 solved this issue :)

illutheplanet commented 6 years ago

Can anybody tell me whether I can run Mask R-CNN with only one GPU?

yuzcccc commented 6 years ago

@daquexian I tried your PR, it works!!! Thanks very much

Feynman27 commented 6 years ago

@daquexian This PR doesn't appear to work for me. I'm experiencing deadlocks while using a single GPU without NCCL and also while using 2 GPUs with USE_NCCL True. After changing muji.py according to your PR and running with 2 GPUs with USE_NCCL True, I'm still experiencing a deadlock; the training just pauses at random iteration numbers.

daquexian commented 6 years ago

Thanks for trying it :) You don't need to set USE_NCCL=True if you use my PR. NCCL and "muji" are two different GPU communication methods. My PR is a patch for muji, which previously required GPU peer access; it is not for NCCL. Just set USE_NCCL=False and my PR will work.
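For the curious, the idea behind the patch can be sketched roughly like this (not the actual PR code; the muji.Allreduce / AllreduceFallback signatures follow Caffe2's python/muji.py):

from caffe2.python import workspace, muji

def allreduce_for_this_machine(net, blobs, reduced_affix="_reduced", gpu_indices=None):
    # Peer-to-peer Allreduce assumes the GPUs can access each other's memory;
    # when the access pattern says they cannot, fall back to the slower
    # AllreduceFallback, which stages the additions on a single device.
    pattern = workspace.GetCudaPeerAccessPattern()
    if pattern.all():
        return muji.Allreduce(net, blobs, reduced_affix, gpu_indices)
    return muji.AllreduceFallback(net, blobs, reduced_affix, gpu_indices)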


Feynman27 commented 6 years ago

Maybe I'm missing something, but if I set USE_NCCL=False, and use your modified muji.py and muji_test.py PR, I get the original error:

I0502 14:35:57.192476 79712 context_gpu.cu:318] Total: 23025 MB
E0502 14:35:58.382604 79711 net_dag.cc:195] Exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/rpn_cls_logits_fpn2_b_grad" input: "gpu_1/rpn_cls_logits_fpn2_b_grad" output: "gpu_0/rpn_cls_logits_fpn2_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
E0502 14:35:58.382622 79712 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
F0502 14:35:58.382670 79711 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 14:35:58.382683 79712 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
E0502 14:35:58.383510 79709 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_1/fpn_res3_3_sum" input: "gpu_1/conv_rpn_fpn2_w" input: "gpu_1/__m18_shared" output: "_gpu_1/conv_rpn_fpn2_w_grad_autosplit_2" output: "_gpu_1/conv_rpn_fpn2_b_grad_autosplit_2" output: "_gpu_1/fpn_res3_3_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 1 } engine: "CUDNN" is_gradient_op: true
E0502 14:35:58.383541 79713 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at conv_op_cudnn.cc:1290] status == CUDNN_STATUS_SUCCESS. 8 vs 0. , Error at: /home/markable-ai/pytorch/caffe2/operators/conv_op_cudnn.cc:1290: CUDNN_STATUS_EXECUTION_FAILED Error from operator: 
input: "gpu_3/conv_rpn_fpn4" input: "gpu_3/rpn_bbox_pred_fpn2_w" input: "gpu_3/rpn_bbox_pred_fpn4_grad" output: "_gpu_3/rpn_bbox_pred_fpn2_w_grad_autosplit_1" output: "_gpu_3/rpn_bbox_pred_fpn2_b_grad_autosplit_1" output: "gpu_3/__m13_shared" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
E0502 14:35:58.383591 79706 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_3/conv_rpn_fpn3" input: "gpu_3/rpn_cls_logits_fpn2_w" input: "gpu_3/rpn_cls_logits_fpn3_grad" output: "_gpu_3/rpn_cls_logits_fpn2_w_grad_autosplit_2" output: "_gpu_3/rpn_cls_logits_fpn2_b_grad_autosplit_2" output: "_gpu_3/conv_rpn_fpn3_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
F0502 14:35:58.382683 79712 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encounteredF0502 14:35:58.434631 79709 context_gpu.h:107] FCheck failed: error == cudaSuccess an illegal memory access was encountered0502 14:35:58.434648 79713 c*** Check failure stack trace: ***
E0502 14:35:58.383741 79700 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_3/conv_rpn_fpn2" input: "gpu_3/rpn_cls_logits_fpn2_w" input: "gpu_3/rpn_cls_logits_fpn2_grad" output: "_gpu_3/rpn_cls_logits_fpn2_w_grad_autosplit_3" output: "_gpu_3/rpn_cls_logits_fpn2_b_grad_autosplit_3" output: "_gpu_3/conv_rpn_fpn2_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
Aborted (core dumped)

I'm using Cuda 9.1, cudnn 7.1 with 4 V100s.

daquexian commented 6 years ago

@Feynman27 Could you tell me which branch (like Allreduce4, Allreduce4Group2, Allreduce2 or others) of Allreduce in the updated muji.py is entered? You might want to add some print statements in these branches to find out. And what happens if you replace the implementation of Allreduce by just calling AllreduceFallback? It would be great if you could also provide your GPU access pattern like https://github.com/facebookresearch/Detectron/issues/32#issuecomment-361739340. Thanks!
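A quick, non-invasive way to see which branch runs, assuming the branch helpers keep the names above (they may differ slightly between Caffe2 versions), is to wrap them before building the model:

from caffe2.python import muji

# Wrap each muji allreduce branch so it announces itself when called.
for _name in ("Allreduce2", "Allreduce4", "Allreduce4Group2", "AllreduceFallback"):
    _orig = getattr(muji, _name)

    def _make_wrapper(fn, label):
        def _wrapper(*args, **kwargs):
            print("muji branch taken:", label)
            return fn(*args, **kwargs)
        return _wrapper

    setattr(muji, _name, _make_wrapper(_orig, _name))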

Feynman27 commented 6 years ago

Allreduce4 is being called. The gpu access pattern is:

>>> from caffe2.python import workspace
>>> print(workspace.GetCudaPeerAccessPattern())
[[ True False False False]
 [False  True False False]
 [False False  True False]
 [False False False  True]]

I'll try calling AllreduceFallback.

Feynman27 commented 6 years ago

Calling AllreduceFallback gives an error similar to the one above:

I0502 17:08:51.294476 88651 context_gpu.cu:318] Total: 22524 MB
E0502 17:08:52.009866 88659 net_dag.cc:195] Exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
F0502 17:08:52.009990 88659 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
E0502 17:08:52.010440 88651 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_2/fpn_res3_3_sum" input: "gpu_2/conv_rpn_fpn2_w" input: "gpu_2/__m15_shared" output: "_gpu_2/conv_rpn_fpn2_w_grad_autosplit_2" output: "_gpu_2/conv_rpn_fpn2_b_grad_autosplit_2" output: "_gpu_2/fpn_res3_3_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 2 } engine: "CUDNN" is_gradient_op: true
E0502 17:08:52.010524 88663 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_1/fpn_res2_2_sum" input: "gpu_1/conv_rpn_fpn2_w" input: "gpu_1/__m12_shared" output: "_gpu_1/conv_rpn_fpn2_w_grad_autosplit_3" output: "_gpu_1/conv_rpn_fpn2_b_grad_autosplit_3" output: "_gpu_1/fpn_res2_2_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 1 } engine: "CUDNN" is_gradient_op: true
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
E0502 17:08:52.010577 88653 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_0/fpn_res4_22_sum" input: "gpu_0/conv_rpn_fpn2_w" input: "gpu_0/__m15_shared" output: "_gpu_0/conv_rpn_fpn2_w_grad_autosplit_1" output: "_gpu_0/conv_rpn_fpn2_b_grad_autosplit_1" output: "_gpu_0/fpn_res4_22_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
07] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
07] Check failed: error == cudaSuccess an illegal memory access was encounteredF0502 17:08:52.061749 88653 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
Aborted (core dumped
daquexian commented 6 years ago

@Feynman27 It's strange. According to your GPU access pattern, AllreduceFallback instead of Allreduce4 should be called. And when you called AllreduceFallback manually, the error message doesn't appear to come from AllreduceFallback. Did you change the muji.py in the right folder? For example, if the caffe2 python package is in /usr/lib/python/site-packages/caffe2, then changing the muji.py in caffe2's source folder (like ~/caffe2/python) will not work.
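One quick way to check which copy is actually being imported (the paths above are just examples):

from caffe2.python import muji
# The file printed here is the muji.py that Caffe2 actually loads; edits made
# to a different checkout of the source tree will have no effect.
print(muji.__file__)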

yuzcccc commented 6 years ago

@Feynman27 did you rebuild caffe2?