facebookresearch/Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

Check failed: error == cudaSuccess an illegal memory access was encountered #383

Closed · shiyongde closed this issue 6 years ago

shiyongde commented 6 years ago

When I train RetinaNet with IMS_PER_BATCH: 2, it uses only 1631 MB of GPU memory. The log shows:

I0419 19:38:00.144428 30279 context_gpu.cu:305] GPU 0: 1631 MB
I0419 19:38:00.144462 30279 context_gpu.cu:309] Total: 1631 MB

But if I set IMS_PER_BATCH: 4, I get this error:

I0419 19:29:35.016454 27192 context_gpu.cu:309] Total: 5582 MB
I0419 19:29:35.218364 27190 context_gpu.cu:305] GPU 0: 5712 MB
I0419 19:29:35.218410 27190 context_gpu.cu:309] Total: 5712 MB
E0419 19:30:51.564118 27190 net_dag.cc:188] Exception from operator chain starting at '' (type 'Conv'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:155] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/retnet_bbox_conv_n1_fpn3" input: "gpu_0/retnet_bbox_pred_fpn3_w" input: "gpu_0/retnet_bbox_pred_fpn3_b" output: "gpu_0/retnet_bbox_pred_fpn3" name: "" type: "Conv" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
E0419 19:30:51.564137 27193 net_dag.cc:188] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:155] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/retnet_cls_conv_n1_fpn5" input: "gpu_0/retnet_cls_pred_fpn3_w" input: "gpu_0/m0_shared" output: "gpu_0/retnet_cls_pred_fpn3_w_grad" output: "gpu_0/retnet_cls_pred_fpn3_b_grad" output: "gpu_0/m546_shared" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
F0419 19:30:51.564193 27190 context_gpu.h:106] Check failed: error == cudaSuccess an illegal memory access was encountered
Check failure stack trace:
F0419 19:30:51.564163 27191 context_gpu.h:106] Check failed: error == cudaSuccess an illegal memory access was encountered
Check failure stack trace:
F0419 19:30:51.564193 27190 context_gpu.h:106] Check failed: error == cudaSuccess an illegal memory access was encountered
F0419 19:30:51.564221 27193 context_gpu.h:106] Check failed: error == cudaSuccess an illegal memory access was encountered
Check failure stack trace:
Aborted (core dumped)

I am using K40m GPUs, each with 11439 MiB of memory, so the ~5.7 GB reported above should fit comfortably. Is this a bug?
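For reference, here is a minimal sketch of how the setting in question is changed programmatically, assuming the packaged Detectron layout (detectron.core.config; older checkouts expose the same functions from core.config) and an illustrative config path. It is not taken from the failing run above; it only shows the knob that differs between the two experiments.

# Minimal sketch. Assumptions: module path detectron.core.config and the
# illustrative config file below; adjust both to the local checkout.
from detectron.core.config import (
    assert_and_infer_cfg,
    cfg,
    merge_cfg_from_file,
    merge_cfg_from_list,
)

# Load a RetinaNet training config (path is illustrative).
merge_cfg_from_file('configs/12_2017_baselines/retinanet_R-50-FPN_1x.yaml')

# TRAIN.IMS_PER_BATCH is the number of training images per GPU; the memory
# figures logged above correspond to changing this value from 2 to 4.
merge_cfg_from_list(['TRAIN.IMS_PER_BATCH', '4'])

# Validate derived options and freeze the config before training starts.
assert_and_infer_cfg()

print(cfg.TRAIN.IMS_PER_BATCH)  # -> 4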

System information

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.12                 Driver Version: 387.12                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 00000000:02:00.0 Off |                    0 |
| N/A   34C    P0    63W / 235W |   3757MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 00000000:03:00.0 Off |                    0 |
| N/A   34C    P0    62W / 235W |   5044MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          Off  | 00000000:83:00.0 Off |                    0 |
| N/A   34C    P0    63W / 235W |   3534MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K40m          Off  | 00000000:84:00.0 Off |                    0 |
| N/A   46C    P0   150W / 235W |   1874MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1     23337    C   python                                        2514MiB |
|    1     23339    C   python                                        2514MiB |
|    3     28212    C   python                                        1863MiB |
+-----------------------------------------------------------------------------+
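As an aside (not part of the original report): the process list shows other python jobs already occupying GPUs 1 and 3, so one common precaution is to pin the Detectron run to a single device via CUDA_VISIBLE_DEVICES. The sketch below assumes an illustrative invocation of tools/train_net.py with an example config and output directory; the device choice and paths are placeholders, and the masked device appears as gpu_0 inside the process.

import os
import subprocess

env = os.environ.copy()
# Expose only physical GPU 0 to the training process (device choice is
# illustrative); Caffe2 then enumerates it as gpu_0.
env['CUDA_VISIBLE_DEVICES'] = '0'

# Config path and OUTPUT_DIR are illustrative, not taken from this issue.
subprocess.check_call(
    ['python', 'tools/train_net.py',
     '--cfg', 'configs/12_2017_baselines/retinanet_R-50-FPN_1x.yaml',
     'OUTPUT_DIR', '/tmp/detectron-output'],
    env=env,
)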

rbgirshick commented 6 years ago

Looks like a duplicate of #32.