facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

double free or corruption (out) on docker #431

Open tereka114 opened 6 years ago

tereka114 commented 6 years ago

I am trying to train a custom model with Detectron, but training sometimes fails with the error below.

Questions

1. I use the official Docker image. When I run the program inside the container, the process uses 4 GPUs on the host (one GPU is heavily used, the others only slightly). Why does Detectron (Caffe2) use all 4 GPUs?

2. I want long training runs to finish without this error. Please advise on how to solve it.

Details are as follows.

Expected results

Training runs to completion.

Actual results

Training stops partway through.

Error Message

json_stats: {"accuracy_cls": 0.998973, "eta": "0:10:38", "iter": 100, "loss": 0.319385, "loss_bbox": 0.000002, "loss_cls": 0.009161, "loss_mask": 0.257079, "loss_rpn_bbox_fpn2": 0.000740, "loss_rpn_bbox_fpn3": 0.000000, "loss_rpn_bbox_fpn4": 0.000000, "loss_rpn_bbox_fpn5": 0.000000, "loss_rpn_bbox_fpn6": 0.000000, "loss_rpn_cls_fpn2": 0.038432, "loss_rpn_cls_fpn3": 0.007352, "loss_rpn_cls_fpn4": 0.006623, "loss_rpn_cls_fpn5": 0.000000, "loss_rpn_cls_fpn6": 0.000000, "lr": 0.000467, "mb_qsize": 64, "mem": 1473, "time": 0.709308}
json_stats: {"accuracy_cls": 0.998983, "eta": "0:10:27", "iter": 120, "loss": 0.279284, "loss_bbox": 0.000002, "loss_cls": 0.008208, "loss_mask": 0.222271, "loss_rpn_bbox_fpn2": 0.000781, "loss_rpn_bbox_fpn3": 0.000000, "loss_rpn_bbox_fpn4": 0.000000, "loss_rpn_bbox_fpn5": 0.000000, "loss_rpn_bbox_fpn6": 0.000000, "loss_rpn_cls_fpn2": 0.037556, "loss_rpn_cls_fpn3": 0.004894, "loss_rpn_cls_fpn4": 0.004769, "loss_rpn_cls_fpn5": 0.000000, "loss_rpn_cls_fpn6": 0.000000, "lr": 0.000493, "mb_qsize": 64, "mem": 1473, "time": 0.712681}
json_stats: {"accuracy_cls": 0.998695, "eta": "0:10:13", "iter": 140, "loss": 0.293388, "loss_bbox": 0.000001, "loss_cls": 0.009952, "loss_mask": 0.236902, "loss_rpn_bbox_fpn2": 0.000819, "loss_rpn_bbox_fpn3": 0.000000, "loss_rpn_bbox_fpn4": 0.000000, "loss_rpn_bbox_fpn5": 0.000000, "loss_rpn_bbox_fpn6": 0.000000, "loss_rpn_cls_fpn2": 0.036763, "loss_rpn_cls_fpn3": 0.003173, "loss_rpn_cls_fpn4": 0.005055, "loss_rpn_cls_fpn5": 0.000000, "loss_rpn_cls_fpn6": 0.000000, "lr": 0.000520, "mb_qsize": 64, "mem": 1473, "time": 0.713324}
json_stats: {"accuracy_cls": 0.998913, "eta": "0:10:00", "iter": 160, "loss": 0.308416, "loss_bbox": 0.000002, "loss_cls": 0.008798, "loss_mask": 0.254004, "loss_rpn_bbox_fpn2": 0.000719, "loss_rpn_bbox_fpn3": 0.000000, "loss_rpn_bbox_fpn4": 0.000000, "loss_rpn_bbox_fpn5": 0.000000, "loss_rpn_bbox_fpn6": 0.000000, "loss_rpn_cls_fpn2": 0.036889, "loss_rpn_cls_fpn3": 0.003224, "loss_rpn_cls_fpn4": 0.005176, "loss_rpn_cls_fpn5": 0.000000, "loss_rpn_cls_fpn6": 0.000000, "lr": 0.000547, "mb_qsize": 64, "mem": 1473, "time": 0.714428}
*** Error in `python2': double free or corruption (out): 0x00007f96540ac840 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f97d4c887e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f97d4c9137a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f97d4c9553c]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x1edef)[0x7f97cc71cdef]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x22032)[0x7f97cc720032]
python2(PyEval_EvalFrameEx+0x6162)[0x4ca0d2]
python2(PyEval_EvalFrameEx+0x5e0f)[0x4c9d7f]
python2(PyEval_EvalCodeEx+0x255)[0x4c2705]
python2[0x4de69e]
python2(PyObject_Call+0x43)[0x4b0c93]
python2[0x4f452e]
python2(PyObject_Call+0x43)[0x4b0c93]
python2(PyEval_CallObjectWithKeywords+0x30)[0x4ce540]
/usr/local/caffe2_build/caffe2/python/caffe2_pybind11_state_gpu.so(+0x82067)[0x7f97c9b0a067]
/usr/local/caffe2_build/caffe2/python/caffe2_pybind11_state_gpu.so(+0x83eb1)[0x7f97c9b0beb1]
/usr/local/caffe2_build/caffe2/python/caffe2_pybind11_state_gpu.so(+0x4920b)[0x7f97c9ad120b]
/usr/local/caffe2_build/caffe2/python/caffe2_pybind11_state_gpu.so(+0x95ff8)[0x7f97c9b1dff8]
/usr/local/caffe2_build/caffe2/python/caffe2_pybind11_state_gpu.so(+0x925b5)[0x7f97c9b1a5b5]
/usr/local/caffe2_build/lib/libcaffe2.so(_ZN6caffe26DAGNet5RunAtEiRKSt6vectorIiSaIiEE+0x5a)[0x7f97c89688fa]
/usr/local/caffe2_build/lib/libcaffe2.so(_ZN6caffe210DAGNetBase14WorkerFunctionEv+0x37c)[0x7f97c89677ec]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f97cebe6c80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f97d4fe26ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f97d4d1841d]

*** Aborted at 1526034216 (unix time) try "date -d @1526034216" if you are using GNU date ***
PC: @     0x7f97d4c46428 gsignal
*** SIGABRT (@0x45b7) received by PID 17847 (TID 0x7f96927fc700) from PID 17847; stack trace: ***
    @     0x7f97d4fec390 (unknown)
    @     0x7f97d4c46428 gsignal
    @     0x7f97d4c4802a abort
    @     0x7f97d4c887ea (unknown)
    @     0x7f97d4c9137a (unknown)
    @     0x7f97d4c9553c cfree
    @     0x7f97cc71cdef npy_free_cache
    @     0x7f97cc720032 array_dealloc
    @           0x4ca0d2 PyEval_EvalFrameEx
    @           0x4c9d7f PyEval_EvalFrameEx
    @           0x4c2705 PyEval_EvalCodeEx
    @           0x4de69e (unknown)
    @           0x4b0c93 PyObject_Call
    @           0x4f452e (unknown)
    @           0x4b0c93 PyObject_Call
    @           0x4ce540 PyEval_CallObjectWithKeywords
    @     0x7f97c9b0a067 pybind11::detail::object_api<>::operator()<>()
    @     0x7f97c9b0beb1 caffe2::python::PythonOpBase<>::RunOnDevice()
    @     0x7f97c9ad120b caffe2::Operator<>::Run()
    @     0x7f97c9b1dff8 _ZN6caffe213GPUFallbackOpINS_6python8PythonOpINS_10CPUContextELb0EEENS_11SkipIndicesIJEEEE11RunOnDeviceEv
    @     0x7f97c9b1a5b5 caffe2::Operator<>::Run()
    @     0x7f97c89688fa caffe2::DAGNet::RunAt()
    @     0x7f97c89677ec caffe2::DAGNetBase::WorkerFunction()
    @     0x7f97cebe6c80 (unknown)
    @     0x7f97d4fe26ba start_thread
    @     0x7f97d4d1841d clone
    @                0x0 (unknown)

Result of nvidia-smi (host)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 44% 74C P2 83W / 250W | 2088MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 37% 63C P2 88W / 250W | 1281MiB / 11178MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:82:00.0 Off | N/A |
| 49% 81C P2 182W / 250W | 11165MiB / 11178MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:83:00.0 Off | N/A |
| 40% 68C P2 127W / 250W | 2861MiB / 11178MiB | 100% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 24316 C python2 2055MiB |
| 1 17804 C python 1089MiB |
| 1 24316 C python2 159MiB |
| 2 19254 C python 10971MiB |
| 2 24316 C python2 159MiB |
| 3 24316 C python2 159MiB |
| 3 34573 C python 2653MiB |
+-----------------------------------------------------------------------------+

Result of nvidia-smi (container)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 42% 73C P2 81W / 250W | 1896MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 36% 63C P2 89W / 250W | 1281MiB / 11178MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:82:00.0 Off | N/A |
| 48% 83C P2 175W / 250W | 11165MiB / 11178MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:83:00.0 Off | N/A |
| 39% 67C P2 86W / 250W | 2861MiB / 11178MiB | 94% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+

Reproduce Command

Command

python2 detectron/tools/train_net.py --cfg detectron/configs/12_2017_baselines/mask_rcnn_R-50-FPN_1x.yaml OUTPUT_DIR /tmp/detectron-output/

Settings

MODEL:
  TYPE: generalized_rcnn
  CONV_BODY: FPN.add_fpn_ResNet50_conv5_body
  NUM_CLASSES: 2
  FASTER_RCNN: True
  MASK_ON: True
NUM_GPUS: 1
SOLVER:
  WEIGHT_DECAY: 0.0001
  LR_POLICY: steps_with_decay
  BASE_LR: 0.001
  GAMMA: 0.1
  MAX_ITER: 50000
  STEPS: [0, 15000]
FPN:
  FPN_ON: True
  MULTILEVEL_ROIS: True
  MULTILEVEL_RPN: True
FAST_RCNN:
  ROI_BOX_HEAD: fast_rcnn_heads.add_roi_2mlp_head
  ROI_XFORM_METHOD: RoIAlign
  ROI_XFORM_RESOLUTION: 7
  ROI_XFORM_SAMPLING_RATIO: 2
MRCNN:
  ROI_MASK_HEAD: mask_rcnn_heads.mask_rcnn_fcn_head_v1up4convs
  RESOLUTION: 28  # (output mask resolution) default 14
  ROI_XFORM_METHOD: RoIAlign
  ROI_XFORM_RESOLUTION: 14  # default 7
  ROI_XFORM_SAMPLING_RATIO: 2  # default 0
  DILATION: 1  # default 2
  CONV_INIT: MSRAFill  # default GaussianFill
TRAIN:
  WEIGHTS: https://s3-us-west-2.amazonaws.com/detectron/ImageNetPretrained/MSRA/R-50.pkl
  DATASETS: ('dataset',) # For custom
  MAX_SIZE: 256
  BATCH_SIZE_PER_IM: 1000
  SNAPSHOT_ITERS: 1000
TEST:
  SCALE: 800
  MAX_SIZE: 1333
  NMS: 0.5
  RPN_PRE_NMS_TOP_N: 1000  # Per FPN level
  RPN_POST_NMS_TOP_N: 1000
OUTPUT_DIR: .

System information

  1. I use the Dockerfile from: https://github.com/facebookresearch/Detectron/blob/master/docker/Dockerfile
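
For reference, a rough sketch of building and running that image; the tag name follows the repo's install docs but is an assumption here, not something taken from this issue:

# Build the image from the repo's docker/ directory (tag name is a placeholder)
cd Detectron/docker && docker build -t detectron:c2-cuda9-cudnn7 .
# Start a container with GPU access via nvidia-docker
nvidia-docker run --rm -it detectron:c2-cuda9-cudnn7 /bin/bash
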
ir413 commented 6 years ago

Hi @tereka114, it seems like there are two issues here: (1) GPUs being used and (2) the double free or corruption error. Regarding (1), are you sure that Detectron is actually using 4 GPUs, rather than Docker? Could you try limiting Docker to 1 GPU (e.g. by setting CUDA_VISIBLE_DEVICES; see also this page)?
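
A minimal sketch of what that could look like; the nvidia-docker v2 runtime flags and the image name are assumptions, not verified against your setup:

# Expose only GPU 0 to the container (nvidia-docker v2 style; image name is a placeholder)
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -it detectron:c2-cuda9-cudnn7 /bin/bash

# Or restrict the training process itself inside an already-running container
CUDA_VISIBLE_DEVICES=0 python2 detectron/tools/train_net.py \
    --cfg detectron/configs/12_2017_baselines/mask_rcnn_R-50-FPN_1x.yaml \
    OUTPUT_DIR /tmp/detectron-output/

With CUDA_VISIBLE_DEVICES set, CUDA only enumerates the listed device, so the other GPUs on the host should stay untouched.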

tereka114 commented 6 years ago

Hi @ir413, thank you for your comment. I already tried setting CUDA_VISIBLE_DEVICES after reading that post (the other GPUs are no longer used), but the same error still happens sometimes.

liminghao1630 commented 6 years ago

Maybe try changing the batch size to a low value, for example 1, and then increasing it.
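
A minimal sketch of lowering the batch settings from the command line, assuming the same key-value config overrides already used for OUTPUT_DIR in the reproduce command; TRAIN.IMS_PER_BATCH and the value 256 are illustrative, not taken from this issue:

# Same training command with one image per GPU and a smaller RoI minibatch per image
# (the posted config uses BATCH_SIZE_PER_IM: 1000)
python2 detectron/tools/train_net.py \
    --cfg detectron/configs/12_2017_baselines/mask_rcnn_R-50-FPN_1x.yaml \
    TRAIN.IMS_PER_BATCH 1 \
    TRAIN.BATCH_SIZE_PER_IM 256 \
    OUTPUT_DIR /tmp/detectron-output/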

sadakmed commented 4 years ago

Try removing docker-compose.