facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

Keypoints training using 1 GPU error #267

Closed: 7oud closed this issue 6 years ago

7oud commented 6 years ago

I am trying to train an e2e keypoints model using 1 GPU; my modifications are listed below.

  1. Modify train_net.py to use locally downloaded weights:
    @@ -107,7 +107,7 @@ def main():
    -    assert_and_infer_cfg()
    +    assert_and_infer_cfg(cache_urls=False)
  2. Add a new config YAML based on e2e_keypoint_rcnn_R-50-FPN_1x.yaml, with these changes:

    NUM_GPUS: 8 -> NUM_GPUS: 1
    TRAIN:
    WEIGHTS: https://s3-us-west-2.amazonaws.com/detectron/ImageNetPretrained/MSRA/R-50.pkl ->
    WEIGHTS: /home/zoud/Workspace/Github/7oud/Detectron/model/imagenet_pretrained/R-50.pkl
    
    DATASETS: ('keypoints_coco_2014_train', 'keypoints_coco_2014_valminusminival') ->
    DATASETS: ('keypoints_coco_2014_train',)
  3. Unzip the COCO dataset to lib/datasets/data/coco and run the command:
    (caffe2) zoud@i7:~/Workspace/Github/7oud/Detectron$ python2 tools/train_net.py --cfg configs/12_2017_baselines/e2e_keypoint_rcnn_R-50-FPN_1x_1gpu.yaml OUTPUT_DIR tmp/detectron-output/

    Actual results

    
    I0310 15:33:29.882928  8712 operator.cc:173] Operator with engine CUDNN is not available for operator SafeEnqueueBlobs.
    E0310 15:33:31.579140  8743 pybind_state.h:422] Exception encountered running PythonOp function: AssertionError: Negative areas founds

At:
  /home/zoud/Workspace/Github/7oud/Detectron/lib/utils/boxes.py(62): boxes_area
  /home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/FPN.py(449): map_rois_to_fpn_levels
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(278): _distribute_rois_over_fpn_levels
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(286): _add_multilevel_rois
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(121): add_fast_rcnn_blobs
  /home/zoud/Workspace/Github/7oud/Detectron/lib/ops/collect_and_distribute_fpn_rpn_proposals.py(60): forward
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what(): [enforce fail at pybind_state.h:423] . Exception encountered running PythonOp function: AssertionError: Negative areas founds

At: /home/zoud/Workspace/Github/7oud/Detectron/lib/utils/boxes.py(62): boxes_area /home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/FPN.py(449): map_rois_to_fpn_levels /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(278): _distribute_rois_over_fpn_levels /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(286): _add_multilevel_rois /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(121): add_fast_rcnn_blobs /home/zoud/Workspace/Github/7oud/Detectron/lib/ops/collect_and_distribute_fpn_rpn_proposals.py(60): forward Error from operator: input: "gpu_0/rpn_rois_fpn2" input: "gpu_0/rpn_rois_fpn3" input: "gpu_0/rpn_rois_fpn4" input: "gpu_0/rpn_rois_fpn5" input: "gpu_0/rpn_rois_fpn6" input: "gpu_0/rpn_roi_probs_fpn2" input: "gpu_0/rpn_roi_probs_fpn3" input: "gpu_0/rpn_roi_probs_fpn4" input: "gpu_0/rpn_roi_probs_fpn5" input: "gpu_0/rpn_roi_probs_fpn6" input: "gpu_0/roidb" input: "gpu_0/im_info" output: "gpu_0/rois" output: "gpu_0/labels_int32" output: "gpu_0/bbox_targets" output: "gpu_0/bbox_inside_weights" output: "gpu_0/bbox_outside_weights" output: "gpu_0/keypoint_rois" output: "gpu_0/keypoint_locations_int32" output: "gpu_0/keypoint_weights" output: "gpu_0/keypoint_loss_normalizer" output: "gpu_0/rois_fpn2" output: "gpu_0/rois_fpn3" output: "gpu_0/rois_fpn4" output: "gpu_0/rois_fpn5" output: "gpu_0/rois_idx_restore_int32" output: "gpu_0/keypoint_rois_fpn2" output: "gpu_0/keypoint_rois_fpn3" output: "gpu_0/keypoint_rois_fpn4" output: "gpu_0/keypoint_rois_fpn5" output: "gpu_0/keypoint_rois_idx_restore_int32" name: "CollectAndDistributeFpnRpnProposalsOp:gpu_0/rpn_rois_fpn2,gpu_0/rpn_rois_fpn3,gpu_0/rpn_rois_fpn4,gpu_0/rpn_rois_fpn5,gpu_0/rpn_rois_fpn6,gpu_0/rpn_roi_probs_fpn2,gpu_0/rpn_roi_probs_fpn3,gpu_0/rpn_roi_probs_fpn4,gpu_0/rpn_roi_probs_fpn5,gpu_0/rpn_roi_probs_fpn6,gpu_0/roidb,gpu_0/im_info" type: "Python" arg { name: "grad_input_indices" } arg { name: "token" s: "forward:5" } arg { name: "grad_output_indices" } device_option { device_type: 0 } debug_info: " File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 281, in \n main()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 119, in main\n checkpoints = train_model()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 128, in train_model\n model, start_iter, checkpoints, output_dir = create_model()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 206, in create_model\n model = model_builder.create(cfg.MODEL.TYPE, train=True)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 124, in create\n return get_func(model_type_func)(model)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 89, in generalized_rcnn\n freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 229, in build_generic_detection_model\n optim.build_data_parallel_model(model, _single_gpu_build_func)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 40, in build_data_parallel_model\n all_loss_gradients = _build_forward_graph(model, single_gpu_build_func)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 63, in _build_forward_graph\n all_loss_gradients.update(single_gpu_build_func(model))\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 189, 
in _single_gpu_build_func\n model, blob_conv, dim_conv, spatial_scale_conv\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/rpn_heads.py\", line 44, in add_generic_rpn_outputs\n model.CollectAndDistributeFpnRpnProposals()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/detector.py\", line 223, in CollectAndDistributeFpnRpnProposals\n )(blobs_in, blobs_out, name=name)\n File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2137, in \n dict(chain(viewitems(kwargs), viewitems(core_kwargs)))\n File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2024, in _CreateAndAddToSelf\n op = CreateOperator(op_type, inputs, outputs, kwargs)\n"Error from operator: input: "gpu_0/rpn_rois_fpn2" input: "gpu_0/rpn_rois_fpn3" input: "gpu_0/rpn_rois_fpn4" input: "gpu_0/rpn_rois_fpn5" input: "gpu_0/rpn_rois_fpn6" input: "gpu_0/rpn_roi_probs_fpn2" input: "gpu_0/rpn_roi_probs_fpn3" input: "gpu_0/rpn_roi_probs_fpn4" input: "gpu_0/rpn_roi_probs_fpn5" input: "gpu_0/rpn_roi_probs_fpn6" input: "gpu_0/roidb" input: "gpu_0/im_info" output: "gpu_0/rois" output: "gpu_0/labels_int32" output: "gpu_0/bbox_targets" output: "gpu_0/bbox_inside_weights" output: "gpu_0/bbox_outside_weights" output: "gpu_0/keypoint_rois" output: "gpu_0/keypoint_locations_int32" output: "gpu_0/keypoint_weights" output: "gpu_0/keypoint_loss_normalizer" output: "gpu_0/rois_fpn2" output: "gpu_0/rois_fpn3" output: "gpu_0/rois_fpn4" output: "gpu_0/rois_fpn5" output: "gpu_0/rois_idx_restore_int32" output: "gpu_0/keypoint_rois_fpn2" output: "gpu_0/keypoint_rois_fpn3" output: "gpu_0/keypoint_rois_fpn4" output: "gpu_0/keypoint_rois_fpn5" output: "gpu_0/keypoint_rois_idx_restore_int32" name: "CollectAndDistributeFpnRpnProposalsOp:gpu_0/rpn_rois_fpn2,gpu_0/rpn_rois_fpn3,gpu_0/rpn_rois_fpn4,gpu_0/rpn_rois_fpn5,gpu_0/rpn_rois_fpn6,gpu_0/rpn_roi_probs_fpn2,gpu_0/rpn_roi_probs_fpn3,gpu_0/rpn_roi_probs_fpn4,gpu_0/rpn_roi_probs_fpn5,gpu_0/rpn_roi_probs_fpn6,gpu_0/roidb,gpu_0/im_info" type: "Python" arg { name: "grad_input_indices" } arg { name: "token" s: "forward:5" } arg { name: "grad_output_indices" } device_option { device_type: 1 cuda_gpu_id: 0 } debug_info: " File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 281, in \n main()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 119, in main\n checkpoints = train_model()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 128, in train_model\n model, start_iter, checkpoints, output_dir = create_model()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 206, in create_model\n model = model_builder.create(cfg.MODEL.TYPE, train=True)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 124, in create\n return get_func(model_type_func)(model)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 89, in generalized_rcnn\n freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 229, in build_generic_detection_model\n optim.build_data_parallel_model(model, _single_gpu_build_func)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 40, in build_data_parallel_model\n all_loss_gradients = _build_forward_graph(model, single_gpu_build_func)\n File 
\"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 63, in _build_forward_graph\n all_loss_gradients.update(single_gpu_build_func(model))\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 189, in _single_gpu_build_func\n model, blob_conv, dim_conv, spatial_scale_conv\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/rpn_heads.py\", line 44, in add_generic_rpn_outputs\n model.CollectAndDistributeFpnRpnProposals()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/detector.py\", line 223, in CollectAndDistributeFpnRpnProposals\n )(blobs_in, blobs_out, name=name)\n File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2137, in \n dict(chain(viewitems(kwargs), viewitems(core_kwargs)))\n File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2024, in _CreateAndAddToSelf\n op = CreateOperator(op_type, inputs, outputs, kwargs)\n" Aborted at 1520667211 (unix time) try "date -d @1520667211" if you are using GNU date PC: @ 0x7fb544471428 gsignal SIGABRT (@0x3e8000021f7) received by PID 8695 (TID 0x7fb3af7fe700) from PID 8695; stack trace: @ 0x7fb544f27390 (unknown) @ 0x7fb544471428 gsignal @ 0x7fb54447302a abort @ 0x7fb531518b39 __gnu_cxx::verbose_terminate_handler() @ 0x7fb5315171fb cxxabiv1::__terminate() @ 0x7fb531517234 std::terminate() @ 0x7fb531532c8a execute_native_thread_routine_compat @ 0x7fb544f1d6ba start_thread @ 0x7fb54454341d clone



### System information

* Operating system: Ubuntu 16.04
* Compiler version: gcc 5.4.0
* CUDA version: 8.0
* cuDNN version: 7.0.5
* NVIDIA driver version: 384.111
* GPU models (for all devices if they are not all the same): 1x GTX 1080 Ti
* `python --version` output: Python 2.7 (Anaconda)

7oud commented 6 years ago

I changed the DATASETS setting back to the original one

  DATASETS: ('keypoints_coco_2014_train', 'keypoints_coco_2014_valminusminival')

and the error changed to:

I0310 17:43:43.737247 12519 operator.cc:173] Operator with engine CUDNN is not available for operator GetGPUMemoryUsage.
json_stats: {"accuracy_cls": 0.985315, "eta": "10:34:22", "iter": 240, "loss": NaN, "loss_bbox": NaN, "loss_cls": NaN, "loss_kps": NaN, "loss_rpn_bbox_fpn2": NaN, "loss_rpn_bbox_fpn3": NaN, "loss_rpn_bbox_fpn4": NaN, "loss_rpn_bbox_fpn5": 0.008082, "loss_rpn_bbox_fpn6": 0.002842, "loss_rpn_cls_fpn2": NaN, "loss_rpn_cls_fpn3": NaN, "loss_rpn_cls_fpn4": NaN, "loss_rpn_cls_fpn5": 0.010573, "loss_rpn_cls_fpn6": 0.003227, "lr": 0.013067, "mb_qsize": 64, "mem": 8230, "time": 0.424049}
CRITICAL train_net.py: 159: Loss is NaN, exiting...
INFO loader.py: 126: Stopping enqueue thread
/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/numpy/lib/function_base.py:4033: RuntimeWarning: Invalid value encountered in median
  r = func(a, **kwargs)
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
xiashh commented 6 years ago

I encountered the same problem. Have you solved it?

7oud commented 6 years ago

Not yet. Actually, I have no idea at all.

ZZYuting commented 6 years ago

Your problem is similar to #16; it looks like something is wrong in your YAML file. The original YAML is written for 8 GPUs, so you need to edit the SOLVER settings using the "linear scaling rule" mentioned in GETTING_STARTED. For Faster R-CNN, the equivalent schedules are:

Equivalent schedules with...

1 GPU:

 BASE_LR: 0.0025
 MAX_ITER: 60000
 STEPS: [0, 30000, 40000]

2 GPUs:

 BASE_LR: 0.005
 MAX_ITER: 30000
 STEPS: [0, 15000, 20000]

4 GPUs:

 BASE_LR: 0.01
 MAX_ITER: 15000
 STEPS: [0, 7500, 10000]

8 GPUs:

 BASE_LR: 0.02
 MAX_ITER: 7500
 STEPS: [0, 3750, 5000]

You need to adjust the SOLVER in your YAML file accordingly (a small sketch of the rule follows below). I hope this is useful to you.
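
As a rough illustration (a hypothetical helper, not part of Detectron), the rule simply divides BASE_LR by the GPU reduction factor and multiplies MAX_ITER and STEPS by the same factor:

# Hypothetical helper illustrating the linear scaling rule from GETTING_STARTED:
# fewer GPUs means a smaller effective minibatch, so the learning rate shrinks
# by the GPU reduction factor and the schedule is stretched by the same factor.
def scale_schedule(base_lr, max_iter, steps, gpus_from=8, gpus_to=1):
    factor = gpus_from // gpus_to
    return {
        'BASE_LR': base_lr / factor,
        'MAX_ITER': max_iter * factor,
        'STEPS': [s * factor for s in steps],
    }

# Example: map the 8-GPU Faster R-CNN schedule above to 1 GPU.
# Yields BASE_LR 0.0025, MAX_ITER 60000, STEPS [0, 30000, 40000],
# matching the 1-GPU values listed above.
print(scale_schedule(0.02, 7500, [0, 3750, 5000], gpus_from=8, gpus_to=1))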

taoari commented 6 years ago

I am training on my own dataset, and simply following the linear scaling rule was not sufficient. You may need to disable the negative-areas assertion (I am not sure about the consequences of this, but the model can be trained and the output visualizes well). If you still get NaN loss errors, further reducing the learning rate or increasing the number of GPUs should solve the problem. A short illustration of how the negative areas arise follows the snippet.

# Modified boxes_area() in lib/utils/boxes.py: clamp negative areas to zero
# instead of asserting, so training can continue.
import numpy as np

def boxes_area(boxes):
    """Compute the area of an array of boxes."""
    w = (boxes[:, 2] - boxes[:, 0] + 1)
    h = (boxes[:, 3] - boxes[:, 1] + 1)
    areas = w * h
    if not np.all(areas >= 0):
        # Report the offending boxes and clamp their areas to zero.
        print("Negative areas found", boxes[areas < 0])
        areas[areas < 0] = 0.0
    # assert np.all(areas >= 0), 'Negative areas founds'
    return areas
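
To make the failure mode concrete, here is a rough, self-contained sketch (with made-up box coordinates) of how degenerate or NaN boxes can violate the areas >= 0 check in boxes_area:

# Rough illustration with hypothetical inputs: once losses go NaN, predicted
# box deltas can yield boxes with x2 < x1 (or NaN coordinates), so w * h is
# negative or NaN and np.all(areas >= 0) fails with 'Negative areas founds'.
import numpy as np

boxes = np.array([
    [10.0, 10.0, 50.0, 60.0],   # valid box -> area 2091
    [40.0, 40.0, 20.0, 70.0],   # x2 < x1   -> negative area
    [np.nan, 5.0, 30.0, 25.0],  # NaN coord -> NaN area
])
w = boxes[:, 2] - boxes[:, 0] + 1
h = boxes[:, 3] - boxes[:, 1] + 1
areas = w * h
print(areas)                    # [2091., -589., nan]
print(np.all(areas >= 0))       # False
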
wenting-zhao commented 6 years ago

Changing BASE_LR from 0.002 to 0.000125 works for me.

lilichu commented 6 years ago

Hi @taoari! Can you please explain what the assertion means? Will it cause any problem if I just bypass it? By the way, I find that if I reduce the LR, it works.

rbgirshick commented 6 years ago

This should be addressed by 47e457a.