I changed the dataset back to the original one:
DATASETS: ('keypoints_coco_2014_train', 'keypoints_coco_2014_valminusminival')
and the error info changed to:
I0310 17:43:43.737247 12519 operator.cc:173] Operator with engine CUDNN is not available for operator GetGPUMemoryUsage.
json_stats: {"accuracy_cls": 0.985315, "eta": "10:34:22", "iter": 240, "loss": NaN, "loss_bbox": NaN, "loss_cls": NaN, "loss_kps": NaN, "loss_rpn_bbox_fpn2": NaN, "loss_rpn_bbox_fpn3": NaN, "loss_rpn_bbox_fpn4": NaN, "loss_rpn_bbox_fpn5": 0.008082, "loss_rpn_bbox_fpn6": 0.002842, "loss_rpn_cls_fpn2": NaN, "loss_rpn_cls_fpn3": NaN, "loss_rpn_cls_fpn4": NaN, "loss_rpn_cls_fpn5": 0.010573, "loss_rpn_cls_fpn6": 0.003227, "lr": 0.013067, "mb_qsize": 64, "mem": 8230, "time": 0.424049}
CRITICAL train_net.py: 159: Loss is NaN, exiting...
INFO loader.py: 126: Stopping enqueue thread
/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/numpy/lib/function_base.py:4033: RuntimeWarning: Invalid value encountered in median
r = func(a, **kwargs)
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
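For context, the CRITICAL exit above comes from a NaN check on the tracked losses in the training loop; a minimal sketch of such a guard (hypothetical names, not Detectron's actual train_net.py code):

import numpy as np

def check_losses_finite(losses):
    """Abort training as soon as any tracked loss becomes NaN."""
    bad = [name for name, value in losses.items() if np.isnan(value)]
    if bad:
        raise RuntimeError('Loss is NaN (%s), exiting...' % ', '.join(bad))

# check_losses_finite({'loss_cls': float('nan')}) raises immediately,
# matching the "Loss is NaN, exiting..." message above.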
I encountered the same problem. Have you solved it?
Not yet. Actually, I have no idea at all.
Your problem is similar to #16; it seems something is wrong in your yaml file. The original yaml is written for 8 GPUs, so you need to edit the SOLVER using the "linear scaling rule" mentioned in "Getting Started". For Faster R-CNN, the schedules from 1 GPU up to 8 GPUs are:
# 1 GPU
BASE_LR: 0.0025
MAX_ITER: 60000
STEPS: [0, 30000, 40000]

# 2 GPUs
BASE_LR: 0.005
MAX_ITER: 30000
STEPS: [0, 15000, 20000]

# 4 GPUs
BASE_LR: 0.01
MAX_ITER: 15000
STEPS: [0, 7500, 10000]

# 8 GPUs
BASE_LR: 0.02
MAX_ITER: 7500
STEPS: [0, 3750, 5000]
You need to adjust the solver in your yaml file accordingly; I hope this is useful to you. A sketch of the scaling arithmetic follows.
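The arithmetic behind those four schedules can be written down directly. This is only a sketch of the linear scaling rule (the function name and defaults are mine, not Detectron code):

def scale_schedule(num_gpus, ref_gpus=8, base_lr=0.02, max_iter=7500,
                   steps=(0, 3750, 5000)):
    """Linear scaling rule: the LR scales with the number of GPUs (i.e.
    with the effective minibatch size) and the schedule length scales
    inversely. Defaults are the 8-GPU Faster R-CNN settings above."""
    factor = ref_gpus // num_gpus
    return {
        'BASE_LR': base_lr / factor,
        'MAX_ITER': max_iter * factor,
        'STEPS': [s * factor for s in steps],
    }

# scale_schedule(1) -> {'BASE_LR': 0.0025, 'MAX_ITER': 60000,
#                       'STEPS': [0, 30000, 40000]}, the 1-GPU row above.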
I am training on my own dataset, and simply following the linear scaling rule was not sufficient. You may also need to disable the assertion on negative areas (I am not sure about the consequences of this, but the model can be trained and the output visualizes well). If you still get NaN loss errors, further reducing the learning rate or increasing the number of GPUs will solve the problem. The patched boxes_area (a usage sketch follows the code):
import numpy as np

def boxes_area(boxes):
    """Compute the area of an array of boxes."""
    w = (boxes[:, 2] - boxes[:, 0] + 1)
    h = (boxes[:, 3] - boxes[:, 1] + 1)
    areas = w * h
    if not np.all(areas >= 0):
        # Clamp degenerate boxes (x2 < x1 or y2 < y1) to zero area
        # instead of letting the assertion kill the run.
        print("Negative areas found", boxes[areas < 0])
        areas[areas < 0] = 0.0
    # assert np.all(areas >= 0), 'Negative areas founds'
    return areas
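For example, with one valid box and one degenerate box whose x2 < x1 (values made up; np is numpy, as imported above):

boxes = np.array([[10., 10., 20., 20.],   # 11 x 11 -> area 121
                  [30., 10., 25., 20.]])  # w = -4  -> area would be -44
print(boxes_area(boxes))
# Prints the offending second box, then the clamped areas: [121. 0.]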
Lowering BASE_LR from 0.002 to 0.000125 works for me.
@taoari hi! Can you explain what the assertion means, please? Will it cause any problem if I just bypass the assertion? By the way, I find that if I reduce the LR, it works.
This should be addressed by 47e457a.
I tried to train the e2e keypoints model using 1 GPU, with the modifications listed below:

- modified train_net.py to use locally downloaded weights
- added a new config yaml similar to e2e_keypoint_rcnn_R-50-FPN_1x.yaml
- put the COCO dataset under lib/datasets/data/coco
- ran the training command

Actual results:
At:
  /home/zoud/Workspace/Github/7oud/Detectron/lib/utils/boxes.py(62): boxes_area
  /home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/FPN.py(449): map_rois_to_fpn_levels
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(278): _distribute_rois_over_fpn_levels
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(286): _add_multilevel_rois
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(121): add_fast_rcnn_blobs
  /home/zoud/Workspace/Github/7oud/Detectron/lib/ops/collect_and_distribute_fpn_rpn_proposals.py(60): forward
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at pybind_state.h:423] . Exception encountered running PythonOp function: AssertionError: Negative areas founds
At:
  /home/zoud/Workspace/Github/7oud/Detectron/lib/utils/boxes.py(62): boxes_area
  /home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/FPN.py(449): map_rois_to_fpn_levels
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(278): _distribute_rois_over_fpn_levels
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(286): _add_multilevel_rois
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(121): add_fast_rcnn_blobs
  /home/zoud/Workspace/Github/7oud/Detectron/lib/ops/collect_and_distribute_fpn_rpn_proposals.py(60): forward
Error from operator:
input: "gpu_0/rpn_rois_fpn2" input: "gpu_0/rpn_rois_fpn3" input: "gpu_0/rpn_rois_fpn4" input: "gpu_0/rpn_rois_fpn5" input: "gpu_0/rpn_rois_fpn6" input: "gpu_0/rpn_roi_probs_fpn2" input: "gpu_0/rpn_roi_probs_fpn3" input: "gpu_0/rpn_roi_probs_fpn4" input: "gpu_0/rpn_roi_probs_fpn5" input: "gpu_0/rpn_roi_probs_fpn6" input: "gpu_0/roidb" input: "gpu_0/im_info" output: "gpu_0/rois" output: "gpu_0/labels_int32" output: "gpu_0/bbox_targets" output: "gpu_0/bbox_inside_weights" output: "gpu_0/bbox_outside_weights" output: "gpu_0/keypoint_rois" output: "gpu_0/keypoint_locations_int32" output: "gpu_0/keypoint_weights" output: "gpu_0/keypoint_loss_normalizer" output: "gpu_0/rois_fpn2" output: "gpu_0/rois_fpn3" output: "gpu_0/rois_fpn4" output: "gpu_0/rois_fpn5" output: "gpu_0/rois_idx_restore_int32" output: "gpu_0/keypoint_rois_fpn2" output: "gpu_0/keypoint_rois_fpn3" output: "gpu_0/keypoint_rois_fpn4" output: "gpu_0/keypoint_rois_fpn5" output: "gpu_0/keypoint_rois_idx_restore_int32" name: "CollectAndDistributeFpnRpnProposalsOp:gpu_0/rpn_rois_fpn2,gpu_0/rpn_rois_fpn3,gpu_0/rpn_rois_fpn4,gpu_0/rpn_rois_fpn5,gpu_0/rpn_rois_fpn6,gpu_0/rpn_roi_probs_fpn2,gpu_0/rpn_roi_probs_fpn3,gpu_0/rpn_roi_probs_fpn4,gpu_0/rpn_roi_probs_fpn5,gpu_0/rpn_roi_probs_fpn6,gpu_0/roidb,gpu_0/im_info" type: "Python" arg { name: "grad_input_indices" } arg { name: "token" s: "forward:5" } arg { name: "grad_output_indices" } device_option { device_type: 0 } debug_info: " File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 281, in\n main()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 119, in main\n checkpoints = train_model()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 128, in train_model\n model, start_iter, checkpoints, output_dir = create_model()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 206, in create_model\n model = model_builder.create(cfg.MODEL.TYPE, train=True)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 124, in create\n return get_func(model_type_func)(model)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 89, in generalized_rcnn\n freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 229, in build_generic_detection_model\n optim.build_data_parallel_model(model, _single_gpu_build_func)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 40, in build_data_parallel_model\n all_loss_gradients = _build_forward_graph(model, single_gpu_build_func)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 63, in _build_forward_graph\n all_loss_gradients.update(single_gpu_build_func(model))\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 189, in _single_gpu_build_func\n model, blob_conv, dim_conv, spatial_scale_conv\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/rpn_heads.py\", line 44, in add_generic_rpn_outputs\n model.CollectAndDistributeFpnRpnProposals()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/detector.py\", line 223, in CollectAndDistributeFpnRpnProposals\n )(blobs_in, blobs_out, name=name)\n File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2137, in \n dict(chain(viewitems(kwargs), viewitems(core_kwargs)))\n File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2024, in _CreateAndAddToSelf\n op = CreateOperator(op_type, inputs, outputs, kwargs)\n"
Error from operator:
input: "gpu_0/rpn_rois_fpn2" input: "gpu_0/rpn_rois_fpn3" input: "gpu_0/rpn_rois_fpn4" input: "gpu_0/rpn_rois_fpn5" input: "gpu_0/rpn_rois_fpn6" input: "gpu_0/rpn_roi_probs_fpn2" input: "gpu_0/rpn_roi_probs_fpn3" input: "gpu_0/rpn_roi_probs_fpn4" input: "gpu_0/rpn_roi_probs_fpn5" input: "gpu_0/rpn_roi_probs_fpn6" input: "gpu_0/roidb" input: "gpu_0/im_info" output: "gpu_0/rois" output: "gpu_0/labels_int32" output: "gpu_0/bbox_targets" output: "gpu_0/bbox_inside_weights" output: "gpu_0/bbox_outside_weights" output: "gpu_0/keypoint_rois" output: "gpu_0/keypoint_locations_int32" output: "gpu_0/keypoint_weights" output: "gpu_0/keypoint_loss_normalizer" output: "gpu_0/rois_fpn2" output: "gpu_0/rois_fpn3" output: "gpu_0/rois_fpn4" output: "gpu_0/rois_fpn5" output: "gpu_0/rois_idx_restore_int32" output: "gpu_0/keypoint_rois_fpn2" output: "gpu_0/keypoint_rois_fpn3" output: "gpu_0/keypoint_rois_fpn4" output: "gpu_0/keypoint_rois_fpn5" output: "gpu_0/keypoint_rois_idx_restore_int32" name: "CollectAndDistributeFpnRpnProposalsOp:gpu_0/rpn_rois_fpn2,gpu_0/rpn_rois_fpn3,gpu_0/rpn_rois_fpn4,gpu_0/rpn_rois_fpn5,gpu_0/rpn_rois_fpn6,gpu_0/rpn_roi_probs_fpn2,gpu_0/rpn_roi_probs_fpn3,gpu_0/rpn_roi_probs_fpn4,gpu_0/rpn_roi_probs_fpn5,gpu_0/rpn_roi_probs_fpn6,gpu_0/roidb,gpu_0/im_info" type: "Python" arg { name: "grad_input_indices" } arg { name: "token" s: "forward:5" } arg { name: "grad_output_indices" } device_option { device_type: 1 cuda_gpu_id: 0 } debug_info: " File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 281, in \n main()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 119, in main\n checkpoints = train_model()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 128, in train_model\n model, start_iter, checkpoints, output_dir = create_model()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 206, in create_model\n model = model_builder.create(cfg.MODEL.TYPE, train=True)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 124, in create\n return get_func(model_type_func)(model)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 89, in generalized_rcnn\n freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 229, in build_generic_detection_model\n optim.build_data_parallel_model(model, _single_gpu_build_func)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 40, in build_data_parallel_model\n all_loss_gradients = _build_forward_graph(model, single_gpu_build_func)\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 63, in _build_forward_graph\n all_loss_gradients.update(single_gpu_build_func(model))\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 189, in _single_gpu_build_func\n model, blob_conv, dim_conv, spatial_scale_conv\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/rpn_heads.py\", line 44, in add_generic_rpn_outputs\n model.CollectAndDistributeFpnRpnProposals()\n File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/detector.py\", line 223, in CollectAndDistributeFpnRpnProposals\n )(blobs_in, blobs_out, name=name)\n File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2137, in \n dict(chain(viewitems(kwargs), viewitems(core_kwargs)))\n File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2024, in _CreateAndAddToSelf\n op = CreateOperator(op_type, inputs, outputs, kwargs)\n"
*** Aborted at 1520667211 (unix time) try "date -d @1520667211" if you are using GNU date ***
PC: @ 0x7fb544471428 gsignal
*** SIGABRT (@0x3e8000021f7) received by PID 8695 (TID 0x7fb3af7fe700) from PID 8695; stack trace: ***
@ 0x7fb544f27390 (unknown)
@ 0x7fb544471428 gsignal
@ 0x7fb54447302a abort
@ 0x7fb531518b39 __gnu_cxx::__verbose_terminate_handler()
@ 0x7fb5315171fb __cxxabiv1::__terminate()
@ 0x7fb531517234 std::terminate()
@ 0x7fb531532c8a execute_native_thread_routine_compat
@ 0x7fb544f1d6ba start_thread
@ 0x7fb54454341d clone