facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

When I run Mask R-CNN with keypoints, I get the following problem; I think it has nothing to do with my config file (it happens both on COCO keypoints and on my own dataset). #473

Closed CPFelix closed 6 years ago

CPFelix commented 6 years ago

E0605 15:10:48.727974 8764 net_dag.cc:188] Exception from operator chain starting at '' (type 'RoIAlignGradient'):
caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:327] error == cudaSuccess. 2 vs 0.
Error at: /home/scau2/Downloads/pytorch/caffe2/core/context_gpu.cu:327: out of memory
Error from operator:
input: "gpu_0/fpn_res2_2_sum" input: "gpu_0/keypoint_rois_fpn2" input: "gpu_0/m9_shared" output: "gpu_0/m11_shared" name: "" type: "RoIAlignGradient" arg { name: "pooled_h" i: 14 } arg { name: "sampling_ratio" i: 2 } arg { name: "spatial_scale" f: 0.25 } arg { name: "pooled_w" i: 14 } device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true

WARNING workspace.py: 185: Original python traceback for operator 329 in network generalized_rcnn in exception above (most recent call last):

Traceback (most recent call last):
  File "tools/train_net.py", line 128, in <module>
    main()
  File "tools/train_net.py", line 110, in main
    checkpoints = detectron.utils.train.train_model()
  File "/home/scau2/Downloads/Detectron-master/detectron/utils/train.py", line 65, in train_model
    workspace.RunNet(model.net.Proto().name)
  File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 217, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 178, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at context_gpu.cu:327] error == cudaSuccess. 2 vs 0.
Error at: /home/scau2/Downloads/pytorch/caffe2/core/context_gpu.cu:327: out of memory
Error from operator:
input: "gpu_0/fpn_res2_2_sum" input: "gpu_0/keypoint_rois_fpn2" input: "gpu_0/m9_shared" output: "gpu_0/m11_shared" name: "" type: "RoIAlignGradient" arg { name: "pooled_h" i: 14 } arg { name: "sampling_ratio" i: 2 } arg { name: "spatial_scale" f: 0.25 } arg { name: "pooled_w" i: 14 } device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true

This has troubled me for a few weeks; can anyone help me?
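
For illustration only (this is not part of the original report): when a backward op like RoIAlignGradient hits CUDA out of memory, a common mitigation is to shrink the per-GPU workload. A minimal sketch, assuming Detectron's standard config schema (TRAIN.IMS_PER_BATCH, TRAIN.SCALES, TRAIN.MAX_SIZE) and a hypothetical YAML path:

```python
# Sketch: override memory-related training options before building the model.
# The config keys follow Detectron's standard schema; the YAML path below is
# hypothetical and should point at the keypoint config actually being used.
from detectron.core.config import (
    cfg, merge_cfg_from_file, merge_cfg_from_list, assert_and_infer_cfg
)

merge_cfg_from_file('configs/my_keypoint_rcnn_R-50-FPN.yaml')  # hypothetical path
merge_cfg_from_list([
    'TRAIN.IMS_PER_BATCH', '1',   # fewer images per GPU
    'TRAIN.SCALES', '(600,)',     # smaller short side during training
    'TRAIN.MAX_SIZE', '1000',     # smaller cap on the long side
])
assert_and_infer_cfg()
print('%s %s %s' % (cfg.TRAIN.IMS_PER_BATCH, cfg.TRAIN.SCALES, cfg.TRAIN.MAX_SIZE))
```

The same key/value pairs can also be appended to the tools/train_net.py command line, which merges them in the same way.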

gadcam commented 6 years ago

Error at: /home/scau2/Downloads/pytorch/caffe2/core/context_gpu.cu:327: out of memory

@CPFelix can you check that the RAM is not completely filled while running the model? It looks like there is not enough RAM.
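
A side note added for illustration (not from the original comment): the failure is raised in context_gpu.cu, so the memory being exhausted is most likely the GPU's rather than system RAM. One way to test the hypothesis is to poll nvidia-smi while the job runs, e.g. with a small watcher script like this sketch:

```python
# Sketch: log GPU memory usage every few seconds while the training job runs.
# Relies only on nvidia-smi's standard --query-gpu / --format flags.
import subprocess
import time

while True:
    used_total = subprocess.check_output([
        'nvidia-smi',
        '--query-gpu=memory.used,memory.total',
        '--format=csv,noheader',
    ])
    print('%s %s' % (time.strftime('%H:%M:%S'), used_total.strip()))
    time.sleep(5)
```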

CPFelix commented 6 years ago

@gadcam But this problem occurs randomly. I have run successfully with exactly the same setup on the same TITAN X, but sometimes it fails like this with "out of memory".

CPFelix commented 6 years ago

Just now I trained the COCO keypoints dataset successfully after shutting down and restarting the computer, so it is probably a memory problem. But I still can't train my own dataset successfully; I will find the problem!

gadcam commented 6 years ago

@CPFelix 12 GB of RAM seems like enough, but I think you can at least test this hypothesis by monitoring the RAM usage, as the only error I see in your trace is out of memory.

CPFelix commented 6 years ago

@gadcam Now, when I train my own dataset with keypoints, I get a slightly different problem; it seems to have nothing to do with memory:

E0605 17:11:07.014443 4777 net_dag.cc:188] Exception from operator chain starting at '' (type 'Concat'):
caffe2::EnforceNotMet: [enforce fail at conv_pool_op_base.h:237] input.size() > 0.
Error from operator:
input: "gpu_0/_[pose]_roi_feat" input: "gpu_0/conv_fcn1_w" input: "gpu_0/conv_fcn1_b" output: "gpu_0/conv_fcn1" name: "" type: "Conv" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"

I0605 17:11:07.014550 4776 context_gpu.cu:305] GPU 0: 5657 MB
I0605 17:11:07.014562 4776 context_gpu.cu:309] Total: 5657 MB

WARNING workspace.py: 185: Original python traceback for operator 283 in network generalized_rcnn in exception above (most recent call last):
WARNING workspace.py: 190: File "tools/train_net.py", line 128, in <module>
WARNING workspace.py: 190: File "tools/train_net.py", line 110, in main
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/utils/train.py", line 53, in train_model
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/utils/train.py", line 132, in create_model
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/model_builder.py", line 124, in create
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/model_builder.py", line 89, in generalized_rcnn
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/model_builder.py", line 229, in build_generic_detection_model
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/optimizer.py", line 40, in build_data_parallel_model
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/optimizer.py", line 63, in _build_forward_graph
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/model_builder.py", line 217, in _single_gpu_build_func
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/model_builder.py", line 302, in _add_roi_keypoint_head
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/keypoint_rcnn_heads.py", line 214, in add_roi_pose_head_v1convX
WARNING workspace.py: 190: File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/cnn.py", line 169, in Relu
WARNING workspace.py: 190: File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/brew.py", line 106, in scope_wrapper
WARNING workspace.py: 190: File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/helpers/nonlinearity.py", line 36, in relu

Traceback (most recent call last):
  File "tools/train_net.py", line 128, in <module>
    main()
  File "tools/train_net.py", line 110, in main
    checkpoints = detectron.utils.train.train_model()
  File "/home/scau2/Downloads/Detectron-master/detectron/utils/train.py", line 65, in train_model
    workspace.RunNet(model.net.Proto().name)
  File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 217, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 178, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at conv_pool_op_base.h:237] input.size() > 0.
Error from operator:
input: "gpu_0/_[pose]_roi_feat" input: "gpu_0/conv_fcn1_w" input: "gpu_0/conv_fcn1_b" output: "gpu_0/conv_fcn1" name: "" type: "Conv" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
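
For context, an assumption added for illustration (not stated in this thread): an empty input to the first keypoint-head Conv can mean the pose head received zero RoIs, which can happen when some training images carry no labeled keypoints at all. A quick sketch to count such images, assuming a COCO-style keypoints JSON at a hypothetical path:

```python
# Sketch: count images in a COCO-format keypoints annotation file that have
# no labeled keypoints at all. The JSON path is hypothetical.
import json
from collections import defaultdict

with open('annotations/person_keypoints_train.json') as f:
    coco = json.load(f)

keypoints_per_image = defaultdict(int)
for ann in coco.get('annotations', []):
    # 'num_keypoints' is the number of labeled keypoints for one annotation
    keypoints_per_image[ann['image_id']] += ann.get('num_keypoints', 0)

images_without_keypoints = [
    img['id'] for img in coco['images']
    if keypoints_per_image[img['id']] == 0
]
print('%d of %d images have no labeled keypoints'
      % (len(images_without_keypoints), len(coco['images'])))
```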

gadcam commented 6 years ago

RuntimeError: [enforce fail at conv_pool_op_base.h:237] input.size() > 0. Error from operator:

@CPFelix I think this one is #347, so I think you could close this issue and try to get some help by posting your trace in #347.

CPFelix commented 6 years ago

@gadcam OK, thanks.

CPFelix commented 6 years ago

My final problem is the same as in #347; I will describe my problem there.