CPFelix closed this issue 6 years ago
Error at: /home/scau2/Downloads/pytorch/caffe2/core/context_gpu.cu:327: out of memory
@CPFelix can you check that the RAM is not completely filled while running the model? It looks like there is not enough RAM.
@gadcam But this problem occurs randomly. I have run training successfully with the exact same setup on the same TITAN X, but sometimes it fails like this with "out of memory".
Just now I trained on the coco_keypoints dataset successfully after shutting down and restarting the computer, so it is probably a RAM problem. But I still can't train on my own dataset successfully; I will keep looking for the cause!
@CPFelix 12 GB of RAM seems like enough, but I think you can at least test this hypothesis by monitoring the RAM usage, since the only error I see in your trace is out of memory.
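One way to follow that suggestion is a small monitoring loop run in a second terminal while training. This is a minimal sketch under the assumption of a Linux host (consistent with the paths in the trace): it polls /proc/meminfo and prints available RAM; the sample count and interval are arbitrary choices.

```python
# Minimal RAM monitor sketch (assumes Linux, where /proc/meminfo exists).
# Run it alongside training to see whether available memory collapses
# just before the "out of memory" error.
import time


def read_meminfo():
    """Return (MemTotal, MemAvailable) in MB, parsed from /proc/meminfo."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])  # values are reported in kB
    return fields["MemTotal"] // 1024, fields["MemAvailable"] // 1024


if __name__ == "__main__":
    for _ in range(5):  # raise the sample count for a real training run
        total, avail = read_meminfo()
        print("RAM: %d MB available of %d MB total" % (avail, total))
        time.sleep(1)
```

Since the failure is raised in context_gpu.cu, GPU memory is the more likely culprit; `nvidia-smi -l 1` gives the equivalent rolling view of GPU memory usage.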
@gadcam Now when I train on my own keypoint dataset I get a slightly different problem; it seems unrelated to memory:
E0605 17:11:07.014443 4777 net_dag.cc:188] Exception from operator chain starting at '' (type 'Concat'): caffe2::EnforceNotMet: [enforce fail at conv_pool_op_base.h:237] input.size() > 0. Error from operator:
input: "gpu0/[pose]_roi_feat" input: "gpu_0/conv_fcn1_w" input: "gpu_0/conv_fcn1_b" output: "gpu_0/conv_fcn1" name: "" type: "Conv" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
I0605 17:11:07.014550 4776 context_gpu.cu:305] GPU 0: 5657 MB
I0605 17:11:07.014562 4776 context_gpu.cu:309] Total: 5657 MB
WARNING workspace.py: 185: Original python traceback for operator 283
in network generalized_rcnn
in exception above (most recent call last):
WARNING workspace.py: 190: File "tools/train_net.py", line 128, in
RuntimeError: [enforce fail at conv_pool_op_base.h:237] input.size() > 0. Error from operator:
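The `input.size() > 0` enforce above fires when an operator receives an empty blob; here it is the keypoint head's ROI features. With a custom dataset, one plausible cause is images whose annotations contain no labeled keypoints, so a pre-training scan of the COCO-style JSON can help. The helper below is a hypothetical check (not part of Detectron), assuming the standard COCO keypoint encoding of `[x, y, v]` triples where `v > 0` means the keypoint is labeled.

```python
# Hypothetical sanity check for a custom COCO-style keypoint dataset.
# It is NOT a Detectron API; it only inspects the annotation JSON for
# images that contribute no labeled keypoints at all.
import json


def images_without_keypoints(dataset):
    """Return ids of images that have no annotation with any labeled keypoint."""
    has_kps = set()
    for ann in dataset.get("annotations", []):
        # COCO keypoints are flat [x, y, v] triples; v > 0 means labeled.
        kps = ann.get("keypoints", [])
        if any(v > 0 for v in kps[2::3]):
            has_kps.add(ann["image_id"])
    return [img["id"] for img in dataset.get("images", [])
            if img["id"] not in has_kps]


if __name__ == "__main__":
    sample = {
        "images": [{"id": 1}, {"id": 2}],
        "annotations": [
            {"image_id": 1, "keypoints": [10, 20, 2] * 17},  # labeled
            {"image_id": 2, "keypoints": [0, 0, 0] * 17},    # all unlabeled
        ],
    }
    print(images_without_keypoints(sample))  # → [2]
```

For a real dataset you would load the JSON with `json.load(open(path))` and pass the resulting dict to the same function.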
@CPFelix I think this one is #347, so you could close this issue and try to get some help by posting your trace in #347.
@gadcam OK, thanks.
My final problem is the same as in #347; I will post my problem there.
E0605 15:10:48.727974 8764 net_dag.cc:188] Exception from operator chain starting at '' (type 'RoIAlignGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:327] error == cudaSuccess. 2 vs 0. Error at: /home/scau2/Downloads/pytorch/caffe2/core/context_gpu.cu:327: out of memory Error from operator: input: "gpu_0/fpn_res2_2_sum" input: "gpu_0/keypoint_rois_fpn2" input: "gpu_0/m9_shared" output: "gpu_0/m11_shared" name: "" type: "RoIAlignGradient" arg { name: "pooled_h" i: 14 } arg { name: "sampling_ratio" i: 2 } arg { name: "spatial_scale" f: 0.25 } arg { name: "pooled_w" i: 14 } device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true
WARNING workspace.py: 185: Original python traceback for operator 329
in network generalized_rcnn
in exception above (most recent call last):
Traceback (most recent call last):
File "tools/train_net.py", line 128, in
main()
File "tools/train_net.py", line 110, in main
checkpoints = detectron.utils.train.train_model()
File "/home/scau2/Downloads/Detectron-master/detectron/utils/train.py", line 65, in train_model
workspace.RunNet(model.net.Proto().name)
File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 217, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 178, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at context_gpu.cu:327] error == cudaSuccess. 2 vs 0. Error at: /home/scau2/Downloads/pytorch/caffe2/core/context_gpu.cu:327: out of memory Error from operator:
input: "gpu_0/fpn_res2_2_sum" input: "gpu_0/keypoint_rois_fpn2" input: "gpu_0/m9_shared" output: "gpu_0/m11_shared" name: "" type: "RoIAlignGradient" arg { name: "pooled_h" i: 14 } arg { name: "sampling_ratio" i: 2 } arg { name: "spatial_scale" f: 0.25 } arg { name: "pooled_w" i: 14 } device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true
This has troubled me for a few weeks; can anyone help me?
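Since the OOM is raised while backpropagating through RoIAlignGradient, reducing per-step GPU memory is the usual workaround. The key names below come from Detectron's detectron/core/config.py; the specific values are illustrative assumptions, not tuned recommendations:

```yaml
# Sketch of memory-reducing overrides for a Detectron training YAML
# (key names from detectron/core/config.py; values are assumptions).
TRAIN:
  IMS_PER_BATCH: 1        # fewer images per GPU per step
  SCALES: (500,)          # smaller short side than the common (800,)
  MAX_SIZE: 833           # cap the long side proportionally
  BATCH_SIZE_PER_IM: 64   # fewer RoIs sampled per image
```

Smaller input scales and fewer RoIs shrink both the FPN feature maps and the per-RoI buffers that RoIAlignGradient allocates.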