gpu out of memory issue

VDIGPKU / CBNet_caffe

Composite Backbone Network (AAAI20)

Apache License 2.0

408 stars, 78 forks

gpu out of memory issue #5

Closed: kingwpf closed this issue 5 years ago

kingwpf commented 5 years ago

What kind of GPU did you use to train this model? My GPU is an Nvidia 1080 Ti. I was trying to train a model using the config 'e2e_mask_cascade_rcnn_dual-X-152-32x8d-FPN-IN5k_1.44x.yaml', but even with the batch size set to 1, training still fails with an out-of-memory error:

[I net_async_base.h:206] Using specified CPU pool size: 16; device id: -1
[I net_async_base.h:211] Created new CPU pool, size: 16; device id: -1
[E net_async_base.cc:382] [enforce fail at context_gpu.cu:496] error == cudaSuccess. 2 vs 0. Error at: /pytorch/caffe2/core/context_gpu.cu:496: out of memory
Error from operator:
input: "gpu_0/res4_23_branch2c"
input: "gpu_0/res4_23_branch2c_bn_s"
input: "gpu_0/res4_23_branch2c_bn_b"
output: "gpu_0/res4_23_branch2c_bn"
name: ""
type: "AffineChannel"
device_option { device_type: 1 device_id: 0 }
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47

PKUbahuangliuhe commented 5 years ago

At least a P40. A 1080 Ti is risky even for training a single X-152.
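For anyone wanting to check a machine before launching training, here is a minimal sketch that reads each GPU's total memory from `nvidia-smi` and compares it against a threshold. The function names and the `REQUIRED_MIB` value are assumptions for illustration, not part of CBNet or Detectron: the thread only says "at least P40", and the threshold is set a little below the P40's nominal 24 GB because the total that `nvidia-smi` reports is usually somewhat lower than the nominal size.

```python
import subprocess

# Assumption: a rough stand-in for "at least a P40" (nominally 24 GB).
# Set slightly below 24 * 1024 MiB because the reported total is
# typically a bit less than the nominal capacity.
REQUIRED_MIB = 22000


def parse_gpu_memory(csv_text):
    """Parse 'name, memory.total' CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so the GPU name is kept whole.
        name, total_mib = line.rsplit(",", 1)
        gpus.append((name.strip(), int(total_mib)))
    return gpus


def query_gpus():
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_gpu_memory(out)


def has_enough_memory(total_mib, required_mib=REQUIRED_MIB):
    """Return True if the card meets the (assumed) memory threshold."""
    return total_mib >= required_mib
```

Calling `query_gpus()` on the training machine and passing each reported total to `has_enough_memory` gives a quick yes/no per card. It only compares total memory, so it cannot tell whether a particular config will actually fit.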

kingwpf commented 5 years ago

Hmm, thanks for your reply.