kingwpf closed this issue 5 years ago
What kind of GPU did you use to train this model? My GPU is an Nvidia 1080 Ti. I was trying to train a model using the config 'e2e_mask_cascade_rcnn_dual-X-152-32x8d-FPN-IN5k_1.44x.yaml', but even with the batch size set to 1, training still fails with an out-of-memory error:
[I net_async_base.h:206] Using specified CPU pool size: 16; device id: -1
[I net_async_base.h:211] Created new CPU pool, size: 16; device id: -1
[E net_async_base.cc:382] [enforce fail at context_gpu.cu:496] error == cudaSuccess. 2 vs 0. Error at: /pytorch/caffe2/core/context_gpu.cu:496: out of memory Error from operator:
input: "gpu_0/res4_23_branch2c"
input: "gpu_0/res4_23_branch2c_bn_s"
input: "gpu_0/res4_23_branch2c_bn_b"
output: "gpu_0/res4_23_branch2c_bn"
name: ""
type: "AffineChannel"
device_option { device_type: 1 device_id: 0 }
frame #0: c10::ThrowEnforceNotMet(char const, int, char const, std::string const&, void const*) + 0x47
You need at least a P40. A 1080 Ti is risky even for training a single X-152.
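For context, a 1080 Ti has 11 GB of memory and a P40 has 24 GB. To see what a given card actually has free before launching training, something like the following may help. It is only a rough sketch: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration. It does not estimate how much memory the dual X-152 config needs; it only reports what the GPU offers.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows of 'name, memory.total, memory.used' (values in MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({
            "name": name,
            "total_mib": int(total),
            "free_mib": int(total) - int(used),
        })
    return gpus


def query_gpu_memory():
    """Run nvidia-smi (assumed to be installed) and return one dict per GPU."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,memory.used",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_gpu_memory(out)
```

Calling `query_gpu_memory()` on the training machine returns a list with one entry per GPU, which makes it easy to check whether another process is already holding part of the memory.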
emm, thanks for your reply