facebookresearch / VMZ

VMZ: Model Zoo for Video Modeling
Apache License 2.0

Out of memory #92

Closed: harshit-jain-git closed this issue 5 years ago

harshit-jain-git commented 5 years ago

RuntimeError: [enforce fail at context_gpu.cu:496] error == cudaSuccess. 2 vs 0. Error at: /home/harshit_kunal/pytorch/caffe2/core/context_gpu.cu:496: out of memory
Error from operator:
input: "gpu_0/comp_9_conv_1"
input: "gpu_0/comp_9_spatbn_1_s"
input: "gpu_0/comp_9_spatbn_1_b"
input: "gpu_0/comp_9_spatbn_1_rm"
input: "gpu_0/comp_9_spatbn_1_riv"
output: "gpu_0/comp_9_spatbn_1"
name: ""
type: "SpatialBN"
arg { name: "epsilon" f: 0.001 }
arg { name: "cudnn_exhaustive_search" i: 1 }
arg { name: "is_test" i: 1 }
arg { name: "use_cudnn" i: 1 }
arg { name: "order" s: "NCHW" }
arg { name: "momentum" f: 0.9 }
device_option { device_type: 1 device_id: 0 }
frame #0: c10::ThrowEnforceNotMet(char const, int, char const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, void const*) + 0x78 (0x7f7974859208 in /home/harshit_kunal/pytorch/build/lib/libc10.so)
frame #1: + 0x2b7b233 (0x7f797781e233 in /home/harshit_kunal/pytorch/build/lib/libtorch.so)
frame #2: + 0x2c77d9e (0x7f797791ad9e in /home/harshit_kunal/pytorch/build/lib/libtorch.so)
frame #3: caffe2::empty(c10::ArrayRef, c10::TensorOptions) + 0x487 (0x7f79790a7567 in /home/harshit_kunal/pytorch/build/lib/libtorch.so)
frame #4: + 0x2d37de2 (0x7f79779dade2 in /home/harshit_kunal/pytorch/build/lib/libtorch.so)
frame #5: + 0x2d38c38 (0x7f79779dbc38 in /home/harshit_kunal/pytorch/build/lib/libtorch.so)
frame #6: + 0x2d45183 (0x7f79779e8183 in /home/harshit_kunal/pytorch/build/lib/libtorch.so)
frame #7: + 0x2f89153 (0x7f7977c2c153 in /home/harshit_kunal/pytorch/build/lib/libtorch.so)
frame #8: + 0x2d39ef0 (0x7f79779dcef0 in /home/harshit_kunal/pytorch/build/lib/libtorch.so)
frame #9: caffe2::AsyncNetBase::run(int, int) + 0x118 (0x7f7979037928 in /home/harshit_kunal/pytorch/build/lib/libtorch.so)
frame #10: + 0x439b71a (0x7f797903e71a in /home/harshit_kunal/pytorch/build/lib/libtorch.so)
frame #11: c10::ThreadPool::main_loop(unsigned long) + 0x2b3 (0x7f7974852543 in /home/harshit_kunal/pytorch/build/lib/libc10.so)
frame #12: + 0xc819d (0x7f798e7a919d in /home/harshit_kunal/anaconda3/envs/pytorch_py2/bin/../lib/libstdc++.so.6)
frame #13: + 0x76db (0x7f7990cf76db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #14: clone + 0x3f (0x7f799027b88f in /lib/x86_64-linux-gnu/libc.so.6)

I am getting this out-of-memory error even with a very small batch size while fine-tuning the model on HMDB. I am running on a single GPU with 11 GB of memory (RTX 2080 Ti). What should I do?

dutran commented 5 years ago

You can reduce the batch size, the input resolution, or both.
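To illustrate why both knobs help, here is a rough sketch of how the size of a single dense float32 activation tensor scales with batch size and spatial resolution. The function name `activation_bytes` and all of the shapes below are hypothetical and chosen only for illustration. They are not taken from VMZ, and a real network holds many such tensors plus weights and cuDNN workspace, so this is not an estimate of total GPU usage.

```python
def activation_bytes(batch, channels, frames, height, width, bytes_per_elem=4):
    """Bytes for one dense activation tensor of shape
    (batch, channels, frames, height, width); float32 by default."""
    return batch * channels * frames * height * width * bytes_per_elem


# Hypothetical clip shapes (assumptions, not VMZ defaults).
base = activation_bytes(batch=8, channels=64, frames=32, height=112, width=112)
half_batch = activation_bytes(batch=4, channels=64, frames=32, height=112, width=112)
half_res = activation_bytes(batch=8, channels=64, frames=32, height=56, width=56)

MIB = 2 ** 20
print("baseline:        %.0f MiB" % (base / MIB))
print("half batch:      %.0f MiB" % (half_batch / MIB))
print("half resolution: %.0f MiB" % (half_res / MIB))
```

Under these assumptions, halving the batch size halves the tensor, while halving both height and width cuts it to a quarter. That is why lowering the resolution can help even when the batch size is already very small.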