kampelmuehler opened this issue 6 years ago
I had the same kind of issue (except it was not random). I solved it by lowering the amount of memory required, by modifying the config.yaml. In my case, I changed the MAX_SIZE parameter in TRAIN from 1333 (the baselines value) to 833. I think you could also lower SCALES and BATCH_SIZE.
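A rough sketch of those overrides (assuming a checkout where the config package is importable as detectron.core.config; older checkouts expose it as core.config, and the values below are illustrative, not tuned):

```python
# Rough sketch: lower the memory-hungry input-size settings before training.
# Assumes Detectron's config API (merge_cfg_from_file / merge_cfg_from_list);
# the override values are illustrative only.
from detectron.core.config import cfg, merge_cfg_from_file, merge_cfg_from_list

merge_cfg_from_file('configs/04_2018_gn_baselines/scratch_e2e_mask_rcnn_R-50-FPN_3x_gn.yaml')
merge_cfg_from_list([
    'TRAIN.MAX_SIZE', '833',      # baselines use 1333; smaller images mean smaller feature maps
    'TRAIN.SCALES', '(600,)',     # illustrative; the baselines use (800,)
    'TRAIN.IMS_PER_BATCH', '1',   # images per GPU, one of the batch-size knobs
])
print(cfg.TRAIN.MAX_SIZE, cfg.TRAIN.SCALES, cfg.TRAIN.IMS_PER_BATCH)
```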
Hope it helps.
@francoto thanks for the input. Indeed, reducing the batch size could mitigate the problem, but it will also impact model performance. Also, the batch size easily fits inside the GPU memory, but at some random point during training (usually after ~16k iterations) the memory usage suddenly increases and training crashes, which is strange behavior. I haven't yet had time to look into what triggers context_gpu.cu to fire up, though.
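One way to narrow down what grabs the extra memory might be Caffe2's own GPU memory tracking. A rough sketch, e.g. by adapting the GlobalInit call in tools/train_net.py; note that the two extra flags are assumptions based on what context_gpu.cu of roughly this vintage defines, so treat the names with care:

```python
# Rough sketch: ask Caffe2 to log what context_gpu allocates, per blob, so the
# spike can be attributed to specific blobs. The two memory flags below are
# assumed to exist in this Caffe2 build (defined in caffe2/core/context_gpu.cu).
from caffe2.python import workspace

workspace.GlobalInit([
    'caffe2',
    '--caffe2_log_level=0',
    '--caffe2_gpu_memory_tracking=1',   # assumed flag: log GPU memory usage by blob
    '--caffe2_cuda_memory_pool=cub',    # assumed flag: use the CUB caching allocator
])
```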
The problem occurs for me when I run:
~/detectron$ CUDA_VISIBLE_DEVICES=0 python2 tools/train_net.py --cfg configs/04_2018_gn_baselines/scratch_e2e_mask_rcnn_R-50-FPN_3x_gn.yaml OUTPUT_DIR ~/tmp/detectron-output
Found Detectron ops lib: /home/intern/usr/local/lib/libcaffe2_detectron_ops_gpu.so
Found Detectron ops lib: /home/intern/usr/local/lib/libcaffe2_detectron_ops_gpu.so
E0504 22:55:48.136441  8525 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0504 22:55:48.136483  8525 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0504 22:55:48.136489  8525 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO train_net.py:  95: Called with args:
INFO train_net.py:  96: Namespace(cfg_file='configs/04_2018_gn_baselines/scratch_e2e_mask_rcnn_R-50-FPN_3x_gn.yaml', multi_gpu_testing=False, opts=['OUTPUT_DIR', '/home/intern/tmp/detectron-output'], skip_test=False)
INFO train_net.py: 102: Training with config:
INFO train_net.py: 103: {'BBOX_XFORM_CLIP': 4.135166556742356,
...
INFO train.py: 131: Building model: generalized_rcnn
WARNING cnn.py: 25: [====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
WARNING memonger.py: 55: NOTE: Executing memonger to optimize gradient memory
I0504 22:55:51.732862  8525 memonger.cc:236] Remapping 140 using 24 shared blobs.
...
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at context_gpu.h:156] . Encountered CUDA error: invalid device ordinal
*** Aborted at 1525445757 (unix time) try "date -d @1525445757" if you are using GNU date ***
PC: @ 0x7fa26afec428 gsignal
*** SIGABRT (@0x3f40000214d) received by PID 8525 (TID 0x7fa26c4da740) from PID 8525; stack trace: ***
    @ 0x7fa26baa2390 (unknown)
    @ 0x7fa26afec428 gsignal
    @ 0x7fa26afee02a abort
    @ 0x7fa26ad12b39 __gnu_cxx::__verbose_terminate_handler()
    @ 0x7fa26ad111fb __cxxabiv1::__terminate()
    @ 0x7fa26ad10640 __cxa_call_terminate
    @ 0x7fa26ad10e6f __gxx_personality_v0
    @ 0x7fa26aa77564 _Unwind_RaiseException_Phase2
    @ 0x7fa26aa7781d _Unwind_RaiseException
    @ 0x7fa26ad11409 __cxa_throw
    @ 0x7fa25379a109 caffe2::CUDAContext::~CUDAContext()
    @ 0x7fa253939412 caffe2::Operator<>::~Operator()
    @ 0x7fa2539e1bee caffe2::FillerOp<>::~FillerOp()
    @ 0x7fa2539e58f6 caffe2::XavierFillOp<>::~XavierFillOp()
    @ 0x7fa2539e5926 caffe2::XavierFillOp<>::~XavierFillOp()
    @ 0x7fa252801809 std::vector<>::~vector()
    @ 0x7fa2527fffcf caffe2::SimpleNet::SimpleNet()
    @ 0x7fa2527cb1a6 caffe2::CreateNet()
    @ 0x7fa2527cb8fd caffe2::CreateNet()
    @ 0x7fa252835532 caffe2::Workspace::RunNetOnce()
    @ 0x7fa25525e1ba _ZZN6caffe26python16addGlobalMethodsERN8pybind116moduleEENKUlRKNS1_5bytesEE28clES6.isra.2767.constprop.2859
    @ 0x7fa25525e455 _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKNS_5bytesEE28_bJS8_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4FUNESQ
    @ 0x7fa25528b24d pybind11::cpp_function::dispatcher()
    @ 0x7fa26bd8f9c0 PyEval_EvalFrameEx
    @ 0x7fa26bd92519 PyEval_EvalCodeEx
    @ 0x7fa26bd8f4b2 PyEval_EvalFrameEx
    @ 0x7fa26bd92519 PyEval_EvalCodeEx
    @ 0x7fa26bd8f4b2 PyEval_EvalFrameEx
    @ 0x7fa26bd92519 PyEval_EvalCodeEx
    @ 0x7fa26bd8f4b2 PyEval_EvalFrameEx
Aborted (core dumped)
I will add my environment info later.
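Looking at the trace again, this failure does not look like an out-of-memory: "invalid device ordinal" usually means Caffe2 tried to allocate on a GPU id that is not visible, and if I remember correctly the gn baselines are written as 8-GPU configs (NUM_GPUS: 8), while CUDA_VISIBLE_DEVICES=0 exposes only one device. A quick check, as a minimal sketch (assuming workspace.NumCudaDevices() is available in this Caffe2 build and detectron.core.config is importable; older checkouts expose core.config):

```python
# Minimal sketch: compare how many GPUs the config expects with how many CUDA
# devices Caffe2 can actually see under the current CUDA_VISIBLE_DEVICES.
from caffe2.python import workspace
from detectron.core.config import cfg, merge_cfg_from_file

merge_cfg_from_file('configs/04_2018_gn_baselines/scratch_e2e_mask_rcnn_R-50-FPN_3x_gn.yaml')

visible = workspace.NumCudaDevices()
print('config NUM_GPUS: %d, visible CUDA devices: %d' % (cfg.NUM_GPUS, visible))
if cfg.NUM_GPUS > visible:
    print('Training would touch GPU ids >= %d, hence "invalid device ordinal".' % visible)
    # Possible workaround sketch: append "NUM_GPUS 1" as a config override on the
    # train_net.py command line, or expose more GPUs via CUDA_VISIBLE_DEVICES.
```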
Same problem here, but I have 100 GB of free space... Is there separate memory used by my GPU? Should I get a better GPU? Btw, I already trained the model once and it didn't give me this error.
This problem occurs for me very randomly. The network (in this case RetinaNet) is training just fine, when at a random number of iterations context_gpu.cu fires up and seems to eat up the GPU memory, such that training is halted with an out-of-memory error. We're using Ubuntu 16.04 with Pascal GPUs. It happens on several machines, with different numbers of GPUs (1-4), and when training different network architectures.
Any thoughts?
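In the meantime, one way to at least pin down when the memory jumps is to log GPU memory alongside the training run; a minimal sketch that polls nvidia-smi once per second (assumes nvidia-smi is on PATH; the interval, GPU index, and output file are arbitrary):

```python
# Minimal sketch: poll nvidia-smi and append used-memory readings to a CSV,
# so the iteration where memory jumps can be correlated with the training log.
# Stop with Ctrl+C; GPU index 0 and the 1 s interval are arbitrary choices.
import csv
import subprocess
import time

def used_mib(gpu_index=0):
    out = subprocess.check_output([
        'nvidia-smi',
        '--query-gpu=memory.used',
        '--format=csv,noheader,nounits',
        '-i', str(gpu_index),
    ])
    return int(out.decode().strip())

with open('gpu_mem_log.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['timestamp', 'memory_used_mib'])
    while True:
        writer.writerow([time.time(), used_mib(0)])
        f.flush()
        time.sleep(1)
```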