facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.

Errors while training faster-rcnn with RTX 2080 ti #1000

Open mmahdavian opened 4 years ago

mmahdavian commented 4 years ago

Hello

I am trying to train a Faster R-CNN model with the "e2e_faster_rcnn_R-50-FPN_1x.yaml" config. I only have one GPU, so I changed the GPU count in that file. When I start the training script, it gives me this error:


[E net_async_base.cc:382] [enforce fail at context_gpu.cu:524] error == cudaSuccess. 2 vs 0. Error at: /home/sahar/Mohammad_ws/pytorch/caffe2/core/context_gpu.cu:524: out of memory (Error from operator: 
input: "gpu_0/fpn_inner_res3_3_sum" input: "gpu_0/__m14_shared" output: "_gpu_0/fpn_inner_res3_3_sum_grad_autosplit_0" name: "" type: "UpsampleNearestGradient" arg { name: "scale" i: 2 } device_option { device_type: 1 device_id: 0 } is_gradient_op: true)
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*) + 0x67 (0x7f65c798aee7 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: <unknown function> + 0x25043fb (0x7f657113f3fb in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x1193d7 (0x7f64fb2ce3d7 in /usr/local/lib/python3.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so)
frame #3: caffe2::UpsampleNearestGradientOp<float, caffe2::CUDAContext>::RunOnDevice() + 0x2b5 (0x7f64fb2e0a65 in /usr/local/lib/python3.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so)
frame #4: <unknown function> + 0x122e15 (0x7f64fb2d7e15 in /usr/local/lib/python3.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so)
frame #5: caffe2::AsyncNetBase::run(int, int) + 0x185 (0x7f65857e0555 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1fc9e8a (0x7f6585742e8a in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cpu.so)
frame #7: c10::ThreadPool::main_loop(unsigned long) + 0x2fb (0x7f65c797df8b in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #8: <unknown function> + 0xb8c80 (0x7f65cf4e4c80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #9: <unknown function> + 0x76ba (0x7f65d686f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #10: clone + 0x6d (0x7f65d5e9541d in /lib/x86_64-linux-gnu/libc.so.6)
,  op UpsampleNearestGradient
[E net_async_base.cc:382] [enforce fail at context_gpu.cu:524] error == cudaSuccess. 2 vs 0. Error at: /home/sahar/Mohammad_ws/pytorch/caffe2/core/context_gpu.cu:524: out of memory (Error from operator: 
input: "gpu_0/res2_2_sum" input: "gpu_0/fpn_inner_res2_2_sum_lateral_w" input: "gpu_0/__m14_shared" output: "gpu_0/fpn_inner_res2_2_sum_lateral_w_grad" output: "gpu_0/fpn_inner_res2_2_sum_lateral_b_grad" output: "gpu_0/__m13_shared" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "pad" i: 0 } arg { name: "stride" i: 1 } arg { name: "exhaustive_search" i: 0 } device_option { device_type: 1 device_id: 0 } engine: "CUDNN" is_gradient_op: true)
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*) + 0x67 (0x7f65c798aee7 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: <unknown function> + 0x25043fb (0x7f657113f3fb in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x1fc41b7 (0x7f658573d1b7 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cpu.so)
frame #3: caffe2::empty(c10::ArrayRef<long>, c10::TensorOptions) + 0x467 (0x7f658578cd17 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cpu.so)
frame #4: caffe2::ReinitializeTensor(caffe2::Tensor*, c10::ArrayRef<long>, c10::TensorOptions) + 0x1f3 (0x7f658578d0f3 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2e19188 (0x7f6571a54188 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x277f713 (0x7f65713ba713 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x26da937 (0x7f6571315937 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cuda.so)
frame #8: caffe2::AsyncNetBase::run(int, int) + 0x185 (0x7f65857e0555 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x1fc9e8a (0x7f6585742e8a in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cpu.so)
frame #10: c10::ThreadPool::main_loop(unsigned long) + 0x2fb (0x7f65c797df8b in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #11: <unknown function> + 0xb8c80 (0x7f65cf4e4c80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #12: <unknown function> + 0x76ba (0x7f65d686f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #13: clone + 0x6d (0x7f65d5e9541d in /lib/x86_64-linux-gnu/libc.so.6)
,  op ConvGradient
[E net_async_base.cc:134] Rethrowing exception from the run of 'generalized_rcnn'
WARNING workspace.py: 223: Original python traceback for operator `344` in network `generalized_rcnn` in exception above (most recent call last):
Traceback (most recent call last):
  File "tools/train_net.py", line 143, in <module>
    main()
  File "tools/train_net.py", line 125, in main
    checkpoints = detectron.utils.train.train_model()
  File "/home/sahar/Mohammad_ws/detectron/detectron/utils/train.py", line 67, in train_model
    workspace.RunNet(model.net.Proto().name)
  File "/usr/local/lib/python3.7/site-packages/caffe2/python/workspace.py", line 255, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/usr/local/lib/python3.7/site-packages/caffe2/python/workspace.py", line 216, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at context_gpu.cu:524] error == cudaSuccess. 2 vs 0. Error at: /home/sahar/Mohammad_ws/pytorch/caffe2/core/context_gpu.cu:524: out of memory (Error from operator: 
input: "gpu_0/fpn_inner_res3_3_sum" input: "gpu_0/__m14_shared" output: "_gpu_0/fpn_inner_res3_3_sum_grad_autosplit_0" name: "" type: "UpsampleNearestGradient" arg { name: "scale" i: 2 } device_option { device_type: 1 device_id: 0 } is_gradient_op: true)
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*) + 0x67 (0x7f65c798aee7 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: <unknown function> + 0x25043fb (0x7f657113f3fb in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x1193d7 (0x7f64fb2ce3d7 in /usr/local/lib/python3.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so)
frame #3: caffe2::UpsampleNearestGradientOp<float, caffe2::CUDAContext>::RunOnDevice() + 0x2b5 (0x7f64fb2e0a65 in /usr/local/lib/python3.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so)
frame #4: <unknown function> + 0x122e15 (0x7f64fb2d7e15 in /usr/local/lib/python3.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so)
frame #5: caffe2::AsyncNetBase::run(int, int) + 0x185 (0x7f65857e0555 in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1fc9e8a (0x7f6585742e8a in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libtorch_cpu.so)
frame #7: c10::ThreadPool::main_loop(unsigned long) + 0x2fb (0x7f65c797df8b in /usr/local/lib/python3.7/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #8: <unknown function> + 0xb8c80 (0x7f65cf4e4c80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #9: <unknown function> + 0x76ba (0x7f65d686f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #10: clone + 0x6d (0x7f65d5e9541d in /lib/x86_64-linux-gnu/libc.so.6)
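
For reference, the edit I made in the yaml should be equivalent to overriding the config through Detectron's config API, roughly as sketched below (the config path is the stock one in the repo; nothing else is changed):

# Rough sketch: my single-GPU change, expressed as a programmatic config
# override instead of editing the yaml directly.
from detectron.core.config import (
    assert_and_infer_cfg,
    cfg,
    merge_cfg_from_file,
    merge_cfg_from_list,
)

# Stock 1x Faster R-CNN FPN config shipped with Detectron (NUM_GPUS is 8 there).
merge_cfg_from_file('configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_1x.yaml')
merge_cfg_from_list(['NUM_GPUS', '1'])   # only one RTX 2080 Ti available
assert_and_infer_cfg(cache_urls=False)   # skip caching the pretrained-weight URLs

# Sanity check of the effective settings before training.
print(cfg.NUM_GPUS, cfg.TRAIN.IMS_PER_BATCH, cfg.TRAIN.SCALES)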

I read somewhere that it may be caused by the scale value, so I decreased the training scale from (800,) to (400,), and then I get this error:

CRITICAL train.py:  98: Loss is NaN
INFO loader.py: 126: Stopping enqueue thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
Traceback (most recent call last):
  File "tools/train_net.py", line 143, in <module>
    main()
  File "tools/train_net.py", line 125, in main
    checkpoints = detectron.utils.train.train_model()
  File "/home/sahar/Mohammad_ws/detectron/detectron/utils/train.py", line 86, in train_model
    handle_critical_error(model, 'Loss is NaN')
  File "/home/sahar/Mohammad_ws/detectron/detectron/utils/train.py", line 100, in handle_critical_error
    raise Exception(msg)
Exception: Loss is NaN
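
For completeness, the scale reduction I tried corresponds to the override below (same sketch as above, with the scale added; I am assuming TRAIN.SCALES is the key behind the (800,) value in the 1x config):

# Rough sketch of the scale reduction: TRAIN.SCALES goes from (800,) to (400,).
from detectron.core.config import (
    assert_and_infer_cfg,
    merge_cfg_from_file,
    merge_cfg_from_list,
)

merge_cfg_from_file('configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_1x.yaml')
merge_cfg_from_list([
    'NUM_GPUS', '1',
    'TRAIN.SCALES', '(400,)',   # stock config uses (800,)
])
assert_and_infer_cfg(cache_urls=False)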

Do you know what might be the problem?

System information