NVlabs / STEP

STEP: Spatio-Temporal Progressive Learning for Video Action Detection. CVPR'19 (Oral)

Runtime Error: Segmentation fault #10

Open TriLoo opened 4 years ago

TriLoo commented 4 years ago

I ran demo.py on my own images and got this segmentation fault. No other information is displayed.

I have located the position that causes the error: it is the call to the RoI layer. However, when I tested ROIAlign_cuda.cu directly, passing float * arguments instead of PyTorch tensors, no error was raised.

My gcc version is 4.8.5. Is the gcc version critical? Any advice? Thanks.
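
Since the question turns on which compiler and PyTorch build are in play, a small script that collects those versions in one place may help when comparing setups. This is only a sketch: `collect_build_environment` is a hypothetical helper, not part of STEP, and it reports the gcc found on PATH, which is an assumption and may differ from the compiler that actually built the extension.

```python
import platform
import shutil
import subprocess


def collect_build_environment():
    """Gather version strings that matter when a compiled extension crashes."""
    info = {"python": platform.python_version()}

    # gcc may be missing from PATH; record that instead of raising.
    gcc = shutil.which("gcc")
    if gcc is None:
        info["gcc"] = "not found"
    else:
        result = subprocess.run(
            [gcc, "-dumpversion"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        info["gcc"] = result.stdout.strip()

    # torch is optional here so the script still runs on a machine without it.
    try:
        import torch

        info["torch"] = str(torch.__version__)
        info["cuda (torch build)"] = str(torch.version.cuda)
    except ImportError:
        info["torch"] = "not installed"

    return info


if __name__ == "__main__":
    for name, value in collect_build_environment().items():
        print("{}: {}".format(name, value))
```

Pasting the printed lines into the issue gives everyone the same baseline to compare against.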

xyang35 commented 4 years ago

Thanks for your interest in our work! How many GPUs are you using for your job? Have you tried using 1 GPU?

TriLoo commented 4 years ago

My server has 8 P40 GPUs. I tried using only one GPU (cuda:0) and the same error occurred.

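I used either

import os
# The variable name is plural, and it must be set before CUDA is initialized.
os.environ['CUDA_VISIBLE_DEVICES'] = "0"

# OR

import torch
torch.cuda.set_device(0)

to use cuda:0 only.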

Also, I manually set the gpu_count to 0.
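
To double-check that the process really sees a single GPU, one option is to set the environment variable before anything touches CUDA and then ask PyTorch how many devices are visible. The sketch below assumes that approach; `pin_visible_gpu` and `visible_gpu_count` are hypothetical helper names. Note that the variable is spelled `CUDA_VISIBLE_DEVICES` (plural), and a misspelled name is silently ignored.

```python
import os


def pin_visible_gpu(index=0):
    """Expose a single physical GPU to this process (hypothetical helper)."""
    # Must run before the first CUDA call; the variable name ends in "S".
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


def visible_gpu_count():
    """Return how many GPUs PyTorch can see, or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()


if __name__ == "__main__":
    pin_visible_gpu(0)
    print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
    print("GPUs visible to PyTorch:", visible_gpu_count())
```

If the reported count is still larger than one on a multi-GPU machine, the variable was probably set too late or under a different name.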

The top few frames of the call stack stored in the core file are shown below:

#0  0x00007f6df6eb13ac in construct<_object*, _object*> (__p=0xb, this=0x7f6e4ab5c318) at /usr/include/c++/4.8.2/ext/new_allocator.h:120
#1  _S_construct<_object*, _object*> (__p=0xb, __a=...) at /usr/include/c++/4.8.2/bits/alloc_traits.h:254
#2  construct<_object*, _object*> (__p=0xb, __a=...) at /usr/include/c++/4.8.2/bits/alloc_traits.h:393
#3  emplace_back<_object*> (this=0x7f6e4ab5c318) at /usr/include/c++/4.8.2/bits/vector.tcc:96
#4  push_back (__x=<unknown type in /search/odin/songminghui/githubs/STEP/external/maskrcnn_benchmark/roi_layers/_C.cpython-36m-x86_64-linux-gnu.so, CU 0x0, DIE 0x12877a>, this=0x7f6e4ab5c318) at /usr/include/c++/4.8.2/bits/stl_vector.h:920
#5  loader_life_support (this=0x7ffd998f01f0) at /search/odin/songminghui/anaconda3/lib/python3.6/site-packages/torch/lib/include/pybind11/cast.h:44
#6  pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7f6deee3be28, kwargs_in=0x0) at /search/odin/songminghui/anaconda3/lib/python3.6/site-packages/torch/lib/include/pybind11/pybind11.h:618
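
The core file shows the native frames only. To also see which Python line made the failing call, the standard-library `faulthandler` module can dump the Python traceback when the process receives a fatal signal. A minimal sketch, assuming it is called at the very top of demo.py; `enable_crash_traceback` and the log file name are hypothetical:

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Write Python tracebacks for all threads to log_path on a fatal signal."""
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this handle open for as long as the handler is active.
    return log_file


if __name__ == "__main__":
    handle = enable_crash_traceback("step_crash_traceback.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

Running the script with `python -X faulthandler demo.py` enables the same handler without code changes and writes the traceback to stderr.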

quanh1990 commented 4 years ago

@TriLoo Hello, I hit the same problem using one Tesla P100 GPU with 16 GB of memory. Did you solve it? Looking forward to your reply!

TriLoo commented 4 years ago

Sorry, not yet... It may be caused by the PyTorch version or the gcc version, but I am not sure.

By the way, my PyTorch version is 1.0.2.

quanh1990 commented 4 years ago

@xyang35 Could you please provide your version of gcc? Thanks a lot.