TriLoo opened this issue 4 years ago
Thanks for your interest in our work! How many GPUs are you using for your job? Have you tried using 1 GPU?
My server has 8 P40 GPUs. I tried using just one GPU (`cuda:0`) and the same error happened.
I used

```python
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
# OR
torch.cuda.set_device(0)
```

to use `cuda:0` only.
Also, I manually set the `gpu_count` to 0.
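Note that the CUDA runtime reads `CUDA_VISIBLE_DEVICES` (plural); `CUDA_VISIBLE_DEVICE` is silently ignored, and the variable must be in the environment before the process first initializes CUDA (i.e. before the first `import torch`). A minimal stdlib-only sketch that guarantees this ordering by setting the variable in a child process's environment (the inline `python -c` below stands in for `demo.py`):

```python
import os
import subprocess
import sys

# Build an environment where only GPU 0 is visible. Because the variable is
# set before the child process starts, CUDA cannot have been initialized yet.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")

# The child simply echoes the variable; a real run would launch demo.py instead.
out = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env, capture_output=True, text=True,
)
print(out.stdout.strip())  # -> 0
```

Setting `os.environ` inside the same script works too, but only if it happens above the first `torch` import; doing it after the import has no effect.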
The top few frames of the call stack stored in the core file are shown below:
```
#0 0x00007f6df6eb13ac in construct<_object*, _object*> (__p=0xb, this=0x7f6e4ab5c318) at /usr/include/c++/4.8.2/ext/new_allocator.h:120
#1 _S_construct<_object*, _object*> (__p=0xb, __a=...) at /usr/include/c++/4.8.2/bits/alloc_traits.h:254
#2 construct<_object*, _object*> (__p=0xb, __a=...) at /usr/include/c++/4.8.2/bits/alloc_traits.h:393
#3 emplace_back<_object*> (this=0x7f6e4ab5c318) at /usr/include/c++/4.8.2/bits/vector.tcc:96
#4 push_back (__x=<unknown type in /search/odin/songminghui/githubs/STEP/external/maskrcnn_benchmark/roi_layers/_C.cpython-36m-x86_64-linux-gnu.so, CU 0x0, DIE 0x12877a>, this=0x7f6e4ab5c318) at /usr/include/c++/4.8.2/bits/stl_vector.h:920
#5 loader_life_support (this=0x7ffd998f01f0) at /search/odin/songminghui/anaconda3/lib/python3.6/site-packages/torch/lib/include/pybind11/cast.h:44
#6 pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7f6deee3be28, kwargs_in=0x0) at /search/odin/songminghui/anaconda3/lib/python3.6/site-packages/torch/lib/include/pybind11/pybind11.h:618
```
@TriLoo Hello, I met the same problem. I used one Tesla P100 GPU with 16 GB memory. Did you solve the problem? Looking forward to your reply!
Sorry, not yet... It may be caused by the PyTorch version or the gcc version, but I am not sure. By the way, my PyTorch version is 1.0.2.
@xyang35 Could you please provide your version of gcc? Thanks a lot.
I used my own image data with `demo.py` and got this error, with no other info displayed. I have located the position causing the error: it is the ROI layer call. However, when I tested `ROIAlign_cuda.cu` with `float *` parameters instead of PyTorch `Tensor`s, no errors were raised. My gcc version is 4.8.5. Is the gcc version critical? Any advice? Thanks.
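One way to narrow this down from the Python side is to feed the ROI layer small, well-formed inputs and verify the preconditions the CUDA kernel assumes. This is a hedged sketch: the `ROIAlign` import path and constructor arguments follow the usual maskrcnn_benchmark repository layout, the shapes are illustrative, and the checks run on CPU since they are device-agnostic:

```python
import torch

# Illustrative inputs: a 1x256x14x14 feature map and two ROIs in
# (batch_index, x1, y1, x2, y2) format, the layout the kernel expects.
features = torch.randn(1, 256, 14, 14)
rois = torch.tensor([[0, 0.0, 0.0, 6.0, 6.0],
                     [0, 2.0, 2.0, 10.0, 10.0]])

# The custom kernel walks raw data pointers, so a non-contiguous or
# wrong-dtype tensor can corrupt memory instead of raising a Python error.
assert features.is_contiguous() and rois.is_contiguous()
assert features.dtype == torch.float32 and rois.dtype == torch.float32

# Hypothetical call site, assuming the repo's layout (requires a GPU):
# from maskrcnn_benchmark.layers import ROIAlign
# pooled = ROIAlign((7, 7), spatial_scale=1.0, sampling_ratio=2)(
#     features.cuda(), rois.cuda())
```

If the call only crashes once real detector inputs flow through, comparing their dtype, device, and contiguity against a minimal case like this can separate an input problem from a build problem (e.g. a gcc/ABI mismatch between the extension and the PyTorch binaries).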