mguillau closed this 5 years ago
Can you run a backtrace in gdb and post the stack trace?
Here's the output of backtrace:
(gdb) run test_shadow_light.py
Starting program: /home/ubuntu/miniconda3/bin/python test_shadow_light.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffa2ae4700 (LWP 1132)]
[New Thread 0x7fff894f7700 (LWP 1133)]
Scene construction, time: 0.07212 s
[New Thread 0x7fff88bf6700 (LWP 1134)]
[New Thread 0x7fff82818700 (LWP 1135)]
[New Thread 0x7fff80d02700 (LWP 1136)]
Thread 1 "python" received signal SIGBUS, Bus error.
ChannelInfo::ChannelInfo (this=0x7fffffffb890, channels=..., use_gpu=<optimized out>) at /home/ubuntu/src/redner/channels.cpp:25
25 this->channels[i] = channels[i];
(gdb) bt
#0 ChannelInfo::ChannelInfo (this=0x7fffffffb890, channels=..., use_gpu=<optimized out>) at /home/ubuntu/src/redner/channels.cpp:25
#1 0x00007fffa19f6a81 in render (scene=..., options=..., rendered_image=..., d_rendered_image=..., d_scene=..., debug_image=...)
at /home/ubuntu/src/redner/pathtracer.cpp:390
#2 0x00007fffa1976e88 in pybind11::detail::argument_loader<Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float> >::call_impl<void, void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, pybind11::detail::void_type>(void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul>, pybind11::detail::void_type&&) (f=<optimized out>, this=0x7fffffffcd70)
at /home/ubuntu/miniconda3/include/python3.7m/pybind11/cast.h:1874
#3 pybind11::detail::argument_loader<Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float> >::call<void, pybind11::detail::void_type, void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>)>(void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>)) && (f=<optimized out>, this=<optimized out>)
at /home/ubuntu/miniconda3/include/python3.7m/pybind11/cast.h:1856
#4 void pybind11::cpp_function::initialize<void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), void, Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>, pybind11::name, pybind11::scope, pybind11::sibling, char [1]>(void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), void (*)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [1])::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const (call=...,
__closure=0x0) at /home/ubuntu/miniconda3/include/python3.7m/pybind11/pybind11.h:154
#5 void pybind11::cpp_function::initialize<void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), void, Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>, pybind11::name, pybind11::scope, pybind11::sibling, char [1]>(void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), void (*)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [1])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
at /home/ubuntu/miniconda3/include/python3.7m/pybind11/pybind11.h:132
#6 0x00007fffa193bfcc in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7ffff6d7ed08, kwargs_in=0x0)
at /home/ubuntu/miniconda3/include/python3.7m/pybind11/pybind11.h:627
#7 0x00005555556cd6e4 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:690
#8 0x00005555556cd801 in _PyCFunction_FastCallKeywords (func=0x7fffa3ece750, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>)
at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:730
#9 0x00005555557292bc in call_function (kwnames=0x0, oparg=6, pp_stack=<synthetic pointer>)
at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4568
#10 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3093
#11 0x000055555566a4f9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3930
#12 0x000055555566b5d5 in _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:376
#13 0x00007fffe8cc9ce9 in THPFunction_apply(_object*, _object*) ()
from /home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#14 0x0000555555690be7 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x7fff8a2f41b0)
at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:768
#15 PyCFunction_Call () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:784
#16 0x000055555572a151 in do_call_core (kwdict=0x0, callargs=0x555557e38468, func=0x7fff8a2f41b0)
at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4641
#17 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3191
#18 0x000055555566a4f9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3930
#19 0x000055555566b3c4 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3959
#20 0x000055555566b3ec in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:524
#21 0x0000555555783874 in run_mod () at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:1035
#22 0x000055555578db81 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:988
#23 0x000055555578dd73 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:429
#24 0x000055555578ee5f in pymain_run_file (p_cf=0x7fffffffd9e0, filename=0x5555558c63e0 L"test_shadow_light.py", fp=0x555555948360)
at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:427
#25 pymain_run_filename (cf=0x7fffffffd9e0, pymain=0x7fffffffdaf0) at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:1627
#26 pymain_run_python (pymain=0x7fffffffdaf0) at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:2877
#27 pymain_main () at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:3038
#28 0x000055555578ef7c in _Py_UnixMain () at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:3073
#29 0x00007ffff7810830 in __libc_start_main (main=0x55555564aed0 <main>, argc=2, argv=0x7fffffffdc48, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffdc38) at ../csu/libc-start.c:291
#30 0x0000555555734122 in _start () at ../sysdeps/x86_64/elf/start.S:103
Then I tried setting CUDA_LAUNCH_BLOCKING=1, and this actually circumvents the issue. Is that an acceptable solution, or does it come with compromises (e.g. performance)?
It's indeed a synchronization issue. Most likely we access unified memory on the CPU while a GPU kernel is still executing. This only results in a segmentation fault or bus error on pre-Pascal devices, so I didn't notice it. I pushed a fix; does the latest commit fix your problem?
Using CUDA_LAUNCH_BLOCKING=1 indeed compromises performance since redner launches a lot of kernels during rendering. It is good for debugging though.
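For anyone who wants to use the variable purely as a debugging aid, here is a minimal sketch of one way to do it from Python. It assumes the script under test (e.g. test_shadow_light.py) is launched as a child process, so that CUDA_LAUNCH_BLOCKING is already in the environment before any CUDA context exists. The helper names `blocking_env` and `run_with_blocking_launches` are made up for illustration and are not part of redner.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of an environment mapping with CUDA_LAUNCH_BLOCKING=1 added.

    The original mapping (os.environ by default) is left untouched.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(script, *args):
    """Run `script` in a child Python process with synchronous kernel launches.

    Hypothetical helper: the child inherits the modified environment, so the
    variable is set before the CUDA runtime is initialized in that process.
    """
    return subprocess.run([sys.executable, script, *args], env=blocking_env())


# Example (debugging only, rendering will be slower):
# result = run_with_blocking_launches("test_shadow_light.py")
# print(result.returncode)
```

If the crash disappears under this setting, that points toward a missing synchronization rather than, say, an out-of-bounds write, which is how it was used in this thread.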
Yes, it works. Thanks for the swift fix!
I'm hitting a bus error that looks related to issue #3 but isn't related to assertions.
First, running an example line by line, the crash happens at:
img = render(0, *args)
So I followed the same steps as described in #3. Running via gdb gives:
My setup:
Might it be another synchronization issue? Thanks in advance!