BachiLi / redner

Differentiable rendering without approximation.
https://people.csail.mit.edu/tzumao/diffrt/
MIT License
1.39k stars 139 forks

Bus error #42

Closed by mguillau 5 years ago

mguillau commented 5 years ago

I'm hitting a bus error that looks related to issue #3 but does not involve assertions.

Stepping through an example line by line shows that the crash happens at img = render(0, *args).

So I followed the same steps as described in #3. Running under gdb gives:

(gdb) run test_shadow_light.py
Starting program: /home/ubuntu/miniconda3/bin/python test_shadow_light.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffa2ae7700 (LWP 19335)]
[New Thread 0x7fff894f9700 (LWP 19336)]
Scene construction, time: 0.07067 s
[New Thread 0x7fff836e7700 (LWP 19337)]
[New Thread 0x7fff8236b700 (LWP 19338)]
[New Thread 0x7fff81b6a700 (LWP 19339)]

Thread 1 "python" received signal SIGBUS, Bus error.
ChannelInfo::ChannelInfo (this=0x7fffffffb760, channels=..., use_gpu=<optimized out>) at /home/ubuntu/src/redner/channels.cpp:25
25              this->channels[i] = channels[i];
(gdb) p channels
$1 = (const std::vector<Channels, std::allocator<Channels> > &) @0x5555a8b80520: {<std::_Vector_base<Channels, std::allocator<Channels> >> = {
    _M_impl = {<std::allocator<Channels>> = {<__gnu_cxx::new_allocator<Channels>> = {<No data fields>}, <No data fields>}, 
      _M_start = 0x555557e38d60, _M_finish = 0x555557e38d64, _M_end_of_storage = 0x555557e38d64}}, <No data fields>}
(gdb) p i
$2 = 1
(gdb) p *this
$3 = {channels = 0xb02729000, num_channels = 1, num_total_dimensions = 3, radiance_dimension = 0, use_gpu = true}

My setup:

Could this be another synchronization issue? Thanks in advance!

BachiLi commented 5 years ago

Can you run backtrace in gdb and post the stack trace?

mguillau commented 5 years ago

Here's the output of backtrace:

(gdb) run test_shadow_light.py
Starting program: /home/ubuntu/miniconda3/bin/python test_shadow_light.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffa2ae4700 (LWP 1132)]
[New Thread 0x7fff894f7700 (LWP 1133)]
Scene construction, time: 0.07212 s
[New Thread 0x7fff88bf6700 (LWP 1134)]
[New Thread 0x7fff82818700 (LWP 1135)]
[New Thread 0x7fff80d02700 (LWP 1136)]

Thread 1 "python" received signal SIGBUS, Bus error.
ChannelInfo::ChannelInfo (this=0x7fffffffb890, channels=..., use_gpu=<optimized out>) at /home/ubuntu/src/redner/channels.cpp:25
25              this->channels[i] = channels[i];
(gdb) bt
#0  ChannelInfo::ChannelInfo (this=0x7fffffffb890, channels=..., use_gpu=<optimized out>) at /home/ubuntu/src/redner/channels.cpp:25
#1  0x00007fffa19f6a81 in render (scene=..., options=..., rendered_image=..., d_rendered_image=..., d_scene=..., debug_image=...)
    at /home/ubuntu/src/redner/pathtracer.cpp:390
#2  0x00007fffa1976e88 in pybind11::detail::argument_loader<Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float> >::call_impl<void, void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, pybind11::detail::void_type>(void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul>, pybind11::detail::void_type&&) (f=<optimized out>, this=0x7fffffffcd70)
    at /home/ubuntu/miniconda3/include/python3.7m/pybind11/cast.h:1874
#3  pybind11::detail::argument_loader<Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float> >::call<void, pybind11::detail::void_type, void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>)>(void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>)) && (f=<optimized out>, this=<optimized out>)
    at /home/ubuntu/miniconda3/include/python3.7m/pybind11/cast.h:1856
#4  void pybind11::cpp_function::initialize<void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), void, Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>, pybind11::name, pybind11::scope, pybind11::sibling, char [1]>(void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), void (*)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [1])::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const (call=..., 
    __closure=0x0) at /home/ubuntu/miniconda3/include/python3.7m/pybind11/pybind11.h:154
#5  void pybind11::cpp_function::initialize<void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), void, Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>, pybind11::name, pybind11::scope, pybind11::sibling, char [1]>(void (*&)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), void (*)(Scene const&, RenderOptions const&, ptr<float>, ptr<float>, std::shared_ptr<DScene>, ptr<float>), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [1])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
    at /home/ubuntu/miniconda3/include/python3.7m/pybind11/pybind11.h:132
#6  0x00007fffa193bfcc in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7ffff6d7ed08, kwargs_in=0x0)
    at /home/ubuntu/miniconda3/include/python3.7m/pybind11/pybind11.h:627
#7  0x00005555556cd6e4 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:690
#8  0x00005555556cd801 in _PyCFunction_FastCallKeywords (func=0x7fffa3ece750, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>)
    at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:730
#9  0x00005555557292bc in call_function (kwnames=0x0, oparg=6, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4568
#10 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3093
#11 0x000055555566a4f9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3930
#12 0x000055555566b5d5 in _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:376
#13 0x00007fffe8cc9ce9 in THPFunction_apply(_object*, _object*) ()
   from /home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#14 0x0000555555690be7 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x7fff8a2f41b0)
    at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:768
#15 PyCFunction_Call () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:784
#16 0x000055555572a151 in do_call_core (kwdict=0x0, callargs=0x555557e38468, func=0x7fff8a2f41b0)
    at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4641
#17 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3191
#18 0x000055555566a4f9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3930
#19 0x000055555566b3c4 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3959
#20 0x000055555566b3ec in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:524
#21 0x0000555555783874 in run_mod () at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:1035
#22 0x000055555578db81 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:988
#23 0x000055555578dd73 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:429
#24 0x000055555578ee5f in pymain_run_file (p_cf=0x7fffffffd9e0, filename=0x5555558c63e0 L"test_shadow_light.py", fp=0x555555948360)
    at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:427
#25 pymain_run_filename (cf=0x7fffffffd9e0, pymain=0x7fffffffdaf0) at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:1627
#26 pymain_run_python (pymain=0x7fffffffdaf0) at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:2877
#27 pymain_main () at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:3038
#28 0x000055555578ef7c in _Py_UnixMain () at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:3073
#29 0x00007ffff7810830 in __libc_start_main (main=0x55555564aed0 <main>, argc=2, argv=0x7fffffffdc48, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffdc38) at ../csu/libc-start.c:291
#30 0x0000555555734122 in _start () at ../sysdeps/x86_64/elf/start.S:103

I then tried setting CUDA_LAUNCH_BLOCKING=1, and this circumvents the issue. Is that an acceptable solution, or does it come with compromises (e.g. performance)?

BachiLi commented 5 years ago

It's indeed a synchronization issue. Most likely we access unified memory on the CPU while a GPU kernel is still executing. This only results in a segmentation fault or bus error on pre-Pascal devices, so I hadn't noticed it. I pushed a fix; does the latest commit fix your problem?

Using CUDA_LAUNCH_BLOCKING=1 does compromise performance, since redner launches a lot of kernels during rendering. It is good for debugging, though.
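
For debugging, one way to keep the slowdown contained is to set CUDA_LAUNCH_BLOCKING=1 for a single child process instead of exporting it in the shell. The snippet below is only a minimal sketch of that idea: the helper names blocking_env and run_blocking are hypothetical and not part of redner, and it assumes the script is started with the same Python interpreter.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of an environment with CUDA_LAUNCH_BLOCKING=1 set.

    `base` defaults to the current process environment; the input mapping
    is copied, never modified in place.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous kernel launches
    return env


def run_blocking(script):
    """Run `script` in a child Python process and return its exit code.

    Only the child sees CUDA_LAUNCH_BLOCKING=1, so other processes keep
    the default asynchronous launch behaviour.
    """
    completed = subprocess.run([sys.executable, script], env=blocking_env())
    return completed.returncode
```

With these helpers, run_blocking("test_shadow_light.py") would run the script from this thread with blocking launches. A negative return code means the child was terminated by a signal, which is how a bus error would show up on Linux.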

mguillau commented 5 years ago

Yes, it works. Thanks for the swift fix!