catid / wirehair

Wirehair : O(N) Fountain Code for Large Data
http://wirehairfec.com
BSD 3-Clause "New" or "Revised" License

Python demo script causes segfault with Python2 #19

Open courtarro opened 4 years ago

courtarro commented 4 years ago

Running on 12-thread i7 in 64-bit Linux (Ubuntu Bionic). Compiled and installed libwirehair-shared.so and ran python2 whirehair.py:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5d6937e in wirehair::Codec::Encode (this=0x55b6ba50, block_id=1, block_out=0x555555bb2ef0, out_buffer_bytes=32)
    at /home/(redacted)/software/external/wirehair/WirehairCodec.cpp:4051
4051        if ((uint16_t)block_id == _block_count - 1) {

GDB stack trace:

#0  0x00007ffff5d6937e in wirehair::Codec::Encode (this=0x55b6ba50, block_id=1, block_out=0x555555bb2ef0, out_buffer_bytes=32)
    at /home/(redacted)/software/external/wirehair/WirehairCodec.cpp:4051
#1  0x00007ffff5d59af4 in wirehair_encode (codec=0x55b6ba50, blockId=1, blockDataOut=0x555555bb2ef0, outBytes=32, dataBytesOut=0x7ffff7ec9910)
    at /home/(redacted)/software/external/wirehair/wirehair.cpp:139
#2  0x00007ffff5f9bdae in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#3  0x00007ffff5f9b71f in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#4  0x00007ffff61aead4 in _ctypes_callproc () from /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
#5  0x00007ffff61ae4d5 in ?? () from /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
#6  0x000055555564df9e in PyEval_EvalFrameEx ()
#7  0x0000555555646b0a in PyEval_EvalCodeEx ()
#8  0x0000555555646429 in PyEval_EvalCode ()
#9  0x00005555556764cf in ?? ()
#10 0x0000555555671442 in PyRun_FileExFlags ()
#11 0x00005555556708bd in PyRun_SimpleFileExFlags ()
#12 0x000055555562075b in Py_Main ()
#13 0x00007ffff7a05b97 in __libc_start_main (main=0x5555556200c0 <main>, argc=2, argv=0x7fffffffde28, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffffffde18) at ../csu/libc-start.c:310
#14 0x000055555561ffda in _start ()

Works fine in Python 3. I am currently debugging.

courtarro commented 4 years ago

This is really weird. I expanded line 4051, which was triggering the segfault:

if ((uint16_t)block_id == _block_count - 1) {

to the following 4 lines:

uint16_t bc = _block_count;
uint16_t last_block = bc - 1;
uint16_t bid_u16 = (uint16_t)block_id;
if (bid_u16 == last_block) {

Now the segfault happens at the very first line, when attempting to read the value of _block_count. I don't understand why it would be unable to read that variable.

0x00007ffff5d69375 in wirehair::Codec::Encode (this=0x55b6ba50, block_id=1, block_out=0x555555bb2ef0, out_buffer_bytes=32)
    at /home/(redacted)/software/external/wirehair/WirehairCodec.cpp:4051
4051        uint16_t bc = _block_count;

GDB is also unable to read it. Here is the attempt to read block_id, which works, and _block_count, which doesn't:

(gdb) print block_id
$1 = 1
(gdb) print _block_count
Cannot access memory at address 0x55b6ba54

danieagle commented 4 years ago

Hi, Courtarro! Based on your gdb output, first try changing lines 97 and 98 to:

blockid = ctypes.c_uint16(0)
needed = ctypes.c_uint16(0)

Did that work? If yes, please try changing line 116 to:

ctypes.c_uint16(blockid.value), #ID of block to generate

Thanks For the patience! :-)

[]'s Dani.

courtarro commented 3 years ago

I finally got around to trying this. I replaced the uses of c_uint() mentioned above with c_uint16(), and also replaced c_int() with c_int32() in one other place. It still segfaults.

catid commented 3 years ago

If it's segfaulting, the best way to debug is probably to build in debug mode and attach a debugger to it. Some input to the C++ code is probably invalid.

courtarro commented 3 years ago

I'm not an expert at ctypes. Python thinks the encoder variable is the default c_int, rather than a full WirehairCodec object. Any reason that might confuse the garbage collection process? The variable stays in scope, so I don't think that would be it. But gdb is unable to access any member variable of the WirehairCodec object, which leads me to believe there's some sort of memory corruption going on.

With Python 2.7 going away, I'm not that worried about whether it works with Python 2.7 in the long term. My original motivation was to use this with GnuRadio 3.7, which is P2.7-based, and GR has since moved to Python 3. However, I'd like to better understand the problem in case it's actually just revealing a more serious underlying issue and P3 happens not to trigger it, but could end up failing later.

catid commented 3 years ago

I read some ctypes docs. I think what might be missing is this:

wirehair.wirehair_encoder_create.restype = ctypes.c_void_p

You may also need to wrap the call like this: c_void_p(wirehair.wirehair_encoder_create(...))

What may be happening is that the default return type is a 32-bit integer, which truncates the 64-bit pointer returned by the library. Passing the truncated value back in would lead to the invalid memory access you described...
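The truncation idea fits the addresses in the backtrace: the codec's `this` value fits in 32 bits, while `block_out` from the same call does not. The sketch below illustrates this without loading the library. `full_ptr` is a hypothetical address (the real pointer is not shown in this thread), and the `wirehair_encode` argument types are assumptions inferred from the parameter list in the stack trace; check them against wirehair.h before relying on them.

```python
import ctypes

# Addresses copied from the backtrace in this issue.
codec_this = 0x55B6BA50      # `this` seen by wirehair::Codec::Encode
block_out = 0x555555BB2EF0   # buffer pointer passed in the same call

# Hypothetical original codec address (not in the thread), chosen so that
# its low 32 bits match `this`.
full_ptr = 0x555555B6BA50

# ctypes integer types do no overflow checking, so c_int keeps only the
# low bits (32 on common platforms), as a default int restype would.
truncated = ctypes.c_int(full_ptr).value

print(hex(full_ptr), "->", hex(truncated))
print("bits needed:", codec_this.bit_length(), block_out.bit_length())


def declare_wirehair_signatures(lib):
    """Declare pointer-sized types on an already loaded wirehair library.

    The restype line is the fix suggested above. The argtypes list is an
    assumption based on the stack trace parameters, not on the header.
    """
    lib.wirehair_encoder_create.restype = ctypes.c_void_p
    lib.wirehair_encode.argtypes = [
        ctypes.c_void_p,                  # codec
        ctypes.c_uint,                    # blockId
        ctypes.c_void_p,                  # blockDataOut
        ctypes.c_uint32,                  # outBytes
        ctypes.POINTER(ctypes.c_uint32),  # dataBytesOut
    ]
```

Calling `declare_wirehair_signatures(wirehair)` right after loading the library, before the encoder is created, should keep the handle pointer-sized in both directions. Declaring `argtypes` matters because a `c_void_p` restype hands back a plain Python int, and ctypes passes undeclared Python ints as C int, which would truncate the pointer again on the way in. Wrapping the value in `c_void_p(...)` at the call site is the alternative.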