Segfault in __pyx_f_5adbus_5sdbus_call_callback

jameshilliard commented 5 years ago

Under some environments(such as ubuntu) I seem to get this segfault. Any idea why that would be happening? backtrace:

(gdb) bt
#0  0x00007ffff340f2e6 in sd_bus_slot_unref () from /lib/x86_64-linux-gnu/libsystemd.so.0
#1  0x00007ffff34d1728 in __pyx_f_5adbus_5sdbus_call_callback (__pyx_v_m=0x1034330, 
    __pyx_v_userdata=0x7ffff2b77c78, __pyx_v_err=<optimized out>) at adbus/sdbus.c:21986
#2  0x00007ffff3409064 in ?? () from /lib/x86_64-linux-gnu/libsystemd.so.0
#3  0x00007ffff34d400e in __pyx_pf_5adbus_5sdbus_7Service_8process (__pyx_v_self=<optimized out>, 
    __pyx_v_self=<optimized out>) at adbus/sdbus.c:14618
#4  __pyx_pw_5adbus_5sdbus_7Service_9process (__pyx_v_self=0x7ffff3e0e830, unused=<optimized out>)
    at adbus/sdbus.c:14517
#5  0x00000000005d5a41 in _PyMethodDef_RawFastCallKeywords ()
#6  0x00000000005d7980 in _PyObject_FastCallKeywords ()
#7  0x000000000062ea6f in ?? ()
#8  0x00000000005d5ef7 in _PyMethodDef_RawFastCallDict ()
#9  0x00000000005d62da in PyCFunction_Call ()
#10 0x0000000000551e0c in _PyEval_EvalFrameDefault ()
#11 0x00000000005d68ec in _PyFunction_FastCallKeywords ()
#12 0x000000000054d21c in _PyEval_EvalFrameDefault ()
#13 0x00000000005d68ec in _PyFunction_FastCallKeywords ()
#14 0x000000000054d21c in _PyEval_EvalFrameDefault ()
#15 0x00000000005d68ec in _PyFunction_FastCallKeywords ()
#16 0x000000000054d21c in _PyEval_EvalFrameDefault ()
#17 0x00000000005d68ec in _PyFunction_FastCallKeywords ()
#18 0x000000000054d088 in _PyEval_EvalFrameDefault ()
#19 0x000000000054a8c1 in _PyEval_EvalCodeWithName ()
#20 0x000000000054cb83 in PyEval_EvalCode ()
#21 0x000000000062d012 in ?? ()
#22 0x000000000062d0ca in PyRun_FileExFlags ()
#23 0x000000000062ddd8 in PyRun_SimpleFileExFlags ()
#24 0x0000000000650ad5 in ?? ()
#25 0x0000000000650bfe in _Py_UnixMain ()
#26 0x00007ffff7bbc09b in __libc_start_main (main=0x4ba4a0 <main>, argc=2, argv=0x7fffffffd878, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd868)
    at ../csu/libc-start.c:308
#27 0x00000000005dee4a in _start ()

charleseidsness commented 5 years ago

We haven't seen that one before. Do you know what version of libsystemd you have?

It could be line 9 of adbus/sdbus/call.pyx, it doesn't look like we are saving the slot reference that returns from an unref which could maybe cause this, you could maybe try changing it to: self._slot = sdbus_h.sd_bus_slot_unref(self._slot) from: sdbus_h.sd_bus_slot_unref(call._slot)

You'll have to run setup.py with --cythonize to recreate the c file.

If you aren't able to get this working / tested we will probably have some time to run some of our own testing in a few days (maybe on the weekend).

Thanks! Charles

jameshilliard commented 5 years ago

Do you know what version of libsystemd you have?

239-7ubuntu10.10

It could be line 9 of adbus/sdbus/call.pyx, it doesn't look like we are saving the slot reference that returns from an unref which could maybe cause this, you could maybe try changing it to:

This doesn't seem to work, there's no self in the function:

Error compiling Cython file:
------------------------------------------------------------
...
        sdbus_h.sd_bus_error *err):
    cdef PyObject *call_ptr = <PyObject*>userdata
    cdef Call call = <Call>call_ptr
    cdef Message message = Message()

    self._slot = sdbus_h.sd_bus_slot_unref(self._slot)
                                          ^
------------------------------------------------------------

adbus/sdbus/call.pyx:9:43: undeclared name not builtin: self

charleseidsness commented 5 years ago

I'm sorry, I meant change it to:

call._slot = sdbus_h.sd_bus_slot_unref(call._slot)

That really is a bug (it may not be causing your issue, but it's a bug non-the-less). I'm going to make that change now and check it it. Let me know if helps with your issue.

jameshilliard commented 5 years ago

call._slot = sdbus_h.sd_bus_slot_unref(call._slot)

Still seems to segfault with that.

The crash seems to be limited to one specific function call in my app:

nm_service = adbus.Service(bus='system')
@classmethod
async def get_ipv6_config(cls, path):
    config = await adbus.client.get_all(
        cls.nm_service,
        'org.freedesktop.NetworkManager',
        path,
        'org.freedesktop.NetworkManager.IP6Config'
    )
    return config

Where path is /org/freedesktop/NetworkManager/IP6Config/9 and await adbus.client.get_all crashes before returning back the response. This is especially tricky to debug since I can't reliably reproduce it, it happens to only one of my developers and on one of their ubuntu development systems intermittently(although frequently enough that they can't properly use that particular system for development) and for this one specific call(my app does many other calls to network manager and when it crashes it always seems to be that specific call).

When it doesn't crash I get this for config in the get_ipv6_config function:

{'AddressData': [{'address': 'fe80::7962:5739:c34f:deef', 'prefix': 64}],
 'Addresses': [([254,
                 128,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 121,
                 98,
                 87,
                 57,
                 195,
                 79,
                 222,
                 239],
                64,
                [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])],
 'DnsOptions': [],
 'DnsPriority': 100,
 'Domains': [],
 'Gateway': '',
 'Nameservers': [],
 'RouteData': [{'dest': 'fe80::', 'metric': 100, 'prefix': 64},
               {'dest': 'ff00::', 'metric': 256, 'prefix': 8, 'table': 255}],
 'Routes': [([254, 128, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             64,
             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             100)],
 'Searches': []}

This is what I get when using dbus-send for that same call:

# dbus-send --system --print-reply --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager/IP6Config/9 org.freedesktop.DBus.Properties.GetAll string:"org.freedesktop.NetworkManager.IP6Config"
method return time=1552816171.977614 sender=:1.13 -> destination=:1.6665 serial=8969 reply_serial=2
   array [
      dict entry(
         string "Addresses"
         variant             array [
               struct {
                  array of bytes [
                     fe 80 00 00 00 00 00 00 79 62 57 39 c3 4f de ef
                  ]
                  uint32 64
                  array of bytes [
                     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
                  ]
               }
            ]
      )
      dict entry(
         string "AddressData"
         variant             array [
               array [
                  dict entry(
                     string "address"
                     variant                         string "fe80::7962:5739:c34f:deef"
                  )
                  dict entry(
                     string "prefix"
                     variant                         uint32 64
                  )
               ]
            ]
      )
      dict entry(
         string "Gateway"
         variant             string ""
      )
      dict entry(
         string "Routes"
         variant             array [
               struct {
                  array of bytes [
                     fe 80 00 00 00 00 00 00 00 00 00 00 00 00 00 00
                  ]
                  uint32 64
                  array of bytes [
                     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
                  ]
                  uint32 100
               }
            ]
      )
      dict entry(
         string "RouteData"
         variant             array [
               array [
                  dict entry(
                     string "dest"
                     variant                         string "fe80::"
                  )
                  dict entry(
                     string "prefix"
                     variant                         uint32 64
                  )
                  dict entry(
                     string "metric"
                     variant                         uint32 100
                  )
               ]
               array [
                  dict entry(
                     string "dest"
                     variant                         string "ff00::"
                  )
                  dict entry(
                     string "prefix"
                     variant                         uint32 8
                  )
                  dict entry(
                     string "metric"
                     variant                         uint32 256
                  )
                  dict entry(
                     string "table"
                     variant                         uint32 255
                  )
               ]
            ]
      )
      dict entry(
         string "Nameservers"
         variant             array [
            ]
      )
      dict entry(
         string "Domains"
         variant             array [
            ]
      )
      dict entry(
         string "Searches"
         variant             array [
            ]
      )
      dict entry(
         string "DnsOptions"
         variant             array [
            ]
      )
      dict entry(
         string "DnsPriority"
         variant             int32 100
      )
   ]

charleseidsness commented 5 years ago

It's possible that we don't need line 9 of sdbus/call.pyx anymore. Some of the systemd reference code has it, and some doesn't, and it looks like there may be an additional unref after the callback is called.

You could try removing it, though, I suspect you will just push the crash a few lines further up the stack, or after the return from the callback.

charleseidsness commented 5 years ago

Or, it's possible python is freeing the Call instance before the callback is called, to be safe we should probably increment before the callback, and then decrement after the callback, the Python reference counter for the Call object.

I'll look into that.

jameshilliard commented 5 years ago

Or, it's possible python is freeing the Call instance before the callback is called, to be safe we should probably increment before the callback, and then decrement after the callback, the Python reference counter for the Call object.

Yeah, I'm thinking it's likely something like that, I'm not super familiar with cython myself though. Would adding a check on call._slot along these lines make sense?

if call._slot:
    sdbus_h.sd_bus_slot_unref(call._slot)

charleseidsness commented 5 years ago

That probably isn't going to help, if call is removed by the Python Garbage Collector that memory may not be zeroed, and the library call runs the same check anyway.

The only time that it's a possibility that the Python GC has removed it is if a timeout has occurred, that's definitely a bug though, but it may not be causing your issue. I'll add reference count increments / decrement and we can see if it fixes it.

charleseidsness commented 5 years ago

I just made this update, maybe give it a try again.

jameshilliard commented 5 years ago

I'm going to close this for now, my developer who was seeing this segfault hasn't seen it happen on the current master branch, although due to the intermittent nature it's hard to say for sure it's fixed. Seems either your fix or my fix for the Invalid read bug fixed it.

charleseidsness commented 5 years ago

Thanks, hopefully we got it. If it comes back feel free to re-open.

ccxtechnologies / adbus

Segfault in __pyx_f_5adbus_5sdbus_call_callback #42