Closed jameshilliard closed 5 years ago
We haven't seen that one before. Do you know what version of libsystemd you have?
It could be line 9 of adbus/sdbus/call.pyx. It doesn't look like we are saving the slot reference that the unref returns, which could maybe cause this. You could try changing it to:
self._slot = sdbus_h.sd_bus_slot_unref(self._slot)
from:
sdbus_h.sd_bus_slot_unref(call._slot)
You'll have to run setup.py with --cythonize to regenerate the C file.
If you aren't able to get this working / tested we will probably have some time to run some of our own testing in a few days (maybe on the weekend).
Thanks! Charles
Do you know what version of libsystemd you have?
239-7ubuntu10.10
It could be line 9 of adbus/sdbus/call.pyx. It doesn't look like we are saving the slot reference that the unref returns, which could maybe cause this. You could try changing it to:
This doesn't seem to work; there's no self in the function:
Error compiling Cython file:
------------------------------------------------------------
...
sdbus_h.sd_bus_error *err):
cdef PyObject *call_ptr = <PyObject*>userdata
cdef Call call = <Call>call_ptr
cdef Message message = Message()
self._slot = sdbus_h.sd_bus_slot_unref(self._slot)
^
------------------------------------------------------------
adbus/sdbus/call.pyx:9:43: undeclared name not builtin: self
I'm sorry, I meant change it to:
call._slot = sdbus_h.sd_bus_slot_unref(call._slot)
That really is a bug (it may not be causing your issue, but it's a bug nonetheless). I'm going to make that change now and check it in. Let me know if it helps with your issue.
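To illustrate why the assignment matters, here is a minimal plain-Python sketch, not the actual adbus code. `Slot` and `slot_unref` are made-up stand-ins for `sd_bus_slot` and `sd_bus_slot_unref()`, which always returns NULL so that callers can clear their pointer in the same statement:

```python
class Slot:
    """Hypothetical stand-in for an sd_bus_slot with a manual refcount."""

    def __init__(self):
        self.refs = 1
        self.freed = False


def slot_unref(slot):
    """Mimic sd_bus_slot_unref(): drop one reference, always return None."""
    if slot is None:
        # Unref on an empty pointer is a no-op.
        return None
    slot.refs -= 1
    if slot.refs == 0:
        slot.freed = True
    return None


class Call:
    """Hypothetical stand-in for the Cython Call object."""

    def __init__(self):
        self._slot = Slot()


call = Call()
stale = call._slot  # what call._slot would still point at without the assignment

# Storing the return value clears the field, so a later unref is harmless.
call._slot = slot_unref(call._slot)
second_unref = slot_unref(call._slot)
```

Without the assignment, `call._slot` would keep pointing at the freed slot (like `stale` here), and a second unref would touch freed memory in the real C library.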
call._slot = sdbus_h.sd_bus_slot_unref(call._slot)
Still seems to segfault with that.
The crash seems to be limited to one specific function call in my app:
nm_service = adbus.Service(bus='system')

@classmethod
async def get_ipv6_config(cls, path):
    config = await adbus.client.get_all(
        cls.nm_service,
        'org.freedesktop.NetworkManager',
        path,
        'org.freedesktop.NetworkManager.IP6Config'
    )
    return config
Where path is /org/freedesktop/NetworkManager/IP6Config/9 and await adbus.client.get_all crashes before returning the response. This is especially tricky to debug since I can't reliably reproduce it. It happens to only one of my developers, on one of their Ubuntu development systems, and only intermittently (although frequently enough that they can't properly use that particular system for development). It also happens only for this one specific call: my app makes many other calls to NetworkManager, and when it crashes it always seems to be that specific call.
When it doesn't crash I get this for config in the get_ipv6_config function:
{'AddressData': [{'address': 'fe80::7962:5739:c34f:deef', 'prefix': 64}],
'Addresses': [([254,
128,
0,
0,
0,
0,
0,
0,
121,
98,
87,
57,
195,
79,
222,
239],
64,
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])],
'DnsOptions': [],
'DnsPriority': 100,
'Domains': [],
'Gateway': '',
'Nameservers': [],
'RouteData': [{'dest': 'fe80::', 'metric': 100, 'prefix': 64},
{'dest': 'ff00::', 'metric': 256, 'prefix': 8, 'table': 255}],
'Routes': [([254, 128, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
64,
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
100)],
'Searches': []}
This is what I get when using dbus-send for that same call:
# dbus-send --system --print-reply --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager/IP6Config/9 org.freedesktop.DBus.Properties.GetAll string:"org.freedesktop.NetworkManager.IP6Config"
method return time=1552816171.977614 sender=:1.13 -> destination=:1.6665 serial=8969 reply_serial=2
array [
dict entry(
string "Addresses"
variant array [
struct {
array of bytes [
fe 80 00 00 00 00 00 00 79 62 57 39 c3 4f de ef
]
uint32 64
array of bytes [
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
]
}
]
)
dict entry(
string "AddressData"
variant array [
array [
dict entry(
string "address"
variant string "fe80::7962:5739:c34f:deef"
)
dict entry(
string "prefix"
variant uint32 64
)
]
]
)
dict entry(
string "Gateway"
variant string ""
)
dict entry(
string "Routes"
variant array [
struct {
array of bytes [
fe 80 00 00 00 00 00 00 00 00 00 00 00 00 00 00
]
uint32 64
array of bytes [
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
]
uint32 100
}
]
)
dict entry(
string "RouteData"
variant array [
array [
dict entry(
string "dest"
variant string "fe80::"
)
dict entry(
string "prefix"
variant uint32 64
)
dict entry(
string "metric"
variant uint32 100
)
]
array [
dict entry(
string "dest"
variant string "ff00::"
)
dict entry(
string "prefix"
variant uint32 8
)
dict entry(
string "metric"
variant uint32 256
)
dict entry(
string "table"
variant uint32 255
)
]
]
)
dict entry(
string "Nameservers"
variant array [
]
)
dict entry(
string "Domains"
variant array [
]
)
dict entry(
string "Searches"
variant array [
]
)
dict entry(
string "DnsOptions"
variant array [
]
)
dict entry(
string "DnsPriority"
variant int32 100
)
]
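As a quick cross-check that the two dumps describe the same address, the byte list from the Python result can be decoded with the standard library. This is only a sketch for comparing the outputs; the variable names are made up:

```python
import ipaddress

# 'Addresses' entry as returned by adbus.client.get_all (copied from above).
addr_bytes = [254, 128, 0, 0, 0, 0, 0, 0, 121, 98, 87, 57, 195, 79, 222, 239]
prefix = 64

# Same bytes rendered the way dbus-send prints an "array of bytes".
hex_dump = " ".join(f"{b:02x}" for b in addr_bytes)

# Decode the 16 raw bytes into an IPv6 address and attach the prefix length.
address = ipaddress.IPv6Address(bytes(addr_bytes))
interface = ipaddress.IPv6Interface(f"{address}/{prefix}")

print(hex_dump)
print(interface)
```

The decoded address should match the 'address' string in AddressData, which suggests the reply itself is parsed correctly when the call does not crash.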
It's possible that we don't need line 9 of sdbus/call.pyx anymore. Some of the systemd reference code has it, and some doesn't, and it looks like there may be an additional unref after the callback is called.
You could try removing it, though I suspect you will just push the crash a few lines further up the stack, or to after the return from the callback.
Or it's possible Python is freeing the Call instance before the callback is called. To be safe, we should probably increment the Python reference count for the Call object before the callback and decrement it after the callback.
I'll look into that.
Or it's possible Python is freeing the Call instance before the callback is called. To be safe, we should probably increment the Python reference count for the Call object before the callback and decrement it after the callback.
Yeah, I'm thinking it's likely something like that, though I'm not super familiar with Cython myself. Would adding a check on call._slot along these lines make sense?
if call._slot:
    sdbus_h.sd_bus_slot_unref(call._slot)
That probably isn't going to help. If call has been removed by the Python garbage collector, that memory may not be zeroed, and the library call runs the same check anyway.
The only time the Python GC could have removed it is if a timeout has occurred. That's definitely a bug, but it may not be causing your issue. I'll add reference count increments and decrements and we can see if that fixes it.
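The idea of holding a reference for the duration of the pending call can be sketched in plain Python as below. This is not the actual adbus change (in Cython it would presumably use Py_INCREF / Py_DECREF from cpython.ref); `PendingCalls`, `register` and `complete` are hypothetical names:

```python
import gc
import weakref


class Call:
    """Hypothetical stand-in for the Cython Call object."""


class PendingCalls:
    """Keeps a strong reference to each call until its callback has run."""

    def __init__(self):
        self._alive = {}

    def register(self, call):
        # Plays the role of the "increment before the callback".
        token = id(call)
        self._alive[token] = call
        return token

    def complete(self, token):
        # Plays the role of the "decrement after the callback".
        return self._alive.pop(token)


pending = PendingCalls()
call = Call()
probe = weakref.ref(call)  # lets us observe whether the object is still alive
token = pending.register(call)

del call  # the caller drops its reference, e.g. after a timeout
gc.collect()
alive_while_pending = probe() is not None

finished = pending.complete(token)  # the callback has run; release the reference
del finished
gc.collect()
alive_after_completion = probe() is not None

print(alive_while_pending, alive_after_completion)
```

The point is that the object handed to the C callback as userdata stays valid even if the Python side has already given up on the call.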
I just made this update, maybe give it a try again.
I'm going to close this for now. My developer who was seeing this segfault hasn't seen it happen on the current master branch, although due to its intermittent nature it's hard to say for sure that it's fixed. It seems either your fix or my fix for the Invalid read bug fixed it.
Thanks, hopefully we got it. If it comes back feel free to re-open.
Under some environments (such as Ubuntu) I seem to get this segfault. Any idea why that would be happening? backtrace: