Closed seehuhn closed 8 years ago
Could you add this somewhere in the program? import faulthandler; faulthandler.enable()
It should print a Python stack trace when crashing, which hopefully will hint at where in the code the problem is.
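For reference, a minimal sketch of the faulthandler setup suggested above. The log file location is an assumption made for illustration; by default faulthandler writes to sys.stderr, which is what the one-liner above does.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any file that stays open until exit will do.
log_path = os.path.join(tempfile.mkdtemp(), "crash.log")
log_file = open(log_path, "w")

# Install handlers for fatal signals (SIGSEGV, SIGFPE, SIGABRT, SIGBUS,
# SIGILL) that dump the Python traceback of all threads to log_file.
faulthandler.enable(file=log_file, all_threads=True)

# Dump the current traceback explicitly, to see what the output looks
# like without actually crashing the interpreter.
faulthandler.dump_traceback(file=log_file, all_threads=True)

with open(log_path) as f:
    print(f.read())
```

The same handlers can be enabled without editing the script, either with `python3 -X faulthandler demo1.py` or by setting the environment variable PYTHONFAULTHANDLER=1.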
I tried, but not much luck:
voss@flammeri [..roject/jvplot] ./demo1.py
Fatal Python error: Segmentation fault
Current thread 0x00007fff76eba300 (most recent call first):
Segmentation fault: 11
voss@flammeri [..roject/jvplot] cat demo1.py
#! /usr/bin/env python3
import faulthandler; faulthandler.enable()
import numpy as np
from jvplot import Plot
with Plot('demo1.pdf', '4.5in', '4.5in') as fig:
    fig.scatter_plot(np.random.rand(100),
                     np.random.rand(100),
                     aspect=1,
                     margin="1cm")
The problem seems to happen during tear-down: if I add a print statement in the last line of the script, this is still executed before the crash. Also, I simplified the code to not use the with statement, and the crash still occurs:
voss@flammeri [..roject/jvplot] ./demo1.py
done
Fatal Python error: Segmentation fault
Current thread 0x00007fff76eba300 (most recent call first):
Segmentation fault: 11
voss@flammeri [..roject/jvplot] cat demo1.py
#! /usr/bin/env python3
import faulthandler; faulthandler.enable()
from jvplot import Plot
fig = Plot('demo1.pdf', '4.5in', '4.5in')
fig.scatter_plot([1,2,3], [1,2,3], aspect=1, margin="1cm")
fig.close()
print("done")
Note that the print("done") did indeed produce output.
By any chance, do any of these libraries use thread locals? Try CFFI master, in case this is https://bitbucket.org/cffi/cffi/issues/223/threadinglocal-segfaults-in-ffigc-callback
If it’s not, I’m out of ideas and I’m afraid you’re gonna need a C debugger to find out where that null pointer comes from. Sounds like something where rr’s “time travel debugging” might help.
Hi Simon,
I experimented a bit, and the problem is really hard to nail down. I now have a single script which only imports numpy and cairocffi, and still exhibits the crash. But small changes to this script, e.g. removing some unused elements from a dictionary stored in a global variable, make the crash disappear. I suspect the problem depends either on timing or on details of the memory layout.
On 29 Sep 2015, at 14:00, Simon Sapin wrote:
By any chance, do any of these libraries use thread locals? Try CFFI master, in case this is https://bitbucket.org/cffi/cffi/issues/223/threadinglocal-segfaults-in-ffigc-callback
Ok, I'll try this next.
If it’s not, I’m out of ideas and I’m afraid you’re gonna need a C debugger to find out where that null pointer comes from. Sounds like something where rr’s “time travel debugging” might help.
This will probably require compiling a version of Python with debug symbols included?
By the way: the invalid pointer is not a null pointer, but points just behind the valid range. If I read the MacOS X crash dump correctly, at the time of the crash valid addresses were 105f9a000-105f9b000 and 1061dc000-1061dd000, and the program tries to access 105fa7365 which is in the gap between the two regions. Maybe some use-after-free?
I'll enquire and report back if I find out anything more.
Many thanks,
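The address arithmetic in this message can be checked with a short sketch. The two ranges and the faulting address are the ones quoted from the crash dump; treating each range's end address as exclusive is an assumption.

```python
# Address ranges and faulting address as quoted from the crash dump above.
valid_regions = [
    (0x105F9A000, 0x105F9B000),
    (0x1061DC000, 0x1061DD000),
]
fault_address = 0x105FA7365


def containing_region(address, regions):
    """Return the (start, end) region containing address, or None.

    Assumption: the end address of each region is exclusive.
    """
    for start, end in regions:
        if start <= address < end:
            return (start, end)
    return None


in_valid_region = containing_region(fault_address, valid_regions) is not None

# Distance from the end of the first region and to the start of the second.
bytes_past_first = fault_address - valid_regions[0][1]
bytes_before_second = valid_regions[1][0] - fault_address

print(in_valid_region)
print(hex(bytes_past_first), hex(bytes_before_second))
```

The faulting address lies 0xc365 bytes past the end of the first region, roughly 12 pages if pages are 4 KiB, so it is neither a null pointer nor an off-by-a-few-bytes overrun.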
I have now a single script which only imports numpy and cairocffi, and still exhibits the crash.
Can you copy that script here?
On 30 Sep 2015, at 15:49, Simon Sapin wrote:
I have now a single script which only imports numpy and cairocffi, and still exhibits the crash.
Can you copy that script here?
It's on the computer at home, I'll send it tonight.
Here is the script which crashes for me: https://gist.github.com/seehuhn/39a13ef38813d8c10c2f . This was obtained by copying all my code into one file, and removing bits until the crash disappeared. I have reached a dead end with this, though: nearly every modification now makes the crash disappear for me. For example:
- [...] numpy import makes the crash disappear
- [...] default dict makes the crash disappear
- [...] self.surface = surface near the bottom makes the crash disappear.
Suggestions on how to proceed would be most welcome.
I could reproduce the segfault on CPython 3.4, but not on 2.7, 3.2, or 3.3. (On those versions the program terminates normally.)
Since the segfault happens after print("done") and involves libffi, it might be in the destructor of an object returned by ffi.gc.
I tried replacing all ffi.gc(some_pointer, some_function) calls in cairocffi with ffi.gc(some_pointer, logging(some_function)), and adding:
def logging(f):
    def wrapper(arg):
        print(f, arg)
        return f(arg)
    return wrapper
… but then the program terminated without segfault.
Here is the stack trace I get in gdb:
#0 0x00007ffff1dcec90 in ?? ()
#1 0x00007ffff2ea11f0 in ffi_call_unix64 () from /usr/lib/libffi.so.6
#2 0x00007ffff2ea0c58 in ffi_call () from /usr/lib/libffi.so.6
#3 0x00007ffff26cc5f3 in cdata_call (cd=0x7ffff37ea788, args=<optimized out>, kwds=<optimized out>) at c/_cffi_backend.c:2536
#4 0x00007ffff79a3efb in PyObject_Call (func=func@entry=0x7ffff37ea788, arg=arg@entry=0x7ffff6a43908, kw=kw@entry=0x0)
at Objects/abstract.c:2040
#5 0x00007ffff79a4ba0 in PyObject_CallFunctionObjArgs (callable=callable@entry=0x7ffff37ea788) at Objects/abstract.c:2332
#6 0x00007ffff26c151e in gc_wref_remove (ffi_wref_tup=<optimized out>, key=0x7ffff57119f8) at c/cgc.c:38
#7 0x00007ffff79a3efb in PyObject_Call (func=func@entry=0x7ffff29a8f88, arg=arg@entry=0x7ffff6a3d0b8, kw=kw@entry=0x0)
at Objects/abstract.c:2040
#8 0x00007ffff79a4ba0 in PyObject_CallFunctionObjArgs (callable=callable@entry=0x7ffff29a8f88) at Objects/abstract.c:2332
#9 0x00007ffff7a3b25a in handle_callback (ref=ref@entry=0x7ffff57119f8, callback=callback@entry=0x7ffff29a8f88)
at Objects/weakrefobject.c:868
#10 0x00007ffff7a3e8a2 in PyObject_ClearWeakRefs (object=object@entry=0x7ffff2e57e40) at Objects/weakrefobject.c:915
#11 0x00007ffff26bff30 in cdata_dealloc (cd=0x7ffff2e57e40) at c/_cffi_backend.c:1533
#12 0x00007ffff79e1027 in dict_dealloc (mp=0x7ffff6ba6888) at Objects/dictobject.c:1383
#13 0x00007ffff79fbd23 in subtype_dealloc (self=0x7ffff6a439b0) at Objects/typeobject.c:1186
#14 0x00007ffff79e1027 in dict_dealloc (mp=0x7ffff5714e88) at Objects/dictobject.c:1383
#15 0x00007ffff79fb6f8 in subtype_clear (self=0x7ffff6a43748) at Objects/typeobject.c:1047
#16 0x00007ffff7a9eb7d in delete_garbage (old=<optimized out>, collectable=<optimized out>) at Modules/gcmodule.c:866
#17 collect (generation=generation@entry=2, n_collected=n_collected@entry=0x0, n_uncollectable=n_uncollectable@entry=0x0,
nofail=nofail@entry=1) at Modules/gcmodule.c:1032
#18 0x00007ffff7a9f904 in _PyGC_CollectNoFail () at Modules/gcmodule.c:1638
#19 0x00007ffff7a77075 in PyImport_Cleanup () at Python/import.c:483
#20 0x00007ffff7a84158 in Py_Finalize () at Python/pythonrun.c:616
#21 0x00007ffff7a9d3ef in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:771
#22 0x0000555555554c06 in main ()
Note handle_callback in Objects/weakrefobject.c. cairocffi does use weak references in some cases, but this program does not use the corresponding cairocffi features. I still get the segfault when commenting out that cairocffi code.
It might mean you're not keeping callbacks alive for long enough, but I have no time to look into the details right now.
I have now managed to get a Python 3 with debug symbols running, so I can reproduce the crash in gdb. So far I have found that self in the call to subtype_clear (frame 15 in your stack trace) refers to the Plot object:
(gdb) p *self->ob_type
$31 = {ob_base = {ob_base = {ob_refcnt = 1, ob_type = 0x1001b6770}, ob_size = 0}, tp_name = 0x1006588e8 "Plot", tp_basicsize = 32, tp_itemsize = 0,
tp_dealloc = 0x10005e110 <subtype_dealloc>, tp_print = 0x0, tp_getattr = 0x0, tp_setattr = 0x0, tp_as_async = 0x10077bf78, tp_repr = 0x100061830 <object_repr>,
tp_as_number = 0x10077bf90, tp_as_sequence = 0x10077c0c8, tp_as_mapping = 0x10077c0b0, tp_hash = 0x1000e9710 <_Py_HashPointer>, tp_call = 0x0,
tp_str = 0x100063000 <slot_tp_str>, tp_getattro = 0x100051910 <PyObject_GenericGetAttr>, tp_setattro = 0x100051b50 <PyObject_GenericSetAttr>, tp_as_buffer = 0x10077c118,
tp_flags = 284161,
tp_doc = 0x10077c170 "The Plot Class repesents a file containing a single figure.\n\n Args:\n fname (string): The name of the file the figure wil be stored\n", ' ' <repeats 12 times>, "in. Any previously existing file with this nam"..., tp_traverse = 0x1000689e0 <subtype_traverse>, tp_clear = 0x100068b20 <subtype_clear>,
tp_richcompare = 0x100061990 <object_richcompare>, tp_weaklistoffset = 24, tp_iter = 0x0, tp_iternext = 0x100051710 <_PyObject_NextNotImplemented>, tp_methods = 0x0,
tp_members = 0x10077c148, tp_getset = 0x0, tp_base = 0x10077b0f8, tp_dict = 0x10189e848, tp_descr_get = 0x0, tp_descr_set = 0x0, tp_dictoffset = 16,
tp_init = 0x100063dc0 <slot_tp_init>, tp_alloc = 0x10005d850 <PyType_GenericAlloc>, tp_new = 0x100061ab0 <object_new>, tp_free = 0x100105a50 <PyObject_GC_Del>,
tp_is_gc = 0x0, tp_bases = 0x1006586a0, tp_mro = 0x0, tp_cache = 0x0, tp_subclasses = 0x0, tp_weaklist = 0x0, tp_del = 0x0, tp_version_tag = 745, tp_finalize = 0x0}
At this point a lot of the final garbage collection has already happened. Some thoughts:
- [...] surface field of the plot object?
I'll try to inquire further ...
The cffi manual states
[ffi.dlopen] returns a “library” object that gets closed when it goes out of scope. Make sure you keep the library object around as long as needed.
How does cairocffi ensure that cairocffi.cairo is still around by the time the cairo.cairo_surface_destroy destructor for a cairocffi.Surface object is called?
If I run the crashing script with Py_VerboseFlag set to 1, I get the messages
...
# destroy cairocffi.context
# destroy cairocffi.compat
# destroy cairocffi.matrix
# destroy cairocffi._ffi
# destroy cairocffi.fonts
# destroy cairocffi.surfaces
# destroy cairocffi.patterns
# destroy _cffi_backend
# destroy cairocffi.constants
...
# destroy cairocffi
...
before (I believe) the Plot object which holds my cairocffi.Surface object is freed. This makes me think that maybe the code tries to call cairo.cairo_surface_destroy after cairo is already unloaded.
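The module teardown messages can also be collected without a custom build, since the -v command-line option sets the same verbose flag. A rough sketch follows; the helper name destroy_messages and the imported module json are placeholders, and the exact set of messages varies between Python versions.

```python
import subprocess
import sys


def destroy_messages(code):
    """Run code in a child interpreter with -v; return its '# destroy' lines.

    Hypothetical helper for illustration: verbose output goes to stderr,
    so only stderr is filtered here.
    """
    result = subprocess.run(
        [sys.executable, "-v", "-c", code],
        capture_output=True,
        text=True,
        errors="replace",
    )
    return [
        line
        for line in result.stderr.splitlines()
        if line.startswith("# destroy")
    ]


# Placeholder workload; substitute the import list of the crashing script.
messages = destroy_messages("import json")
for line in messages:
    print(line)
```

The order of the printed lines shows which modules were deallocated first during interpreter shutdown.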
I seem to be experiencing something very similar to this with cairocffi on Ubuntu 15.10.
It's similarly difficult to debug! My test program at https://github.com/sde1000/python-wayland reliably segfaults on exit.
I think the theory is correct that cairocffi.cairo is being unloaded before cairo.cairo_surface_destroy is called by the ffi.gc code. ffi.gc increases the refcount of the destructor, but it looks like ffi functions do not themselves keep a reference to the library.
Doing something like the following to make sure we explicitly have a reference to the library seems to fix the segfault problem for me. I'm sure there's a tidier way of doing this, though — I'm not suggesting this be used as-is!
https://github.com/sde1000/cairocffi/commit/848cf70cd8b8bb6205ac543bf7afaae6c9ca2fde
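The idea of holding an explicit reference to the library can be illustrated with a pure-Python simulation. This is not the code from the linked commit and does not use cffi: FakeLibrary, make_destructor and Surface are hypothetical stand-ins, and the deterministic ordering relies on CPython's reference counting.

```python
class FakeLibrary:
    """Hypothetical stand-in for a library object returned by dlopen."""

    def __init__(self, events):
        self._events = events
        self.loaded = [True]  # shared flag the "C function" consults

    def __del__(self):
        # Simulates unloading the shared library once the last
        # reference to the library object is dropped.
        self.loaded[0] = False
        self._events.append("library unloaded")


def make_destructor(loaded_flag, events):
    """Build a destructor that, like a bare C function pointer, holds no
    reference to the library object itself."""

    def surface_destroy():
        if loaded_flag[0]:
            events.append("surface destroyed")
        else:
            events.append("surface destroyed after unload (would segfault)")

    return surface_destroy


class Surface:
    """Hypothetical wrapper that runs a destructor when it is freed."""

    def __init__(self, destructor, keepalive=None):
        self._destructor = destructor
        self._keepalive = keepalive  # optional reference to the library

    def __del__(self):
        self._destructor()


def run(keep_reference):
    events = []
    lib = FakeLibrary(events)
    surface = Surface(
        make_destructor(lib.loaded, events),
        keepalive=lib if keep_reference else None,
    )
    del lib      # like module teardown dropping the library first
    del surface  # the surface is freed afterwards
    return events


print(run(keep_reference=False))
print(run(keep_reference=True))
```

On CPython, the first run records the unload before the destructor call, which corresponds to the crash scenario. The second run records the destructor call first, because the surface keeps the library alive until it has been freed itself.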
Should be fixed by 848cf70.
Hello,
I am encountering intermittent crashes of cairocffi on MacOSX, often (always?) while the Python program is terminating. Here is one instance of this:
My program uses plain Python 3, with numpy and cairocffi as dependencies, and running similar programs which use only numpy seems not to lead to crashes. Thus I suspect the problem must be with cairocffi.
Details:
Any help in debugging this would be most welcome. If you need any other information, please let me know.
Many thanks, Jochen