amazon-ion / ion-python

A Python implementation of Amazon Ion.
https://amazon-ion.github.io/ion-docs/
Apache License 2.0
260 stars 51 forks source link

Segmentation Fault when issuing multiple `loads` #347

Closed nirosys closed 7 months ago

nirosys commented 7 months ago

There is an issue that seems to present when using multiple loads in a python session, or script, that results in a segmentation fault.

Repro

At revision da632c1.

(venv) $ python3
Python 3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import amazon.ion.simpleion as ion
>>> ion_data = ion.loads("{foo: bar}", text_buffer_size=60000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: loads() got an unexpected keyword argument 'text_buffer_size'
>>> ion_data = ion.loads("{foo: bar}", text_buffer_size_limit=60000)
>>> ion_data = ion.loads("{foo: bar}", text_buffer_size_limit=60000)
fish: Job 1, 'python3' terminated by signal SIGSEGV (Address boundary error)
$

Investigation

(venv) $ lldb python3
(lldb) target create "python3"
Current executable set to '/usr/local/bin/python3' (x86_64).
(lldb) run
Process 63750 launched: '/usr/local/bin/python3' (x86_64)
Process 63750 stopped
* thread #2, stop reason = exec
    frame #0: 0x000000010000e070 dyld`_dyld_start
dyld`:
->  0x10000e070 <+0>:  movq   %rsp, %rdi
    0x10000e073 <+3>:  andq   $-0x10, %rsp
    0x10000e077 <+7>:  movq   $0x0, %rbp
    0x10000e07e <+14>: pushq  $0x0
Target 0: (Python) stopped.
(lldb) c
Process 63750 resuming
Python 3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import amazon.ion.simpleion as ion
>>> ion_data = ion.loads("{foo: bar}", text_buffer_size_limit=60000)
>>> ion_data = ion.loads("{foo: bar}", text_buffer_size_limit=60000)
Process 63750 stopped
* thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffffffffffff)
    frame #0: 0x000000010079d797 Python`pymalloc_alloc + 75
Python`pymalloc_alloc:
->  0x10079d797 <+75>: movq   (%rax), %rsi
    0x10079d79a <+78>: movq   %rsi, 0x8(%rcx)
    0x10079d79e <+82>: testq  %rsi, %rsi
    0x10079d7a1 <+85>: je     0x10079d86a               ; <+286>
Target 0: (Python) stopped.
(lldb) bt
* thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffffffffffff)
  * frame #0: 0x000000010079d797 Python`pymalloc_alloc + 75
    frame #1: 0x000000010079c7f0 Python`_PyObject_Malloc + 60
    frame #2: 0x000000010084f8a9 Python`PyLong_FromVoidPtr + 144
    frame #3: 0x00000001006897ad Python`symtable_enter_block + 42
    frame #4: 0x00000001006f89ae Python`_PySymtable_Build + 233
    frame #5: 0x00000001006cc3fe Python`new_compiler.llvm.12878828316118427640 + 300
    frame #6: 0x00000001007661ff Python`_PyAST_Compile + 18
    frame #7: 0x00000001006f555e Python`run_mod.llvm.3794556486307325143 + 51
    frame #8: 0x00000001007a9196 Python`PyRun_InteractiveOneObjectEx + 604
    frame #9: 0x00000001006f4938 Python`_PyRun_InteractiveLoopObject + 149
    frame #10: 0x000000010061ffeb Python`_PyRun_AnyFileObject + 63
    frame #11: 0x000000010068663b Python`PyRun_AnyFileExFlags + 58
    frame #12: 0x0000000100705bf7 Python`pymain_run_stdin + 145
    frame #13: 0x00000001007d4e0e Python`Py_RunMain + 553
    frame #14: 0x00000001007054c9 Python`Py_BytesMain + 42
    frame #15: 0x00007ff80114c41f dyld`start + 1903
(lldb)

Another stack trace, when reading a dataset that initially brought the issue to our attention.

Target 0: (Python) stopped.
(lldb) bt
error: ionc.cpython-312-darwin.so debug map object file "/Users/glitch/Code/ion-python/build/temp.macosx-13.0-x86_64-cpython-312/amazon/ion/ioncmodule.o" changed
 (actual: 0x66060bf9, debug map: 0x6605fcbe) since this executable was linked, debug info will not be loaded
* thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x2)
  * frame #0: 0x000000010079d797 Python`pymalloc_alloc + 75
    frame #1: 0x000000010079e77a Python`_PyObject_Realloc + 315
    frame #2: 0x000000010072b6c4 Python`PyList_Append + 198
    frame #3: 0x0000000104f941ab ionc.cpython-312-darwin.so`ionc_add_to_container + 123
    frame #4: 0x0000000104f9364a ionc.cpython-312-darwin.so`ionc_read_value + 3146
    frame #5: 0x0000000104f942ad ionc.cpython-312-darwin.so`ionc_read_all + 109
    frame #6: 0x0000000104f9405c ionc.cpython-312-darwin.so`ionc_read_into_container + 108
    frame #7: 0x0000000104f93501 ionc.cpython-312-darwin.so`ionc_read_value + 2817
    frame #8: 0x0000000104f942ad ionc.cpython-312-darwin.so`ionc_read_all + 109
    frame #9: 0x0000000104f9405c ionc.cpython-312-darwin.so`ionc_read_into_container + 108
    frame #10: 0x0000000104f93592 ionc.cpython-312-darwin.so`ionc_read_value + 2962
    frame #11: 0x0000000104f942ad ionc.cpython-312-darwin.so`ionc_read_all + 109
    frame #12: 0x0000000104f9405c ionc.cpython-312-darwin.so`ionc_read_into_container + 108
    frame #13: 0x0000000104f93501 ionc.cpython-312-darwin.so`ionc_read_value + 2817
    frame #14: 0x0000000104f942ad ionc.cpython-312-darwin.so`ionc_read_all + 109
    frame #15: 0x0000000104f9405c ionc.cpython-312-darwin.so`ionc_read_into_container + 108
    frame #16: 0x0000000104f93501 ionc.cpython-312-darwin.so`ionc_read_value + 2817
    frame #17: 0x0000000104f942ad ionc.cpython-312-darwin.so`ionc_read_all + 109
    frame #18: 0x0000000104f9405c ionc.cpython-312-darwin.so`ionc_read_into_container + 108
    frame #19: 0x0000000104f93501 ionc.cpython-312-darwin.so`ionc_read_value + 2817
    frame #20: 0x0000000104f945e4 ionc.cpython-312-darwin.so`ionc_read_iter_next + 244
    frame #21: 0x000000010075d27e Python`builtin_next + 74
    frame #22: 0x00000001006c6658 Python`_PyEval_EvalFrameDefault + 52597
    frame #23: 0x00000001006b978b Python`PyEval_EvalCode + 197
    frame #24: 0x0000000100623bea Python`run_eval_code_obj.llvm.3794556486307325143 + 83
    frame #25: 0x00000001006f5596 Python`run_mod.llvm.3794556486307325143 + 107
    frame #26: 0x00000001006f506d Python`pyrun_file + 133
    frame #27: 0x00000001006f4b9a Python`_PyRun_SimpleFileObject + 287
    frame #28: 0x0000000100620040 Python`_PyRun_AnyFileObject + 148
    frame #29: 0x00000001007d5b81 Python`pymain_run_file_obj + 226
    frame #30: 0x000000010068f9b5 Python`pymain_run_file + 89
    frame #31: 0x00000001007d4fc1 Python`Py_RunMain + 988
    frame #32: 0x00000001007054c9 Python`Py_BytesMain + 42
    frame #33: 0x00007ff80114c41f dyld`start + 1903

Appears to be a reference counting issue.

nirosys commented 7 months ago

Narrowed this down to a very trivial repro.

The missinggc error is important, not because it's missing, but for some reason this pushes python into a state where the invalid address is poked. Running with gc imported, does not end up resulting in a segfault.

$ cat trivial.py 
import amazon.ion.simpleion as ion

print("C Extension Enabled? ", ion.c_ext)

ion_data = ion.loads("", text_buffer_size_limit=60000)
gc.collect()
$ python3 ./trivial.py
C Extension Enabled?  True
Traceback (most recent call last):
  File "/Users/glitch/Code/ion-python/./trivial.py", line 6, in <module>
    gc.collect()
    ^^
NameError: name 'gc' is not defined. Did you forget to import 'gc'?
fish: Job 1, 'python3 ./trivial.py' terminated by signal SIGSEGV (Address boundary error)
$

This led me to examining the ionc_read method closer, to see where a reference might be mishandled. I noticed the Py_XDECREF(text_buffer_size_limit) and re-ran trivial.py without specifying a text_buffer_size_limit. This ended up running without issue.

Looking at the documentation for PyArg_ParseTupleAndKeywords the O format does not create a strong reference for the object provided. This is the opposite behavior to O when using PyArgs_BuildValue. I must have read the build fmt section twice before realizing there was a separate section for parsing, and the fmts could have different behaviors.

Eliminating the XDECREF appears to have fixed the crash, and all of my tests show multiple loads are working fine in the tests that weren't previously.