byllyfish / finsy

P4Runtime Controller Library
Apache License 2.0
9 stars 2 forks source link

Python 3.11 segmentation fault in grpc test; seg. fault in task.get_stack(limit=1) accessing frame.f_back #96

Closed byllyfish closed 1 year ago

byllyfish commented 1 year ago

Using grpcio 1.50.0 and Python 3.11.0, the test test_switch1 will fail with a segmentation fault.

Current thread 0x000000011678b600 (most recent call first):
  File "/usr/local/Cellar/python@3.11/3.11.0/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_tasks.py", line 53 in _task_get_stack
  File "/Users/bfish/code/finsy/.venv_311/lib/python3.11/site-packages/grpc/aio/_channel.py", line 315 in _close
  File "/Users/bfish/code/finsy/.venv_311/lib/python3.11/site-packages/grpc/aio/_channel.py", line 374 in close
  File "/Users/bfish/code/finsy/finsy/p4client.py", line 308 in close
  File "/Users/bfish/code/finsy/finsy/switch.py", line 393 in _run

If I disable the test P4RuntimeServer with an environment variable, the test succeeds.

It seems like there is a memory misbehavior bug in the grpc async server implementation, but client code is tripping over it in _task_get_stack. If I run client code without running any of the server test code, there is no segmentation fault.

For now, the CI tests for Python 3.11 are running with FINSY_TEST_P4RUNTIME_SERVER=skip and FINSY_TEST_GNMI_SERVER=skip.

byllyfish commented 1 year ago

See grpc/grpc#31441

byllyfish commented 1 year ago

Here's a smaller reproducer. I think the problem is that GRPC has close code that calls task.get_stack(limit=1). This seg-faults with certain tasks, perhaps because they are created with cython?

import asyncio

import finsy as fy
from finsy.test.gnmi_server import GNMIServer

def get_stack():
    tasks = asyncio.all_tasks()
    for task in tasks:
        print(task)
        task.get_stack(limit=1)  # seg faults: _task_get_stack

async def main():
    async with GNMIServer("127.0.0.1:19559").run():
        async with fy.GNMIClient("127.0.0.1:19559") as client:
            sub = client.subscribe()
            sub.once(fy.GNMIPath("/"))
            await sub._subscribe()

            get_stack()

asyncio.run(main())

The output on my machine looks like this:

$ python -X dev demo2.py
<Task pending name='Task-1' coro=<main() running at /Users/bfish/code/finsy/demo2.py:21> cb=[_run_until_complete_cb() at /Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/base_events.py:180] created at /Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/runners.py:100>
<Task pending name='Task-2' coro=<AioServer._server_main_loop()> wait_for=<Future pending cb=[Task.task_wakeup()] created at /Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/base_events.py:427> cb=[AioServer._serving_task_crash_handler()] created at /Users/bfish/code/finsy/.venv311/lib/python3.11/site-packages/grpc/aio/_server.py:112>
<Task pending name='Task-6' coro=<_handle_exceptions() running at src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi:None> wait_for=<Future finished result=None created at /Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/base_events.py:427> cb=[_add_callback_handler.<locals>.handle_callbacks()] created at /Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/events.py:80>
Fatal Python error: Segmentation fault

Thread 0x0000700002207000 (most recent call first):
  File "/Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/threading.py", line 975 in run
  File "/Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/threading.py", line 995 in _bootstrap

Current thread 0x000000010f26c600 (most recent call first):
  File "/Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/base_tasks.py", line 53 in _task_get_stack
  File "/Users/bfish/code/finsy/demo2.py", line 11 in get_stack
  File "/Users/bfish/code/finsy/demo2.py", line 21 in main
  File "/Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/events.py", line 80 in _run
  File "/Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/base_events.py", line 1904 in _run_once
  File "/Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/base_events.py", line 607 in run_forever
  File "/Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/base_events.py", line 640 in run_until_complete
  File "/Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/runners.py", line 118 in run
  File "/Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/runners.py", line 190 in run
  File "/Users/bfish/code/finsy/demo2.py", line 24 in <module>

Extension modules: grpc._cython.cygrpc, google._upb._message (total: 2)
[1]    20930 segmentation fault  python -X dev demo2.py

Note that the offending task's coroutine is cython...

<Task pending name='Task-6' coro=<_handle_exceptions() running at src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi:None> wait_for=<Future finished result=None created at /Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/base_events.py:427> cb=[_add_callback_handler.<locals>.handle_callbacks()] created at /Users/bfish/.pyenv/versions/3.11-dev/lib/python3.11/asyncio/events.py:80>
Fatal Python error: Segmentation fault
byllyfish commented 1 year ago

There is a Python 3.11 bug that may be related: python/cpython#99110.

The seg. fault happens on this line of code trying to access f_back.

https://github.com/python/cpython/blob/d75b2e2fc865af0a020a12b23383d8061e9c6e59/Lib/asyncio/base_tasks.py#L53

byllyfish commented 1 year ago

Still an invalid memory access on Python 3.11.1. Here's the same program with python 3.11.1 running under valgrind on ubuntu:

$ valgrind --track-origins=yes python -X dev demo.py 
==19645== Memcheck, a memory error detector
==19645== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==19645== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==19645== Command: python -X dev demo.py
==19645== 
Executing <Task pending name='Task-1' coro=<main() running at /home/bfish/code/finsy/demo.py:15> wait_for=<Future pending cb=[Task.task_wakeup()] created at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py:427> cb=[_run_until_complete_cb() at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py:180] created at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/runners.py:100> took 0.923 seconds
Executing <Task pending name='Task-1' coro=<main() running at /home/bfish/code/finsy/demo.py:19> wait_for=<Future pending cb=[Task.task_wakeup()] created at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py:427> cb=[_run_until_complete_cb() at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py:180] created at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/runners.py:100> took 0.252 seconds
Executing <Task pending name='Task-3' coro=<StreamStreamCall._prepare_rpc() running at /home/bfish/code/finsy/.venv/lib/python3.11/site-packages/grpc/aio/_call.py:643> wait_for=<Future pending cb=[Task.task_wakeup()] created at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py:427> created at /home/bfish/code/finsy/.venv/lib/python3.11/site-packages/grpc/aio/_call.py:632> took 0.306 seconds
<Task pending name='Task-2' coro=<AioServer._server_main_loop()> wait_for=<Future pending cb=[Task.task_wakeup()] created at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py:427> cb=[AioServer._serving_task_crash_handler()] created at /home/bfish/code/finsy/.venv/lib/python3.11/site-packages/grpc/aio/_server.py:112>
<Task pending name='Task-4' coro=<_AioCall._handle_status_once_received()> wait_for=<Future pending cb=[Task.task_wakeup()] created at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py:427> created at /home/bfish/code/finsy/.venv/lib/python3.11/site-packages/grpc/aio/_call.py:643>
<Task pending name='Task-6' coro=<_handle_exceptions() running at src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi:None> wait_for=<Future finished result=None created at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py:427> cb=[_add_callback_handler.<locals>.handle_callbacks()] created at /home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/events.py:80>
==19645== Invalid read of size 1
==19645==    at 0x2859F0: _PyFrame_IsIncomplete (pycore_frame.h:147)
==19645==    by 0x2859F0: PyFrame_GetBack (frameobject.c:1326)
==19645==    by 0x2859F0: frame_getback (frameobject.c:103)
==19645==    by 0x2BBFAE: _PyObject_GenericGetAttrWithDict (object.c:1278)
==19645==    by 0x2BB46A: PyObject_GetAttr (object.c:916)
==19645==    by 0x1F9B17: _PyEval_EvalFrameDefault (ceval.c:3464)
==19645==    by 0x3699E0: _PyEval_EvalFrame (pycore_ceval.h:73)
==19645==    by 0x3699E0: _PyEval_Vector (ceval.c:6435)
==19645==    by 0x25F573: _PyObject_VectorcallTstate (pycore_call.h:92)
==19645==    by 0x25F573: object_vacall (call.c:819)
==19645==    by 0x261F3B: PyObject_CallFunctionObjArgs (call.c:925)
==19645==    by 0x6624E95: _asyncio_Task_get_stack_impl (_asynciomodule.c:2304)
==19645==    by 0x6624E95: _asyncio_Task_get_stack (_asynciomodule.c.h:518)
==19645==    by 0x26D83F: method_vectorcall_FASTCALL_KEYWORDS (descrobject.c:426)
==19645==    by 0x25FD9E: _PyObject_VectorcallTstate (pycore_call.h:92)
==19645==    by 0x25FD9E: PyObject_Vectorcall (call.c:299)
==19645==    by 0x1FBA61: _PyEval_EvalFrameDefault (ceval.c:4772)
==19645==    by 0x279E04: _PyEval_EvalFrame (pycore_ceval.h:73)
==19645==    by 0x279E04: gen_send_ex2 (genobject.c:219)
==19645==    by 0x279E04: PyGen_am_send (genobject.c:280)
==19645==  Address 0xcdcdcdcdcdcdce12 is not stack'd, malloc'd or (recently) free'd
==19645== 
Fatal Python error: Segmentation fault

Thread 0x000000000aabd640 (most recent call first):
  File "/home/bfish/.pyenv/versions/3.11.1/lib/python3.11/threading.py", line 975 in run
  File "/home/bfish/.pyenv/versions/3.11.1/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/home/bfish/.pyenv/versions/3.11.1/lib/python3.11/threading.py", line 995 in _bootstrap

Current thread 0x0000000004b72b80 (most recent call first):
  File "/home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_tasks.py", line 53 in _task_get_stack
  File "/home/bfish/code/finsy/demo.py", line 11 in get_stack
  File "/home/bfish/code/finsy/demo.py", line 21 in main
  File "/home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/events.py", line 80 in _run
  File "/home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 1911 in _run_once
  File "/home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 607 in run_forever
  File "/home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 640 in run_until_complete
  File "/home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/runners.py", line 118 in run
  File "/home/bfish/.pyenv/versions/3.11.1/lib/python3.11/asyncio/runners.py", line 190 in run
  File "/home/bfish/code/finsy/demo.py", line 24 in <module>

Extension modules: grpc._cython.cygrpc, google._upb._message (total: 2)
==19645== 
==19645== Process terminating with default action of signal 11 (SIGSEGV)
==19645==    at 0x49E0A7C: __pthread_kill_implementation (pthread_kill.c:44)
==19645==    by 0x49E0A7C: __pthread_kill_internal (pthread_kill.c:78)
==19645==    by 0x49E0A7C: pthread_kill@@GLIBC_2.34 (pthread_kill.c:89)
==19645==    by 0x498C475: raise (raise.c:26)
==19645==    by 0x498C51F: ??? (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==19645==    by 0x2859EF: PyFrame_GetBack (frameobject.c:1326)
==19645==    by 0x2859EF: frame_getback (frameobject.c:103)
==19645== 
==19645== HEAP SUMMARY:
==19645==     in use at exit: 6,454,922 bytes in 7,966 blocks
==19645==   total heap usage: 19,529 allocs, 11,563 frees, 30,342,880 bytes allocated
==19645== 
==19645== LEAK SUMMARY:
==19645==    definitely lost: 0 bytes in 0 blocks
==19645==    indirectly lost: 0 bytes in 0 blocks
==19645==      possibly lost: 5,983,687 bytes in 4,258 blocks
==19645==    still reachable: 471,235 bytes in 3,708 blocks
==19645==                       of which reachable via heuristic:
==19645==                         stdstring          : 2,575 bytes in 50 blocks
==19645==                         newarray           : 360 bytes in 1 blocks
==19645==         suppressed: 0 bytes in 0 blocks
==19645== Rerun with --leak-check=full to see details of leaked memory
==19645== 
==19645== For lists of detected and suppressed errors, rerun with: -s
==19645== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)
byllyfish commented 1 year ago

There's an assert fail in the debug build of python 3.11.1, when printing information about the cython task.

Edit: The ResourceWarning is due to upb violating the defensive assertion added in https://github.com/python/cpython/issues/95324

$ valgrind python -X dev demo.py 
==34680== Memcheck, a memory error detector
==34680== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==34680== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==34680== Command: python -X dev demo.py
==34680== 
gc:0: ResourceWarning: Object of type google._upb._message.MessageMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.MessageMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.MessageMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.MessageMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.ScalarMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.ScalarMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.ScalarMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.ScalarMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.ScalarMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.ScalarMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.ScalarMapContainer is not untracked before destruction
gc:0: ResourceWarning: Object of type google._upb._message.ScalarMapContainer is not untracked before destruction
Executing <Task pending name='Task-1' coro=<main() running at /home/bfish/code/finsy/demo.py:15> wait_for=<Future pending cb=[Task.task_wakeup()] created at /home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_events.py:427> cb=[_run_until_complete_cb() at /home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_events.py:180] created at /home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/runners.py:100> took 0.798 seconds
Executing <Task pending name='Task-1' coro=<main() running at /home/bfish/code/finsy/demo.py:19> wait_for=<Future pending cb=[Task.task_wakeup()] created at /home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_events.py:427> cb=[_run_until_complete_cb() at /home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_events.py:180] created at /home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/runners.py:100> took 0.319 seconds
Executing <Task pending name='Task-3' coro=<StreamStreamCall._prepare_rpc() running at /home/bfish/code/finsy/.venv/lib/python3.11/site-packages/grpc/aio/_call.py:643> wait_for=<Future pending cb=[Task.task_wakeup()] created at /home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_events.py:427> created at /home/bfish/code/finsy/.venv/lib/python3.11/site-packages/grpc/aio/_call.py:632> took 0.289 seconds
<Task pending name='Task-3' coro=<StreamStreamCall._prepare_rpc() running at /home/bfish/code/finsy/.venv/lib/python3.11/site-packages/grpc/aio/_call.py:643> wait_for=<Future pending cb=[Task.task_wakeup()] created at /home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_events.py:427> created at /home/bfish/code/finsy/.venv/lib/python3.11/site-packages/grpc/aio/_call.py:632>
python: Objects/codeobject.c:771: PyCode_Addr2Line: Assertion `addrq >= 0 && addrq < _PyCode_NBYTES(co)' failed.
Fatal Python error: Aborted

Thread 0x000000000abf6640 (most recent call first):
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/threading.py", line 975 in run
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/threading.py", line 995 in _bootstrap

Current thread 0x0000000004b72b80 (most recent call first):
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/coroutines.py", line 104 in _format_coroutine
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_tasks.py", line 18 in _task_repr_info
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_tasks.py", line 28 in _task_repr
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/reprlib.py", line 21 in wrapper
  File "/home/bfish/code/finsy/demo.py", line 10 in get_stack
  File "/home/bfish/code/finsy/demo.py", line 21 in main
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/events.py", line 80 in _run
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_events.py", line 1911 in _run_once
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_events.py", line 607 in run_forever
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/base_events.py", line 640 in run_until_complete
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/runners.py", line 118 in run
  File "/home/bfish/.pyenv/versions/3.11.1-debug/lib/python3.11/asyncio/runners.py", line 190 in run
  File "/home/bfish/code/finsy/demo.py", line 24 in <module>

Extension modules: grpc._cython.cygrpc, google._upb._message (total: 2)
==34680== 
==34680== Process terminating with default action of signal 6 (SIGABRT)
==34680==    at 0x49E0A7C: __pthread_kill_implementation (pthread_kill.c:44)
==34680==    by 0x49E0A7C: __pthread_kill_internal (pthread_kill.c:78)
==34680==    by 0x49E0A7C: pthread_kill@@GLIBC_2.34 (pthread_kill.c:89)
==34680==    by 0x498C475: raise (raise.c:26)
==34680==    by 0x4E0986: faulthandler_fatal_error (faulthandler.c:385)
==34680==    by 0x498C51F: ??? (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==34680==    by 0x49E0A7B: __pthread_kill_implementation (pthread_kill.c:43)
==34680==    by 0x49E0A7B: __pthread_kill_internal (pthread_kill.c:78)
==34680==    by 0x49E0A7B: pthread_kill@@GLIBC_2.34 (pthread_kill.c:89)
==34680== 
==34680== HEAP SUMMARY:
==34680==     in use at exit: 6,558,722 bytes in 8,085 blocks
==34680==   total heap usage: 19,966 allocs, 11,881 frees, 31,248,068 bytes allocated
==34680== 
==34680== LEAK SUMMARY:
==34680==    definitely lost: 0 bytes in 0 blocks
==34680==    indirectly lost: 0 bytes in 0 blocks
==34680==      possibly lost: 6,088,930 bytes in 4,389 blocks
==34680==    still reachable: 469,792 bytes in 3,696 blocks
==34680==                       of which reachable via heuristic:
==34680==                         stdstring          : 2,575 bytes in 50 blocks
==34680==                         newarray           : 360 bytes in 1 blocks
==34680==         suppressed: 0 bytes in 0 blocks
==34680== Rerun with --leak-check=full to see details of leaked memory
==34680== 
==34680== For lists of detected and suppressed errors, rerun with: -s
==34680== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Aborted (core dumped)
byllyfish commented 1 year ago

Fixed in Python 3.11.2. (https://github.com/python/cpython/issues/99110.)