apache / pulsar-client-python

Apache Pulsar Python client library
https://pulsar.apache.org/
Apache License 2.0
51 stars 39 forks source link

Segfaults in Python 3.11 on UWSGI worker close #153

Open charlieevett opened 11 months ago

charlieevett commented 11 months ago

We are using Python Pulsar producer clients in a web app that runs in UWSGI. We have our settings set to terminate and relaunch workers after a certain number of requests. We recently upgrade to Python 3.11 and the newest Pulsar client (3.3.0). At that point we started to see segfaults as the worker process was exiting. Here is the traceback that gets logged:

!!! uWSGI process 85 got Segmentation Fault !!! *** backtrace of 85 *** uwsgi(uwsgi_backtrace+0x2f) [0x556ab933fcaf] uwsgi(uwsgi_segfault+0x23) [0x556ab9340083] /lib/x86_64-linux-gnu/libc.so.6(+0x3bfd0) [0x7f845c457fd0] /usr/local/lib/python3.11/site-packages/pulsar_client.libs/libpulsar-2dce0d9f.so(+0x2b387a) [0x7f845770087a] /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so(+0xaf752) [0x7f8457f99752] /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so(+0x5399e) [0x7f8457f3d99e] /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so(+0x545ef) [0x7f8457f3e5ef] /usr/local/lib/libpython3.11.so.1.0(+0x228f8a) [0x7f845c825f8a] /usr/local/lib/libpython3.11.so.1.0(+0x1f4467) [0x7f845c7f1467] /usr/local/lib/libpython3.11.so.1.0(+0x1f4030) [0x7f845c7f1030] /usr/local/lib/libpython3.11.so.1.0(+0x29941f) [0x7f845c89641f] /usr/local/lib/libpython3.11.so.1.0(+0x2993c5) [0x7f845c8963c5] /usr/local/lib/libpython3.11.so.1.0(+0x1ed587) [0x7f845c7ea587] /usr/local/lib/libpython3.11.so.1.0(_PyModule_ClearDict+0x115) [0x7f845c82bfe5] /usr/local/lib/libpython3.11.so.1.0(+0x278bb6) [0x7f845c875bb6] /usr/local/lib/libpython3.11.so.1.0(Py_FinalizeEx+0x12b) [0x7f845c8616ab] uwsgi(uwsgi_plugins_atexit+0x71) [0x556ab933e121] /lib/x86_64-linux-gnu/libc.so.6(+0x3e4dd) [0x7f845c45a4dd] /lib/x86_64-linux-gnu/libc.so.6(+0x3e61a) [0x7f845c45a61a] uwsgi(+0x3df6f) [0x556ab92f5f6f] uwsgi(simple_goodbye_cruel_world+0x59) [0x556ab933f619] uwsgi(+0x87658) [0x556ab933f658] uwsgi(uwsgi_close_request+0x4d9) [0x556ab92f6bf9] uwsgi(simple_loop_run+0xc0) [0x556ab933bbc0] uwsgi(simple_loop+0x7b) [0x556ab933bc8b] uwsgi(uwsgi_ignition+0x279) [0x556ab93403b9] uwsgi(uwsgi_worker_run+0x25b) [0x556ab93449db] uwsgi(uwsgi_run+0x43a) [0x556ab9344f2a] uwsgi(+0x3aede) [0x556ab92f2ede] /lib/x86_64-linux-gnu/libc.so.6(+0x271ca) [0x7f845c4431ca] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f845c443285] uwsgi(_start+0x21) [0x556ab92f2f01] *** end of backtrace ***

The resulting core files are filling up the server disks. We could ulimit them but we'd prefer to avoid that.

BewareMyPower commented 11 months ago

Is there a simple (hand-to-hand) way to reproduce? Could it be reproduced against 3.2.0?

charlieevett commented 11 months ago

We did try 3.2.0 with the same results. We see the issue frequently, but I'm not sure how to reproduce on command, since UWSGI is initiating the process shutdown.

charlieevett commented 11 months ago

It worked fine with Python 3.7 and the 3.0 client.

gromsterus commented 11 months ago

@charlieevett

  1. How do you run uwsgi? Could you share an example config?
  2. How do you initiate and close a pulsar client? Could you share a code example?
charlieevett commented 11 months ago

Sure, here is most of the uwsgi.ini file we use:

[uwsgi]
http = :$(PORT)
add-header = Connection: close
master = true
processes = $(PROCESSES)
die-on-term = true
memory-report = true
vacuum=True
log-x-forwarded-for=True
log-format = %(addr) - %(user) [%(ltime)] "%(method) %(uri) %(proto)" %(status) %(size) ("%(referer)" "%(uagent)" | %(msecs)
listen = 128
threads = true
max-requests = 750                  ; Restart workers after this many requests
max-worker-lifetime = 1800          ; Restart workers after this many seconds
reload-on-rss = 256                 ; Restart workers after this much resident memory
worker-reload-mercy = 60            ; How long to wait before forcefully killing workers

uwsgi is run using a very basic command in a Docker container:

$ uwsgi uwsgi.ini

The issue we're seeing is with the max-requests or max-worker-lifetime handling, when these limits are hit and uwsgi kills the worker process, it will often segfault during the exit. Relaunching process is fine, so the main issue is the file system filling up with core files.

gromsterus commented 11 months ago

I reproduced the bug. It's similar to https://github.com/unbit/uwsgi/issues/1969.

I see two possible fast solutions:

  1. Mv from dict state to object (code). Now segfaults gone:
    ...The work of process 32 is done. Seeya!
    worker 2 killed successfully (pid: 32)
    Respawned uWSGI worker 2 (new pid: 38)
  2. Mv from uwsgi to gunicorn (code, conf). Gunicorn is fast and simple server. In this case, it gives more control over the life cycle of workers with hooks.
charlieevett commented 11 months ago

Thank you! We'll look into the workarounds, both seem plausible.

ryanbagwell commented 11 months ago

@gromsterus Many thanks for the investigation and suggested workarounds. Just wanted to mention the following in case it's helpful:

Thread 2 (Thread 0x7f479172e7c0 (LWP 55)):
#0  0x00007f4793a040c9 in _int_free (av=0x7f4793b40c60 <main_arena>, p=<optimized out>, have_lock=<optimized out>, have_lock@entry=0) at ./malloc/malloc.c:4698
#1  0x00007f4793a06d9f in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3385
#2  0x00007f477dd7d869 in ?? () from /lib/x86_64-linux-gnu/libgdal.so.32
#3  0x00007f477dd816ab in std::map<char, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<char>, std::allocator<std::pair<char const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::~map() () from /lib/x86_64-linux-gnu/libgdal.so.32
#4  0x00007f47939ac4dd in __run_exit_handlers (status=1, listp=0x7f4793b40820 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:116
#5  0x00007f47939ac61a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:146
#6  0x0000561a24866f6f in uwsgi_exit ()
#7  0x0000561a248b10b2 in uwsgi_segfault ()
#8  <signal handler called>
#9  0x0000561a289e5730 in ?? ()
#10 0x00007f478ed8c87d in pulsar::ProducerImpl::~ProducerImpl() () from /usr/local/lib/python3.11/site-packages/pulsar_client.libs/libpulsar-2dce0d9f.so
#11 0x00007f478f625752 in pybind11::class_<pulsar::Producer>::dealloc(pybind11::detail::value_and_holder&) () from /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so
#12 0x00007f478f5c999e in pybind11::detail::clear_instance(_object*) () from /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so
#13 0x00007f478f5ca5ef in pybind11_object_dealloc () from /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so
#14 0x00007f4793d77f8a in _PyType_SUPPORTS_WEAKREFS (type=0x561a280e3bc0) at ./Include/internal/pycore_object.h:242
#15 PyObject_ClearWeakRefs (object=0x561a28244760) at Objects/weakrefobject.c:948
#16 0x00007f479400ad00 in _Py_TrueStruct () from /usr/local/lib/libpython3.11.so.1.0
#17 0x0000000000000000 in ?? ()