confluentinc / confluent-kafka-python

Confluent's Kafka Python Client
http://docs.confluent.io/current/clients/confluent-kafka-python
Other
3.72k stars 882 forks source link

Segmentation fault on GC runs in multithreaded context #1761

Open progval opened 4 weeks ago

progval commented 4 weeks ago

Description

I am getting the following crash in very specific conditions, while the Python GC is running.

I can reliably reproduce it in a pytest test suite of an (unfortunately) non-trivial code base. Each test of the suite starts a producer, writes a few messages, then starts a multiprocessing.Manager, then a consumer and consumes topics to the end.

The third case to run segfaults when I manually trigger the GC in a specific place (or the fourth, when the GC eventually runs on its own).

==3552160== Invalid read of size 4
==3552160==    at 0x4A28B17: __pthread_clockjoin_ex (pthread_join_common.c:43)
==3552160==    by 0x4A2F96F: thrd_join@@GLIBC_2.34 (thrd_join.c:26)
==3552160==    by 0x774B3DB: rd_kafka_destroy_app (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552160==    by 0x75A86C3: Producer_dealloc (Producer.c:103)
==3552160==    by 0x5CCB14: UnknownInlinedFun (object.c:2390)
==3552160==    by 0x5CCB14: UnknownInlinedFun (object.h:538)
==3552160==    by 0x5CCB14: UnknownInlinedFun (dictobject.c:5566)
==3552160==    by 0x5CCB14: subtype_clear.lto_priv.0 (typeobject.c:1288)
==3552160==    by 0x50ABCD: UnknownInlinedFun (gcmodule.c:1013)
==3552160==    by 0x50ABCD: gc_collect_main.lto_priv.0 (gcmodule.c:1287)
==3552160==    by 0x6303E6: gc_collect_with_callback.lto_priv.0 (gcmodule.c:1400)
==3552160==    by 0x6756CF: UnknownInlinedFun (gcmodule.c:1538)
==3552160==    by 0x6756CF: gc_collect.lto_priv.0 (gcmodule.c.h:110)
==3552160==    by 0x53ACF6: cfunction_vectorcall_FASTCALL_KEYWORDS.lto_priv.0 (methodobject.c:443)
==3552160==    by 0x53AC2B: UnknownInlinedFun (pycore_call.h:92)
==3552160==    by 0x53AC2B: PyObject_Vectorcall (call.c:299)
==3552160==    by 0x52B93F: _PyEval_EvalFrameDefault (ceval.c:4772)
==3552160==    by 0x5855A3: UnknownInlinedFun (pycore_ceval.h:73)
==3552160==    by 0x5855A3: UnknownInlinedFun (ceval.c:6435)
==3552160==    by 0x5855A3: UnknownInlinedFun (call.c:393)
==3552160==    by 0x5855A3: _PyObject_VectorcallTstate.lto_priv.0 (pycore_call.h:92)
==3552160==  Address 0x1201f990 is not stack'd, malloc'd or (recently) free'd
==3552160== 
Fatal Python error: Segmentation fault

Current thread 0x0000000004b80880 (most recent call first):
  Garbage-collecting
  File "/home/dev/swh-environment/swh-dataset/build/__editable__.swh.dataset-1.5.2.dev7+g4fccea4.d20240523-py3-none-any/swh/dataset/journalprocessor.py", line 348 in progress_worker
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 257 in _process_worker
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
  File "/usr/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
  File "/usr/lib/python3.11/multiprocessing/process.py", line 121 in start
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 761 in _spawn_process
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 750 in _launch_processes
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 723 in _start_executor_manager_thread
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 785 in submit
  File "/home/dev/swh-environment/swh-dataset/build/__editable__.swh.dataset-1.5.2.dev7+g4fccea4.d20240523-py3-none-any/swh/dataset/journalprocessor.py", line 316 in run
  File "/home/dev/swh-environment/swh-dataset/swh/dataset/test/test_journal_processor.py", line 170 in test_parallel_journal_processor_origin_visit_status
  File "/home/dev/swh-environment/swh-dataset/swh/dataset/test/test_journal_processor.py", line 50 in newf
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/python.py", line 1772 in runtest
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 172 in pytest_runtest_call
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 240 in <lambda>
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 340 in from_call
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 239 in call_and_report
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 134 in runtestprotocol
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 115 in pytest_runtest_protocol
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 364 in pytest_runtestloop
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 339 in _main
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 285 in wrap_session
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/config/__init__.py", line 174 in main
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/config/__init__.py", line 197 in console_main
  File "/home/dev/.local/bin/pytest", line 8 in <module>

Extension modules: confluent_kafka.cimpl, msgpack._cmsgpack, psutil._psutil_linux, psutil._psutil_posix, psycopg_c.pq, psycopg_c._psycopg, markupsafe._speedups, _brotli, simplejson._speedups, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, tornado.speedups, plyvel._plyvel (total: 20)
==3552160== 
==3552160== Process terminating with default action of signal 11 (SIGSEGV)
==3552160==    at 0x4A28E2C: __pthread_kill_implementation (pthread_kill.c:44)
==3552160==    by 0x49D9FB1: raise (raise.c:26)
==3552160==    by 0x49DA04F: ??? (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==3552160==    by 0x4A28B16: __pthread_clockjoin_ex (pthread_join_common.c:43)
==3552160== 
==3552160== HEAP SUMMARY:
==3552160==     in use at exit: 29,294,834 bytes in 30,069 blocks
==3552160==   total heap usage: 219,797 allocs, 189,728 frees, 522,042,917 bytes allocated
==3552160== 
==3552160== LEAK SUMMARY:
==3552160==    definitely lost: 4,547 bytes in 25 blocks
==3552160==    indirectly lost: 12,610 bytes in 30 blocks
==3552160==      possibly lost: 423,923 bytes in 214 blocks
==3552160==    still reachable: 28,853,754 bytes in 29,800 blocks
==3552160==         suppressed: 0 bytes in 0 blocks
==3552160== Rerun with --leak-check=full to see details of leaked memory
==3552160== 
==3552160== For lists of detected and suppressed errors, rerun with: -s
==3552160== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==3552159== 
==3552159== Process terminating with default action of signal 15 (SIGTERM)
==3552159==    at 0x4A23E96: __futex_abstimed_wait_common64 (futex-internal.c:57)
==3552159==    by 0x4A23E96: __futex_abstimed_wait_common (futex-internal.c:87)
==3552159==    by 0x4A2683B: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==3552159==    by 0x4A2683B: pthread_cond_timedwait@@GLIBC_2.3.2 (pthread_cond_wait.c:643)
==3552159==    by 0x4A2F5D8: cnd_timedwait@@GLIBC_2.34 (cnd_timedwait.c:27)
==3552159==    by 0x7782B10: rd_kafka_q_serve_rkmessages (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552159==    by 0x7743C3B: rd_kafka_consume_batch_queue (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552159==    by 0x75A5BB6: Consumer_consume (Consumer.c:1056)
==3552159==    by 0x550BBD: method_vectorcall_VARARGS_KEYWORDS.lto_priv.0 (descrobject.c:364)
==3552159==    by 0x53AC2B: UnknownInlinedFun (pycore_call.h:92)
==3552159==    by 0x53AC2B: PyObject_Vectorcall (call.c:299)
==3552159==    by 0x52B93F: _PyEval_EvalFrameDefault (ceval.c:4772)
==3552159==    by 0x5855A3: UnknownInlinedFun (pycore_ceval.h:73)
==3552159==    by 0x5855A3: UnknownInlinedFun (ceval.c:6435)
==3552159==    by 0x5855A3: UnknownInlinedFun (call.c:393)
==3552159==    by 0x5855A3: _PyObject_VectorcallTstate.lto_priv.0 (pycore_call.h:92)
==3552159==    by 0x585173: method_vectorcall.lto_priv.0 (classobject.c:59)
==3552159==    by 0x56A380: UnknownInlinedFun (call.c:257)
==3552159==    by 0x56A380: UnknownInlinedFun (call.c:328)
==3552159==    by 0x56A380: PyObject_Call (call.c:355)
==3552159== 
==3552159== HEAP SUMMARY:
==3552159==     in use at exit: 29,630,515 bytes in 30,288 blocks
==3552159==   total heap usage: 220,309 allocs, 190,021 frees, 522,280,354 bytes allocated
==3552159== 
==3552143== 
==3552143== HEAP SUMMARY:
==3552143==     in use at exit: 29,243,757 bytes in 29,955 blocks
==3552143==   total heap usage: 218,912 allocs, 188,957 frees, 522,173,350 bytes allocated
==3552143== 
==3552143== LEAK SUMMARY:
==3552143==    definitely lost: 576 bytes in 18 blocks
==3552143==    indirectly lost: 0 bytes in 0 blocks
==3552143==      possibly lost: 371,032 bytes in 200 blocks
==3552143==    still reachable: 28,872,149 bytes in 29,737 blocks
==3552143==         suppressed: 0 bytes in 0 blocks
==3552143== Rerun with --leak-check=full to see details of leaked memory
==3552143== 
==3552143== For lists of detected and suppressed errors, rerun with: -s
==3552143== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==3552159== LEAK SUMMARY:
==3552159==    definitely lost: 608 bytes in 19 blocks
==3552159==    indirectly lost: 0 bytes in 0 blocks
==3552159==      possibly lost: 676,795 bytes in 245 blocks
==3552159==    still reachable: 28,953,112 bytes in 30,024 blocks
==3552159==         suppressed: 0 bytes in 0 blocks
==3552159== Rerun with --leak-check=full to see details of leaked memory
==3552159== 
==3552159== For lists of detected and suppressed errors, rerun with: -s
==3552159== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

How to reproduce

I'm afraid I was not able to reproduce it with a simple script.

Clone https://gitlab.softwareheritage.org/swh/devel/swh-dataset.git , check out f8b5c8665ac802f8b9dd15feb11dad2c2eadaee3, then applying this patch (segfaults also happen without it, but it makes them more deterministic):

diff --git a/swh/dataset/journalprocessor.py b/swh/dataset/journalprocessor.py
index ab62cb1..87bdb32 100644
--- a/swh/dataset/journalprocessor.py
+++ b/swh/dataset/journalprocessor.py
@@ -24,6 +24,7 @@
     Set,
     Tuple,
 )
+import gc

 from confluent_kafka import Message, TopicPartition
 import tqdm
@@ -344,6 +345,7 @@ def progress_worker(self, queue=None) -> None:
         active_workers = self.processes
         offset_diff = sum((hi - lo) for lo, hi in self.offsets.values())
         desc = "  - Export"
+        gc.collect()
         with tqdm.tqdm(total=offset_diff, desc=desc, unit_scale=True) as pbar:
             while active_workers:
                 item = queue.get()
Installed Python packages ``` androguard==3.4.0a1 apksigcopier==1.1.1 appdirs==1.4.4 argcomplete==2.0.0 asn1crypto==1.5.1 asttokens==2.2.1 attrs==22.2.0 backcall==0.2.0 beautifulsoup4==4.11.2 beniget==0.4.1 binwalk==2.3.3 black==22.10.0 Brotli==1.0.9 cassandra-pylib==0.0.0 certifi==2022.9.24 cfgv==3.3.1 chardet==5.1.0 charset-normalizer==3.0.1 click==8.1.3 colorama==0.4.6 contourpy==1.0.7 cryptography==38.0.4 cssselect==1.2.0 cupshelpers==1.0 cycler==0.11.0 dbus-python==1.3.2 decorator==5.1.1 defusedxml==0.7.1 diffoscope==240 distlib==0.3.6 distro==1.8.0 dulwich==0.21.2 EditorConfig==0.12.3 elementpath==2.5.3 enjarify==1.0.3 executing==1.2.0 fastimport==0.9.14 filelock==3.9.0 fonttools==4.38.0 fs==2.4.16 gast==0.5.2 GDAL==3.6.2 gunicorn==20.1.0 html5lib==1.1 httpie==3.2.1 httplib2==0.20.4 identify==2.5.17 idna==3.3 importlib-metadata==4.12.0 iotop==0.6 ipython==8.5.0 jedi==0.18.2 jsbeautifier==1.14.4 jsondiff==2.0.0 kiwisolver==0.0.0 libarchive-c==2.9 libevdev==0.5 lxml==4.9.2 lz4==4.0.2+dfsg markdown-it-py==2.1.0 matplotlib==3.6.3 matplotlib-inline==0.1.6 mdurl==0.1.2 mechanize==0.4.8 more-itertools==8.10.0 mpmath==0.0.0 msgpack==1.0.3 multidict==6.0.4 mypy-extensions==1.0.0 netifaces==0.11.0 networkx==2.8.8 nodeenv==0.13.4 numpy==1.24.2 olefile==0.46 oscrypto==1.3.0 packaging==23.0 parso==0.8.3 pathspec==0.11.1 pbr==5.10.0 pdfminer.six==20221105 pexpect==4.8.0 pickleshare==0.7.5 Pillow==9.4.0 platformdirs==2.6.0 ply==3.11 powerline-status==2.8.3 pre-commit==3.0.4 progressbar==2.5 prompt-toolkit==3.0.36 psutil==5.9.4 ptyprocess==0.7.0 pure-eval==0.0.0 pycairo==1.20.1 pycryptodomex==3.11.0 pycups==2.0.1 pycurl==7.45.2 pydot==1.4.2 Pygments==2.14.0 PyGObject==3.42.2 pygraphviz==1.7 PyOpenGL==3.1.6 pyparsing==3.0.9 pypdf==3.4.1 pyperclip==1.8.2 pyprof2calltree==1.4.5 PyQt5==5.15.9 PyQt5-sip==12.11.1 PyQt6==6.4.2 PyQt6-sip==13.4.1 pyqtgraph==0.13.1 PySimpleSOAP==1.16.2 pysmbc==1.0.23 PySocks==1.7.1 pyte==0.8.0 python-apt==2.6.0 python-dateutil==2.8.2 python-debian==0.1.49 python-debianbts==4.0.1 python-magic==0.4.26 pythran==0.11.0 pytz==2022.7.1 pyudev==0.24.0 pyxattr==0.8.1 PyYAML==6.0 reportbug==12.0.0 requests==2.28.1 requests-toolbelt==0.10.1 rich==13.3.1 rpm==4.18.0 rubber==1.6.0 scipy==1.10.1 scour==0.38.2 six==1.16.0 soupsieve==2.3.2 stack-data==0.6.2 stevedore==4.0.2 sympy==1.11.1 tlsh==0.2.0 tqdm==4.64.1 traitlets==5.5.0 ufoLib2==0.14.0 unicodedata2==15.0.0 urllib3==1.26.12 virtualenv==20.17.1+ds virtualenv-clone==0.3.0 virtualenvwrapper==4.8.4 wcwidth==0.2.5 webencodings==0.5.1 xkcdpass==1.19.3 xmlschema==1.10.0 xmltodict==0.13.0 zipp==1.0.0 zstd==1.5.2.5 ```

Checklist

Please provide the following information:

KudoKhang commented 2 weeks ago

Interested 🗿