Closed: IamAbbey closed this issue 11 months ago.
Thanks for the report. This looks like a problem with Python itself, but we'll take a look to see if there's anything we can do to mitigate it from our end.
@Zylphrex it would be great if you could open a CPython issue and mention us if you conclude that there is nothing you are doing wrong on your end. I've pinged someone in https://github.com/python/cpython/issues/99110 who worked on similar issues in CPython 3.11 / 3.12, but so far I have received no response.
@zyv I haven't been able to identify anything that we can do from the SDK to avoid the segfault without disabling the profiler entirely.
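For anyone who needs a stopgap in the meantime, turning the profiler off looks roughly like this (a minimal sketch; the DSN and sample rates are placeholders):

```python
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    traces_sample_rate=1.0,
    # Leaving profiles_sample_rate unset (or setting it to 0) keeps the
    # profiler disabled, so the background sampler thread is never started.
    # profiles_sample_rate=1.0,
)
```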
Also, it would be great if you could provide some more info about your environment that may be helpful in reproducing this problem.
python -VV
pip freeze
Also some questions:
output of python -VV
Python 3.11.5 (main, Sep 7 2023, 18:53:16) [GCC 12.2.0]
your gunicorn config
gunicorn -b 0.0.0.0:8000 --workers 2 --threads 4 --keep-alive 5 config.wsgi --log-file - --access-logfile - --log-level debug --enable-stdio-inheritance
output of pip freeze
add-trailing-comma==3.0.1
amqp==5.1.1
annotated-types==0.5.0
asgiref==3.7.2
attrs==23.1.0
bcrypt==4.0.1
beautifulsoup4==4.12.2
billiard==4.1.0
black==23.7.0
boto3==1.28.37
botocore==1.31.37
Brotli==1.0.9
build==0.10.0
CacheControl==0.13.1
cachetools==5.3.1
celery==5.3.1
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
cleo==2.0.1
click==8.1.7
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
coverage==7.3.0
crashtest==0.4.1
cryptography==41.0.3
cssselect2==0.7.0
dataclasses-json==0.5.14
Deprecated==1.2.14
distlib==0.3.7
dj-database-url==2.1.0
Django==4.2.4
django-bootstrap-v5==1.0.11
django-cid==2.3
django-cors-headers==4.2.0
django-countries==7.5.1
django-debug-toolbar==4.2.0
django-oauth-toolkit==2.3.0
djangorestframework==3.14.0
dulwich==0.21.6
et-xmlfile==1.1.0
filelock==3.12.3
flake8==6.1.0
flake8-black==0.3.6
flake8-coding==1.3.2
flake8-isort==6.0.0
flake8-quotes==3.3.2
fonttools==4.42.1
gunicorn==21.2.0
html5lib==1.1
idna==3.4
importlib-metadata==6.8.0
iniconfig==2.0.0
installer==0.7.0
isort==5.12.0
jaraco.classes==3.3.0
jeepney==0.8.0
Jinja2==3.1.2
jmespath==1.0.1
jsonfield==3.1.0
jsonschema==4.17.3
jwcrypto==1.5.0
keyring==24.2.0
kombu==5.3.1
lxml==4.9.3
Markdown==3.4.4
MarkupSafe==2.1.3
marshmallow==3.20.1
mccabe==0.7.0
model-bakery==1.15.0
more-itertools==10.1.0
msgpack==1.0.5
mypy-extensions==1.0.0
oauthlib==3.2.2
openpyxl==3.1.2
oyaml==1.0
packaging==23.1
paramiko==3.3.1
pathspec==0.11.2
pexpect==4.8.0
Pillow==10.0.0
pkginfo==1.9.6
platformdirs==3.10.0
pluggy==1.3.0
poetry==1.6.1
poetry-core==1.7.0
poetry-plugin-export==1.5.0
prompt-toolkit==3.0.39
psycopg==3.1.10
psycopg-binary==3.1.10
ptyprocess==0.7.0
pycodestyle==2.11.0
pycparser==2.21
pydantic==2.3.0
pydantic_core==2.6.3
pydyf==0.7.0
pyflakes==3.1.0
pyhumps==3.8.0
PyJWT==2.8.0
PyNaCl==1.5.0
pyphen==0.14.0
pyproject_hooks==1.0.0
pyrsistent==0.19.3
PySocks==1.7.1
pytest==7.4.0
pytest-cov==4.1.0
pytest-django==4.5.2
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0.1
rapidfuzz==2.15.1
redis==5.0.0
requests==2.31.0
requests-mock==1.11.0
requests-oauthlib==1.3.1
requests-toolbelt==1.0.0
s3transfer==0.6.2
SecretStorage==3.3.3
sentry-sdk==1.30.0
shellingham==1.5.3
six==1.16.0
soupsieve==2.4.1
sqlparse==0.4.4
tinycss2==1.2.1
tokenize-rt==5.2.0
tomlkit==0.12.1
tqdm==4.66.1
trove-classifiers==2023.8.7
typing-inspect==0.9.0
typing_extensions==4.7.1
tzdata==2023.3
urllib3==1.26.16
vine==5.0.0
virtualenv==20.24.5
watchtower==3.0.1
wcwidth==0.2.6
weasyprint==59.0
webencodings==0.5.1
whitenoise==6.5.0
wrapt==1.15.0
youtrack-sdk @ git+ssh://git@github.com/moneymeets/youtrack-sdk.git@4071e614e997be70c219abc6d21e1c33d759f5a6
zipp==3.16.2
zopfli==0.2.2
@Zylphrex
- Does this happen on app startup? Or does it happen after some time?
No, this happens after some time, and so far we haven't been able to determine exactly when and why it happens.
As you can see, we run gunicorn with multiple threads (and the Sentry profiler apparently runs in a separate thread on top of that). The segfaults happen under load; if the load is low, they might not occur for days.
We are also getting these segfaults in Celery background workers. This was not the case while we were using gevent, but due to the extreme bugginess of gevent under Python 3.11 (https://github.com/gevent/gevent/issues/1985) we had to switch to threads there as well...
- Was an interrupt sent to the app?
If you mean SIGINT to gunicorn, then no, it's not a factor.
Hi, we are also experiencing a segfault in a multi-threaded context that appears to be coming from the Sentry profiler:
Traceback for thread 63 [Has the GIL] (most recent call last):
(Python) File "/usr/local/lib/python3.11/threading.py", line 995, in _bootstrap
self._bootstrap_inner()
(Python) File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
self.run()
(Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/threading.py", line 67, in run
return old_run_func(self, *a, **kw)
(Python) File "/usr/local/lib/python3.11/threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
(Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 914, in run
self.sampler()
(Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 814, in _sample_stack
sample = [
(Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 815, in <listcomp>
(str(tid), extract_stack(frame, cache, cwd))
(Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 276, in extract_stack
frame_ids = tuple(frame_id(raw_frame) for raw_frame in raw_frames)
(Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 276, in <genexpr>
frame_ids = tuple(frame_id(raw_frame) for raw_frame in raw_frames)
(Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 303, in frame_id
return (raw_frame.f_code.co_filename, raw_frame.f_lineno)
We're going to see if turning the profiler off stops it from happening.
Hey @fnmut! Thanks for commenting; please keep us posted.
We are also facing a similar issue. Even after disabling the profiler we still see it. It stops only when we disable Sentry entirely or disable the logger integration (loguru).
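For anyone else trying to narrow this down, here is a sketch of how the loguru integration can be taken out of the picture (assuming sentry-sdk 1.x; the right option depends on how the integration was enabled in the first place, and the DSN is a placeholder):

```python
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    # If LoguruIntegration() was listed here explicitly, simply remove it.
    integrations=[],
    # If it was picked up automatically, disabling auto-enabling integrations
    # is a blunt way to rule it out (note: this turns off *all* auto-enabled
    # integrations, not just loguru).
    auto_enabling_integrations=False,
)
```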
python -VV
Python 3.11.6 (main, Nov 29 2023, 04:47:02) [GCC 10.2.1 20210110]
pip freeze
aioboto3==10.4.0
aiobotocore==2.4.2
aiofile==3.8.8
aiofiles==23.2.1
aiohttp==3.9.1
aioitertools==0.11.0
aiokafka==0.9.0rc0
aiopath==0.6.11
aiosignal==1.3.1
alembic==1.12.1
anyio==3.7.1
arq==0.25.0
async-timeout==4.0.3
asyncpg==0.28.0
attrs==21.4.0
betterproto==2.0.0b6
black==23.11.0
boto3==1.24.59
botocore==1.27.59
cachetools==5.3.2
caio==0.9.13
cassandra-driver==3.28.0
certifi==2023.11.17
charset-normalizer==3.3.2
click==8.1.7
expiringdict==1.2.2
fastapi==0.95.2
fastapi-pagination==0.10.0
frozenlist==1.4.0
geomet==0.2.1.post1
google-api-core==2.14.0
google-api-python-client==2.109.0
google-auth==2.23.4
google-auth-httplib2==0.1.1
google-cloud-profiler==4.1.0
googleapis-common-protos==1.61.0
greenlet==3.0.1
grpc-stubs==1.53.0.3
grpcio==1.59.3
grpcio-tools==1.59.3
grpclib==0.4.6
h11==0.14.0
h2==4.1.0
hiredis==2.2.3
hpack==4.0.0
httplib2==0.22.0
httptools==0.6.1
hyperframe==6.0.1
idna==3.6
isort==5.12.0
Jinja2==3.1.2
jmespath==1.0.1
loguru==0.7.2
Mako==1.3.0
MarkupSafe==2.1.3
multidict==6.0.4
mypy-extensions==1.0.0
mypy-protobuf==3.5.0
packaging==23.2
pathspec==0.11.2
platformdirs==4.0.0
protobuf==4.25.1
psycopg2-binary==2.9.9
pyasn1==0.5.1
pyasn1-modules==0.3.0
pydantic==1.10.13
PyJWT==2.8.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-dotenv==0.21.1
PyYAML==5.3
redis==5.0.1
requests==2.31.0
rsa==4.9
s3transfer==0.6.2
sentry-sdk==1.38.0
six==1.16.0
sniffio==1.3.0
SQLAlchemy==1.4.41
sqlalchemy2-stubs==0.0.2a37
sqlmodel==0.0.8
starlette==0.27.0
tenacity==8.2.3
types-protobuf==4.24.0.4
typing_extensions==4.8.0
uritemplate==4.1.1
urllib3==1.26.18
uvicorn==0.24.0.post1
uvloop==0.19.0
watchfiles==0.21.0
websockets==10.4
wrapt==1.16.0
yarl==1.9.3
We were forced to upgrade to Python 3.12 to get rid of this issue. Sadly, we found no solution for Python 3.11, but with Python 3.12 the issue seems to be fixed.
@Zylphrex do you have any clue why this could happen in Python 3.11 but not in 3.12?
@antonpirker I've linked a related CPython issue for Python 3.11, and they fixed more of those in Python 3.12... but apparently nobody was so impressed as to backport these fixes to Python 3.11.
@iamyoudi have you tried upgrading to Python 3.12 to get rid of the problem?
Thanks @zyv
So I guess there is nothing we can do on our end.
@zyv mentioned the problem in the CPython repo: https://github.com/python/cpython/issues/99110
The workaround for this is to upgrade to Python 3.12.
I am closing this but will keep the comments open, so people can find this issue and post about future Python 3.11 improvements or other workarounds.
@antonpirker, I work with @fnmut. We've upgraded to Python 3.12 and turned the profiler back on... and we're seeing segfaults again. The latest stack trace is:
Traceback for thread 101 [Has the GIL] (most recent call last):
(Python) File "/usr/local/lib/python3.12/threading.py", line 1030, in _bootstrap
self._bootstrap_inner()
(Python) File "/usr/local/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
(Python) File "/usr/local/lib/python3.12/site-packages/sentry_sdk/integrations/threading.py", line 70, in run
return old_run_func(self, *a, **kw)
(Python) File "/usr/local/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
(Python) File "/usr/local/lib/python3.12/site-packages/sentry_sdk/profiler.py", line 943, in run
self.sampler()
(Python) File "/usr/local/lib/python3.12/site-packages/sentry_sdk/profiler.py", line 829, in _sample_stack
(str(tid), extract_stack(frame, cache, cwd))
(Python) File "/usr/local/lib/python3.12/site-packages/sentry_sdk/profiler.py", line 270, in extract_stack
f_back = raw_frame.f_back
executable: dramatiq
arguments: /usr/local/bin/python3.12 /usr/local/bin/dramatiq <task_file> -p 3 -t 12 -Q
Python version: 3.12.1
sentry-sdk 1.36.0
And the stack trace according to gdb:
warning: Build-id of /usr/local/bin/../lib/libpython3.12.so.1.0 does not match core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/bin/python3.12 /usr/local/bin/dramatiq baton_tms.tasks -p 3 -t 12 -Q'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fc63515cbe1 in local_clear (self=0x7fc634a57100) at Python/pystate.c:1968
1968 Python/pystate.c: No such file or directory.
[Current thread is 1 (Thread 0x7fc5db7fe6c0 (LWP 101))]
(gdb) bt
#0 0x00007fc63515cbe1 in local_clear (self=0x7fc634a57100) at Python/pystate.c:1968
#1 0x00007fc635421760 in PyFunction_Type () from /usr/local/bin/../lib/libpython3.12.so.1.0
#2 0x00007fc6350c99bd in _PyDictOrValues_GetValues (dorv=...) at ./Include/internal/pycore_object.h:389
#3 _PyObject_GenericGetAttrWithDict (obj=<unknown at remote 0x7fc635038f10>,
name=<type at remote 0x7fc635421760>, dict=<optimized out>, suppress=0) at Objects/object.c:1442
#4 0x00007fc6350ac7f5 in _PyObject_FastCallDictTstate (tstate=0x7fc61b2cf400, callable='f_back',
args=0x7fc635421760 <PyFrame_Type>, nargsf=140489272850848, kwargs=0x0) at Objects/call.c:149
#5 0x0000000000000000 in ?? ()
(gdb)
That's funny, it seems that the f_back stuff still wasn't completely fixed, but it works for us with threads. Are there any open CPython issues with respect to that?
What Python version are you running? I've been checking on this issue but I haven't found any new open issues in CPython that look related to this.
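For context on why f_back keeps showing up in these crashes: the profiler samples every thread's stack from a dedicated background thread. The sketch below is a heavily simplified illustration of that approach (not the SDK's actual code): the frames obtained from sys._current_frames() belong to other, still-running threads, and walking their f_back / f_code / f_lineno attributes concurrently is exactly the access pattern visible in the tracebacks above.

```python
# Heavily simplified sketch of sampling-based profiling from a background
# thread (illustrative only, not sentry_sdk's implementation).
import sys
import threading
import time

def sample_once():
    # Snapshot of the topmost frame of every thread in the process.
    for tid, frame in sys._current_frames().items():
        stack = []
        # These frame objects are owned by other, still-running threads;
        # walking f_back here is the access pattern seen in the crashes.
        while frame is not None:
            stack.append((frame.f_code.co_filename, frame.f_lineno))
            frame = frame.f_back
        # A real profiler would aggregate `stack` into a sample buffer here.

def sampler(interval=0.01):
    while True:
        sample_once()
        time.sleep(interval)

threading.Thread(target=sampler, daemon=True, name="profiler.sampler").start()
```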
How do you use Sentry?
Sentry Saas (sentry.io)
Version
1.30.0
Steps to Reproduce
The issue is hard to reproduce exactly, as it occurs only sporadically.
The issue occurs in our Django application, and whenever it happens the server crashes. Example of an extract from the log:
Worker (pid:33) was sent SIGSEGV!
The server start command is
gunicorn -b 0.0.0.0:8000 --workers 2 --threads 4 --keep-alive 5 config.wsgi --log-level info
From the logs attached below, it appears the server crashes at the point where Sentry profiling resulted in a
Segmentation fault
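For reference, profiling with the Django integration is typically enabled along these lines (a sketch only; the DSN and sample rates are placeholders, not the project's actual values):

```python
# settings.py (sketch; values are placeholders)
import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    integrations=[DjangoIntegration()],
    traces_sample_rate=0.1,
    # Setting profiles_sample_rate > 0 starts the background sampler thread
    # that shows up in the crashing tracebacks.
    profiles_sample_rate=0.1,
)
```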
Expected Result
The sentry-sdk should be able to perform the expected profiling correctly and report the profiles to Sentry.
Actual Result
Below are the logs from analysing the core dump with gdb: