getsentry / sentry-python

The official Python SDK for Sentry.io
https://sentry.io/for/python/
MIT License

Segmentation fault in sentry profiler #2386

Closed IamAbbey closed 11 months ago

IamAbbey commented 1 year ago

How do you use Sentry?

Sentry Saas (sentry.io)

Version

1.30.0

Steps to Reproduce

The issue is hard to reproduce exactly, as it occurs only sporadically.

The issue occurs in our Django application, and whenever it happens the server crashes. Example extract from the log: Worker (pid:33) was sent SIGSEGV!

The server start command is gunicorn -b 0.0.0.0:8000 --workers 2 --threads 4 --keep-alive 5 config.wsgi --log-level info

From the logs attached below, it appears the server crashes at the point where Sentry profiling triggers a segmentation fault.

Python Version: Python 3.11.5

Sentry SDK settings:

sentry_sdk.init(
    dsn=get_env_value("SENTRY_DSN"),
    # captures 100% of transactions for performance monitoring.
    # but we make further decisions in `before_send_transaction`
    traces_sample_rate=1.0,
    # profile 100% of sampled transactions.
    profiles_sample_rate=1.0,
    before_send_transaction=sentry_before_send_transaction,
    integrations=[DjangoIntegration()],
    # Associate django.contrib.auth users to errors
    send_default_pii=True,
    environment=get_env_value("SENTRY_ENVIRONMENT"),
)

Expected Result

The sentry-sdk should perform profiling correctly and report the profiles to Sentry.

Actual Result

Below is the output from analysing the core dump with gdb:

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 273, in extract_stack
    f_back = raw_frame.f_back
  File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 832, in <listcomp>
    (str(tid), extract_stack(frame, cache, cwd))
  File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 831, in _sample_stack
    sample = [
  File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 931, in run
    self.sampler()
  File "/usr/local/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/threading.py", line 70, in run
    return old_run_func(self, *a, **kw)
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 995, in _bootstrap
    self._bootstrap_inner()
(gdb) py-list
 268        """
 269    
 270        raw_frames = deque(maxlen=max_stack_depth)  # type: Deque[FrameType]
 271    
 272        while raw_frame is not None:
>273            f_back = raw_frame.f_back
 274            raw_frames.append(raw_frame)
 275            raw_frame = f_back
 276    
 277        frame_ids = tuple(frame_id(raw_frame) for raw_frame in raw_frames)
 278        frames = []
#0  0x0000ffffa0263a8c in _PyFrame_IsIncomplete (frame=0xffff9fa9b2f8) at ./Include/internal/pycore_frame.h:151
#1  PyFrame_GetBack (frame=<optimized out>) at Objects/frameobject.c:1328
#2  0x0000ffffa027073c in frame_getback (f=<optimized out>, closure=<optimized out>) at Objects/frameobject.c:103
#3  0x0000ffffa0195730 in _PyObject_GenericGetAttrWithDict
    (obj=Frame 0xffff9fa9b378, for file /usr/local/lib/python3.11/site-packages/typing_inspect.py, line 232, in is_union_type (tp=<type at remote 0xffffa04568d8>), name='f_back', dict=0x0, suppress=0) at ./Include/object.h:133
#4  0x0000ffffa017915c in PyObject_GenericGetAttr (name='f_back', obj=Frame 0xffff9fa9b378, for file /usr/local/lib/python3.11/site-packages/typing_inspect.py, line 232, in is_union_type (tp=<type at remote 0xffffa04568d8>))
    at Objects/object.c:1368
#5  PyObject_GetAttr (v=Frame 0xffff9fa9b378, for file /usr/local/lib/python3.11/site-packages/typing_inspect.py, line 232, in is_union_type (tp=<type at remote 0xffffa04568d8>), name='f_back') at Objects/object.c:916
#6  0x0000ffffa0182870 in _PyEval_EvalFrameDefault (tstate=0xffff90001ed0, frame=0xffff9fe18428, throwflag=-1614065056) at Python/ceval.c:3466
#7  0x0000ffffa01805d4 in _PyEval_EvalFrame (throwflag=0, frame=0xffff9fe18230, tstate=0xffff90001ed0) at ./Include/internal/pycore_ceval.h:73
#8  _PyEval_Vector (tstate=0xffff90001ed0, func=func@entry=0xffff9d1ecae0, locals=locals@entry=0x0, args=args@entry=0xffff8fffe4f8, argcount=argcount@entry=1, kwnames=kwnames@entry=0x0) at Python/ceval.c:6439
#9  0x0000ffffa01b8b1c in _PyFunction_Vectorcall (kwnames=0x0, nargsf=1, stack=0xffff8fffe4f8, func=<function at remote 0xffff9d1ecae0>) at Objects/call.c:393
#10 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=1, args=0xffff8fffe4f8, callable=<function at remote 0xffff9d1ecae0>, tstate=0xffff90001ed0) at ./Include/internal/pycore_call.h:92
#11 method_vectorcall (method=<optimized out>, args=0xffffa04886f8 <_PyRuntime+58944>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/classobject.c:67
#12 0x0000ffffa0184414 in _PyEval_EvalFrameDefault (tstate=0xffff90001ed0, frame=0xffff9fe181b8, throwflag=-1614065056) at Python/ceval.c:5381
#13 0x0000ffffa01805d4 in _PyEval_EvalFrame (throwflag=0, frame=0xffff9fe181b8, tstate=0xffff90001ed0) at ./Include/internal/pycore_ceval.h:73
#14 _PyEval_Vector (tstate=0xffff90001ed0, func=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:6439
#15 0x0000ffffa0184414 in _PyEval_EvalFrameDefault (tstate=0xffff90001ed0, frame=0xffff9fe18110, throwflag=-1614065056) at Python/ceval.c:5381
#16 0x0000ffffa01805d4 in _PyEval_EvalFrame (throwflag=0, frame=0xffff9fe18020, tstate=0xffff90001ed0) at ./Include/internal/pycore_ceval.h:73
#17 _PyEval_Vector (tstate=0xffff90001ed0, func=func@entry=0xffff9f68a020, locals=locals@entry=0x0, args=args@entry=0xffff8fffe8c8, argcount=argcount@entry=1, kwnames=kwnames@entry=0x0) at Python/ceval.c:6439
#18 0x0000ffffa01b8b1c in _PyFunction_Vectorcall (kwnames=0x0, nargsf=1, stack=0xffff8fffe8c8, func=<function at remote 0xffff9f68a020>) at Objects/call.c:393
#19 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=1, args=0xffff8fffe8c8, callable=<function at remote 0xffff9f68a020>, tstate=0xffff90001ed0) at ./Include/internal/pycore_call.h:92
#20 method_vectorcall (method=<optimized out>, args=0xffffa04886f8 <_PyRuntime+58944>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/classobject.c:67
#21 0x0000ffffa026b230 in thread_run (boot_raw=0xffff95890f90) at ./Modules/_threadmodule.c:1092
#22 0x0000ffffa026b1d0 in pythread_wrapper (arg=<optimized out>) at Python/thread_pthread.h:241
#23 0x0000ffff9fe9ee18 in start_thread (arg=0xffff9555d537) at ./nptl/pthread_create.c:442
#24 0x0000ffff9ff07e9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79
Zylphrex commented 1 year ago

Thanks for the report. This looks like a problem with Python, but we'll take a look to see if there's anything we can do to mitigate it from our end.

zyv commented 1 year ago

@Zylphrex it would be great if you could open a CPython issue and mention us if you conclude that there is nothing you are doing wrong on your end. I've pinged someone in https://github.com/python/cpython/issues/99110 who worked on similar issues in CPython 3.11 / 3.12, but so far have gotten no response.

Zylphrex commented 1 year ago

@zyv I haven't been able to identify anything that we can do from the SDK to avoid the segfault without disabling the profiler entirely.
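
For reference, disabling only the profiler while keeping error reporting and tracing comes down to not setting a profiles sample rate; a minimal sketch, assuming the same Django setup as in the report above (the DSN is a placeholder):

import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn="...",  # your real DSN
    traces_sample_rate=1.0,  # tracing stays on
    # profiles_sample_rate is intentionally left unset, so profiling stays
    # off and the sampler thread crashing in extract_stack never starts.
    integrations=[DjangoIntegration()],
)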

Also, it would be great if you could provide some more info about your environment that may be helpful in reproducing this problem.

  1. output of python -VV
  2. your gunicorn config
  3. output of pip freeze

Also some questions:

  1. Does this happen on app startup? Or does it happen after some time?
  2. Was an interrupt sent to the app?
catcombo commented 1 year ago

output of python -VV

Python 3.11.5 (main, Sep 7 2023, 18:53:16) [GCC 12.2.0]

your gunicorn config

gunicorn -b 0.0.0.0:8000 --workers 2 --threads 4 --keep-alive 5 config.wsgi --log-file - --access-logfile - --log-level debug --enable-stdio-inheritance

output of pip freeze

add-trailing-comma==3.0.1 amqp==5.1.1 annotated-types==0.5.0 asgiref==3.7.2 attrs==23.1.0 bcrypt==4.0.1 beautifulsoup4==4.12.2 billiard==4.1.0 black==23.7.0 boto3==1.28.37 botocore==1.31.37 Brotli==1.0.9 build==0.10.0 CacheControl==0.13.1 cachetools==5.3.1 celery==5.3.1 certifi==2023.7.22 cffi==1.15.1 charset-normalizer==3.2.0 cleo==2.0.1 click==8.1.7 click-didyoumean==0.3.0 click-plugins==1.1.1 click-repl==0.3.0 coverage==7.3.0 crashtest==0.4.1 cryptography==41.0.3 cssselect2==0.7.0 dataclasses-json==0.5.14 Deprecated==1.2.14 distlib==0.3.7 dj-database-url==2.1.0 Django==4.2.4 django-bootstrap-v5==1.0.11 django-cid==2.3 django-cors-headers==4.2.0 django-countries==7.5.1 django-debug-toolbar==4.2.0 django-oauth-toolkit==2.3.0 djangorestframework==3.14.0 dulwich==0.21.6 et-xmlfile==1.1.0 filelock==3.12.3 flake8==6.1.0 flake8-black==0.3.6 flake8-coding==1.3.2 flake8-isort==6.0.0 flake8-quotes==3.3.2 fonttools==4.42.1 gunicorn==21.2.0 html5lib==1.1 idna==3.4 importlib-metadata==6.8.0 iniconfig==2.0.0 installer==0.7.0 isort==5.12.0 jaraco.classes==3.3.0 jeepney==0.8.0 Jinja2==3.1.2 jmespath==1.0.1 jsonfield==3.1.0 jsonschema==4.17.3 jwcrypto==1.5.0 keyring==24.2.0 kombu==5.3.1 lxml==4.9.3 Markdown==3.4.4 MarkupSafe==2.1.3 marshmallow==3.20.1 mccabe==0.7.0 model-bakery==1.15.0 more-itertools==10.1.0 msgpack==1.0.5 mypy-extensions==1.0.0 oauthlib==3.2.2 openpyxl==3.1.2 oyaml==1.0 packaging==23.1 paramiko==3.3.1 pathspec==0.11.2 pexpect==4.8.0 Pillow==10.0.0 pkginfo==1.9.6 platformdirs==3.10.0 pluggy==1.3.0 poetry==1.6.1 poetry-core==1.7.0 poetry-plugin-export==1.5.0 prompt-toolkit==3.0.39 psycopg==3.1.10 psycopg-binary==3.1.10 ptyprocess==0.7.0 pycodestyle==2.11.0 pycparser==2.21 pydantic==2.3.0 pydantic_core==2.6.3 pydyf==0.7.0 pyflakes==3.1.0 pyhumps==3.8.0 PyJWT==2.8.0 PyNaCl==1.5.0 pyphen==0.14.0 pyproject_hooks==1.0.0 pyrsistent==0.19.3 PySocks==1.7.1 pytest==7.4.0 pytest-cov==4.1.0 pytest-django==4.5.2 python-dateutil==2.8.2 pytz==2023.3 PyYAML==6.0.1 rapidfuzz==2.15.1 redis==5.0.0 requests==2.31.0 requests-mock==1.11.0 requests-oauthlib==1.3.1 requests-toolbelt==1.0.0 s3transfer==0.6.2 SecretStorage==3.3.3 sentry-sdk==1.30.0 shellingham==1.5.3 six==1.16.0 soupsieve==2.4.1 sqlparse==0.4.4 tinycss2==1.2.1 tokenize-rt==5.2.0 tomlkit==0.12.1 tqdm==4.66.1 trove-classifiers==2023.8.7 typing-inspect==0.9.0 typing_extensions==4.7.1 tzdata==2023.3 urllib3==1.26.16 vine==5.0.0 virtualenv==20.24.5 watchtower==3.0.1 wcwidth==0.2.6 weasyprint==59.0 webencodings==0.5.1 whitenoise==6.5.0 wrapt==1.15.0 youtrack-sdk @ git+ssh://git@github.com/moneymeets/youtrack-sdk.git@4071e614e997be70c219abc6d21e1c33d759f5a6 zipp==3.16.2 zopfli==0.2.2

zyv commented 1 year ago

@Zylphrex

  1. Does this happen on app startup? Or does it happen after some time?

No, this happens after some time, and so far we haven't been able to determine when and why exactly it happens.

As you can see, we run gunicorn with multiple threads (and apparently the Sentry profiler runs in a separate thread on top of that). The segfaults happen under load; if the load is low, they might not happen for days.

We are also getting these segfaults in Celery background workers. This was not the case while we were using gevent, but due to the extreme bugginess of gevent under Python 3.11 (https://github.com/gevent/gevent/issues/1985) we had to switch to threads there as well...

  1. Was an interrupt sent to the app?

If you mean SIGINT to gunicorn, then no, it's not a factor.

fnmut commented 1 year ago

Hi, we are also experiencing a segfault in a multi-threaded context that appears to be coming from the Sentry profiler:

Traceback for thread 63 [Has the GIL] (most recent call last):
    (Python) File "/usr/local/lib/python3.11/threading.py", line 995, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
        self.run()
    (Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/threading.py", line 67, in run
        return old_run_func(self, *a, **kw)
    (Python) File "/usr/local/lib/python3.11/threading.py", line 975, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 914, in run
        self.sampler()
    (Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 814, in _sample_stack
        sample = [
    (Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 815, in <listcomp>
        (str(tid), extract_stack(frame, cache, cwd))
    (Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 276, in extract_stack
        frame_ids = tuple(frame_id(raw_frame) for raw_frame in raw_frames)
    (Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 276, in <genexpr>
        frame_ids = tuple(frame_id(raw_frame) for raw_frame in raw_frames)
    (Python) File "/usr/local/lib/python3.11/site-packages/sentry_sdk/profiler.py", line 303, in frame_id
        return (raw_frame.f_code.co_filename, raw_frame.f_lineno)

We're going to see if turning the profiler off stops it from happening.

antonpirker commented 1 year ago

Hey @fnmut ! Thanks for commenting, please keep us posted.

iamyoudi commented 11 months ago

We are also facing a similar issue. Even after disabling the profiler we still see it. It stops only when we disable Sentry entirely or disable the logger integration (loguru).
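
For anyone who wants to try the same isolation without dropping Sentry entirely, a minimal sketch of keeping the SDK but leaving the Loguru integration out; whether LoguruIntegration was added explicitly or auto-enabled depends on your SDK version and setup, so both knobs below are assumptions to verify against your own config:

import sentry_sdk

sentry_sdk.init(
    dsn="...",  # your real DSN
    traces_sample_rate=1.0,
    # If LoguruIntegration() was passed explicitly before, simply omit it here.
    integrations=[],
    # If your SDK version auto-enables integrations, this switch turns all of
    # them off; any you still want (e.g. FastAPI/Starlette) must be re-added
    # explicitly in the integrations list above.
    auto_enabling_integrations=False,
)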

python -VV

Python 3.11.6 (main, Nov 29 2023, 04:47:02) [GCC 10.2.1 20210110]

pip freeze

aioboto3==10.4.0
aiobotocore==2.4.2
aiofile==3.8.8
aiofiles==23.2.1
aiohttp==3.9.1
aioitertools==0.11.0
aiokafka==0.9.0rc0
aiopath==0.6.11
aiosignal==1.3.1
alembic==1.12.1
anyio==3.7.1
arq==0.25.0
async-timeout==4.0.3
asyncpg==0.28.0
attrs==21.4.0
betterproto==2.0.0b6
black==23.11.0
boto3==1.24.59
botocore==1.27.59
cachetools==5.3.2
caio==0.9.13
cassandra-driver==3.28.0
certifi==2023.11.17
charset-normalizer==3.3.2
click==8.1.7
expiringdict==1.2.2
fastapi==0.95.2
fastapi-pagination==0.10.0
frozenlist==1.4.0
geomet==0.2.1.post1
google-api-core==2.14.0
google-api-python-client==2.109.0
google-auth==2.23.4
google-auth-httplib2==0.1.1
google-cloud-profiler==4.1.0
googleapis-common-protos==1.61.0
greenlet==3.0.1
grpc-stubs==1.53.0.3
grpcio==1.59.3
grpcio-tools==1.59.3
grpclib==0.4.6
h11==0.14.0
h2==4.1.0
hiredis==2.2.3
hpack==4.0.0
httplib2==0.22.0
httptools==0.6.1
hyperframe==6.0.1
idna==3.6
isort==5.12.0
Jinja2==3.1.2
jmespath==1.0.1
loguru==0.7.2
Mako==1.3.0
MarkupSafe==2.1.3
multidict==6.0.4
mypy-extensions==1.0.0
mypy-protobuf==3.5.0
packaging==23.2
pathspec==0.11.2
platformdirs==4.0.0
protobuf==4.25.1
psycopg2-binary==2.9.9
pyasn1==0.5.1
pyasn1-modules==0.3.0
pydantic==1.10.13
PyJWT==2.8.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-dotenv==0.21.1
PyYAML==5.3
redis==5.0.1
requests==2.31.0
rsa==4.9
s3transfer==0.6.2
sentry-sdk==1.38.0
six==1.16.0
sniffio==1.3.0
SQLAlchemy==1.4.41
sqlalchemy2-stubs==0.0.2a37
sqlmodel==0.0.8
starlette==0.27.0
tenacity==8.2.3
types-protobuf==4.24.0.4
typing_extensions==4.8.0
uritemplate==4.1.1
urllib3==1.26.18
uvicorn==0.24.0.post1
uvloop==0.19.0
watchfiles==0.21.0
websockets==10.4
wrapt==1.16.0
yarl==1.9.3
zyv commented 11 months ago

We were forced to upgrade to Python 3.12 to get rid of this issue. Sadly, we found no solution for Python 3.11, but with Python 3.12 the issue seems to be fixed.

antonpirker commented 11 months ago

@Zylphrex do you have any clue why this could happen in Python 3.11 but not in 3.12?

zyv commented 11 months ago

@antonpirker I've linked a related CPython issue in Python 3.11 and they fixed more of those in Python 3.12... but apparently nobody was so impressed as to backport these fixes to Python 3.11.

antonpirker commented 11 months ago

@iamyoudi have you tried upgrading to Python 3.12 to get rid of the problem?

antonpirker commented 11 months ago

Thanks @zyv

So I guess there is nothing we can do on our end.

@zyv mentioned the problem in the CPython repo: https://github.com/python/cpython/issues/99110

The workaround for this is to upgrade to Python 3.12.

I am closing this issue but will keep the comments open, so people can find it and post about any future improvements in Python 3.11 or other workarounds.
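
For projects that cannot upgrade yet, a hedged sketch of the same idea in code: gate the profiler on the interpreter version so its sampling thread only runs on 3.12+ (note that a later comment in this thread reports a similar crash on 3.12 as well, so treat this as a mitigation, not a guarantee):

import sys

import sentry_sdk

# Enable profiling only on Python 3.12+; on 3.11 the sampler thread that
# triggers the reported segfault is never started.
profiles_rate = 1.0 if sys.version_info >= (3, 12) else None

sentry_sdk.init(
    dsn="...",  # your real DSN
    traces_sample_rate=1.0,
    profiles_sample_rate=profiles_rate,
)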

iris-baton commented 9 months ago

@antonpirker, I work with @fnmut. We upgraded to Python 3.12 and turned the profiler back on... and we're seeing segfaults again. The latest stack trace is:

Traceback for thread 101 [Has the GIL] (most recent call last):
    (Python) File "/usr/local/lib/python3.12/threading.py", line 1030, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/usr/local/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
        self.run()
    (Python) File "/usr/local/lib/python3.12/site-packages/sentry_sdk/integrations/threading.py", line 70, in run
        return old_run_func(self, *a, **kw)
    (Python) File "/usr/local/lib/python3.12/threading.py", line 1010, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/usr/local/lib/python3.12/site-packages/sentry_sdk/profiler.py", line 943, in run
        self.sampler()
    (Python) File "/usr/local/lib/python3.12/site-packages/sentry_sdk/profiler.py", line 829, in _sample_stack
        (str(tid), extract_stack(frame, cache, cwd))
    (Python) File "/usr/local/lib/python3.12/site-packages/sentry_sdk/profiler.py", line 270, in extract_stack
        f_back = raw_frame.f_back

executable: dramatiq
arguments: /usr/local/bin/python3.12 /usr/local/bin/dramatiq <task_file> -p 3 -t 12 -Q

Python version: 3.12.1

sentry-sdk 1.36.0

And the stacktrace according to gdb

warning: Build-id of /usr/local/bin/../lib/libpython3.12.so.1.0 does not match core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/bin/python3.12 /usr/local/bin/dramatiq baton_tms.tasks -p 3 -t 12 -Q'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc63515cbe1 in local_clear (self=0x7fc634a57100) at Python/pystate.c:1968
1968    Python/pystate.c: No such file or directory.
[Current thread is 1 (Thread 0x7fc5db7fe6c0 (LWP 101))]
(gdb) bt
#0  0x00007fc63515cbe1 in local_clear (self=0x7fc634a57100) at Python/pystate.c:1968
#1  0x00007fc635421760 in PyFunction_Type () from /usr/local/bin/../lib/libpython3.12.so.1.0
#2  0x00007fc6350c99bd in _PyDictOrValues_GetValues (dorv=...) at ./Include/internal/pycore_object.h:389
#3  _PyObject_GenericGetAttrWithDict (obj=<unknown at remote 0x7fc635038f10>, 
    name=<type at remote 0x7fc635421760>, dict=<optimized out>, suppress=0) at Objects/object.c:1442
#4  0x00007fc6350ac7f5 in _PyObject_FastCallDictTstate (tstate=0x7fc61b2cf400, callable='f_back', 
    args=0x7fc635421760 <PyFrame_Type>, nargsf=140489272850848, kwargs=0x0) at Objects/call.c:149
#5  0x0000000000000000 in ?? ()
(gdb) 
zyv commented 6 months ago

That's funny; it seems the f_back issue still wasn't completely fixed, though it works for us with threads. Are there any open CPython issues with respect to that?

Zylphrex commented 6 months ago

What Python version are you running? I've been checking on this issue but I haven't found any new open issues in CPython that look related to this.