I also had a similar issue with 1.40, it was killing my uwsgi workers, had to downgrade to 1.39. Also wish I had more information to give but there wasn't much to go on. The issue was worse with profiling on (workers would be killed via SIGBUS)
@fiendish @seb-b Hey folks, thanks for bringing this to our attention. Even if there is no repro, more info could help point us in the right direction -- what kind of apps (Django, Flask, FastAPI, something else?) are you running? Which SDK integrations are active (initializing the SDK with debug=True should log this at startup)?
Looking at the changelog for 1.40.0, there are a couple of things we could try:
sentry_sdk.init(
    ...  # your usual stuff
    _experiments={
        "enable_metrics": False,
    },
)
sentry_sdk.init(
    ...  # your usual stuff
    enable_db_query_source=False,
)
Oops, yeah, sorry. In my haste I forgot to include that. In my case it's Flask 2.1.3 + uWSGI 2.0.23, and
integrations = [
    FlaskIntegration(),
    LoggingIntegration(
        level=logging.INFO,
        event_level=logging.ERROR,
    ),
]
I'll try to find time over the weekend or on Monday to try those things.
Same here with Django 3.2 LTS, uWSGI 2.0.23. These are my settings:
sentry_logging = LoggingIntegration(
    level=logging.INFO,  # Capture info and above as breadcrumbs
    event_level=logging.ERROR,  # Send errors as events
)

sentry_settings = {
    "dsn": SENTRY_LIVE_DSN,
    "integrations": [DjangoIntegration(), sentry_logging],
    "send_default_pii": True,
    "enable_tracing": True,
}
Thanks @gerricom -- are you able to try out turning metrics and/or DB query source off as outlined here?
Yes, I'd like to try this, but I'll have to wait until things are a bit calmer here in the evening (in ~6 hrs or so).
We hit the same issue too with a Python 3.11 Django 4.1.10 app running on uWSGI 2.0.23 and sentry-sdk 1.40.0. We upgraded from sentry-sdk 1.15.0.
We did not change the sentry_sdk.init, which is:
sentry_sdk.init(  # type: ignore[abstract]
    dsn=SENTRY_DSN,
    integrations=[
        sentry_django(),
        sentry_logging(event_level=None),
    ],
    release='myapp',
    environment=settings.APP_ENVIRONMENT,
    attach_stacktrace=True,
    sample_rate=DEFAULT_SAMPLE_RATE,
    shutdown_timeout=DEFAULT_SHUTDOWN_TIMEOUT,
    before_send=alert_utils.before_send,
)
Our uWSGI setup: we have the following setting to recycle uWSGI workers every hour:
max-worker-lifetime = 3600  # seconds
After an hour of worker lifetime, the uWSGI master complains that the listen queue is full (log snippet below). The workers never get respawned and we eventually lose the Kubernetes pod.
Fri Feb 2 17:53:20 2024 - worker 1 lifetime reached, it was running for 21 second(s)
Fri Feb 2 17:53:20 2024 - worker 2 lifetime reached, it was running for 21 second(s)
[WARNING] 032/175340 (10426) : Server uwsgi/uwsgi-7001 is DOWN, reason: Layer7 timeout, check duration: 6000ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 032/175342 (10426) : Server uwsgi/uwsgi-7002 is DOWN, reason: Layer7 timeout, check duration: 6000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 032/175342 (10426) : Server uwsgi/uwsgi-7000 is DOWN, reason: Layer7 timeout, check duration: 6001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 032/175342 (10426) : backend 'uwsgi' has no server available!
Fri Feb 2 17:59:35 2024 - *** uWSGI listen queue of socket "127.0.0.1:7101" (fd: 3) full !!! (32/32) ***
Fri Feb 2 17:59:36 2024 - *** uWSGI listen queue of socket "127.0.0.1:7101" (fd: 3) full !!! (32/32) ***
I was able to reproduce this in our test environment by setting max-worker-lifetime = 20 (seconds) and hence making the workers die sooner. I could see the same behavior as in the logs above.
Additionally, I noticed that the worker processes never went away when I ran ps aux, which makes me think that supervisord, which we use to restart the processes, thought the processes never died.
I also confirmed that this issue does not occur when we kill workers via _raise_sigint() based on PSS memory, or when workers get reloaded via the reload-on-rss parameter. I have only seen this happen when the max-worker-lifetime value kicks in.
I'm getting a segfault:
!!! uWSGI process 127 got Segmentation Fault !!!
*** backtrace of 127 ***
/usr/local/bin/uwsgi(uwsgi_backtrace+0x2f) [0x560ebeeb224f]
/usr/local/bin/uwsgi(uwsgi_segfault+0x23) [0x560ebeeb2613]
/lib/x86_64-linux-gnu/libc.so.6(+0x38d60) [0x7f09d1c1cd60]
/usr/local/lib/libpython3.12.so.1.0(+0x2f0511) [0x7f09d20e5511]
/usr/local/lib/libpython3.12.so.1.0(_PyThread_CurrentFrames+0x82) [0x7f09d1fb7a66]
/usr/local/lib/libpython3.12.so.1.0(+0x2390a0) [0x7f09d202e0a0]
/usr/local/lib/libpython3.12.so.1.0(PyObject_Vectorcall+0x51) [0x7f09d204ab61]
/usr/local/lib/libpython3.12.so.1.0(_PyEval_EvalFrameDefault+0x6a8e) [0x7f09d203cfae]
/usr/local/lib/libpython3.12.so.1.0(+0x27198b) [0x7f09d206698b]
/usr/local/lib/libpython3.12.so.1.0(_PyEval_EvalFrameDefault+0x86cb) [0x7f09d203ebeb]
/usr/local/lib/libpython3.12.so.1.0(+0x27198b) [0x7f09d206698b]
/usr/local/lib/libpython3.12.so.1.0(+0x1c85ea) [0x7f09d1fbd5ea]
/usr/local/lib/libpython3.12.so.1.0(+0x185106) [0x7f09d1f7a106]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f09d2612ea7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f09d1cdfa2f]
*** end of backtrace ***
DAMN ! worker 1 (pid: 127) died, killed by signal 11 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 150)
Disabling metrics fixed it.
Startup logs:
*** Starting uWSGI 2.0.23 (64bit) on [Sat Feb 3 19:00:44 2024] ***
compiled with version: 10.2.1 20210110 on 03 February 2024 18:58:51
os: Linux-4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023
nodename: [redacted]
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 4
current working directory: /app
detected binary path: /usr/local/bin/uwsgi
chdir() to /app/
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to UNIX address /var/opt/nomad/run/divvy-onboarding-ux-test-d31774b4-7f2e-62a7-356a-13ab41d8f3b1.sock fd 3
uwsgi socket 1 bound to TCP address 0.0.0.0:20324 fd 4
Python version: 3.12.1 (main, Feb 1 2024, 04:52:14) [GCC 10.2.1 20210110]
Python main interpreter initialized at 0x7fb214c30028
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 385000 bytes (375 KB) for 4 cores
*** Operational MODE: preforking ***
[sentry] DEBUG: Setting up integrations (with default = True)
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.aiohttp.AioHttpIntegration: AIOHTTP not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.boto3.Boto3Integration: botocore is not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.bottle.BottleIntegration: Bottle not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.celery.CeleryIntegration: Celery not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.django.DjangoIntegration: Django not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.falcon.FalconIntegration: Falcon not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.fastapi.FastApiIntegration: Starlette is not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.httpx.HttpxIntegration: httpx is not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.pyramid.PyramidIntegration: Pyramid not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.rq.RqIntegration: RQ not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.sanic.SanicIntegration: Sanic not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.sqlalchemy.SqlalchemyIntegration: SQLAlchemy not installed.
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.starlette.StarletteIntegration: Starlette is not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.tornado.TornadoIntegration: Tornado not installed
[sentry] DEBUG: Setting up previously not enabled integration flask
[sentry] DEBUG: Setting up previously not enabled integration pure_eval
[sentry] DEBUG: Setting up previously not enabled integration argv
[sentry] DEBUG: Setting up previously not enabled integration atexit
[sentry] DEBUG: Setting up previously not enabled integration dedupe
[sentry] DEBUG: Setting up previously not enabled integration excepthook
[sentry] DEBUG: Setting up previously not enabled integration logging
[sentry] DEBUG: Setting up previously not enabled integration modules
[sentry] DEBUG: Setting up previously not enabled integration stdlib
[sentry] DEBUG: Setting up previously not enabled integration threading
[sentry] DEBUG: Setting up previously not enabled integration redis
[sentry] DEBUG: Did not enable default integration redis: Redis client not installed
[sentry] DEBUG: Enabling integration flask
[sentry] DEBUG: Enabling integration pure_eval
[sentry] DEBUG: Enabling integration argv
[sentry] DEBUG: Enabling integration atexit
[sentry] DEBUG: Enabling integration dedupe
[sentry] DEBUG: Enabling integration excepthook
[sentry] DEBUG: Enabling integration logging
[sentry] DEBUG: Enabling integration modules
[sentry] DEBUG: Enabling integration stdlib
[sentry] DEBUG: Enabling integration threading
[sentry] DEBUG: Setting SDK name to 'sentry.python.flask'
[sentry] DEBUG: [Profiling] Setting up profiler in thread mode
WSGI app 0 (mountpoint='') ready in 2 seconds on interpreter 0x7fb214c30028 pid: 1 (default app)
spawned uWSGI master process (pid: 1)
spawned uWSGI worker 1 (pid: 6, cores: 1)
spawned uWSGI worker 2 (pid: 7, cores: 1)
spawned uWSGI worker 3 (pid: 8, cores: 1)
spawned uWSGI worker 4 (pid: 9, cores: 1)
[sentry] DEBUG: [Tracing] Discarding <http.server> transaction <generic WSGI request> because traces_sampler returned 0 or False
[sentry] DEBUG: [Profiling] Discarding profile because transaction is discarded.
[sentry] DEBUG: Discarding transaction because sampled = False
[pid: 6|app: 0|req: 1/1] [redacted] () {28 vars in 388 bytes} [Sat Feb 3 19:00:47 2024] GET /ping => generated 21 bytes in 9 msecs (HTTP/1.1 200) 2 headers in 71 bytes (1 switches on core 0)
Repo is here: https://github.com/RoboJackets/divvy-onboarding-ux
The same problem with sentry-sdk 1.40 (Debian 11, Python 3.9.2, Django 4.2.9, uWSGI 2.0.23): uWSGI workers hang after the lifetime period.
The metrics code "just" spawns a thread. I don't think we spawn this one differently from our other background threads, but that's worth checking.
uwsgi is kinda crashy and I have seen bad failures in the past with threads. The metrics code, other than the flusher thread, should be inactive unless someone calls into the metrics functions, which is not happening by default.
One difference I think is that we do not sleep but wait on a condvar. Maybe we can repro with a dummy uwsgi app and wait until it crashes, then inspect the core dump.
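For illustration, here is a rough standalone sketch (not the SDK's actual code; DummyFlusher and its interval are made up) of the pattern being described: a background flusher thread that blocks on a threading.Event with a timeout instead of sleeping, so shutdown can wake it up immediately.

import threading

class DummyFlusher:
    def __init__(self, interval=5.0):
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # wait() returns False on timeout (keep flushing) and True once the
        # event is set (shutdown requested), so no bare sleep() is needed.
        while not self._stop_event.wait(self._interval):
            self._flush()

    def _flush(self):
        pass  # aggregate and send buffered metrics here

    def stop(self):
        self._stop_event.set()
        self._thread.join()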
Folks, we're releasing SDK version 1.40.1 in a couple moments that includes a fix for the uWSGI workers hanging issue. Let us know if upgrading fixes the issue.
1.40.1 did not fix the issue on my end. I'm still struggling to find time to explore the code, but a colleague should be testing out some selected reversions (reverting the default activation of enable_metrics and enable_db_query_source) tomorrow morning.
Thanks @fiendish, please keep us posted.
I see 2-3 possibly separate issues in this thread.
The one good thing is that we can rule out this having anything to do with the other potential culprit (query source) since the app you linked @kberzinch is as far as I can tell not using a DB at all. So I'm pretty sure this is metrics, maybe in tandem with profiling since there was a report that profiling makes the issue worse.
Can any of you folks that are running into the workers dying share your uwsgi config? (I've already seen the one in the repo you linked @kberzinch, have been trying to use that to repro, no luck so far.) Maybe we can find some common denominator there that would help us reproduce this.
The worker segfaults immediately after requesting an endpoint that uses set_user (then nginx returns a 502 to the client for that request). That happens consistently: sending the same request again will kill the worker again. Other endpoints seem to work OK.
You are correct, that app does not use a DB, it is literally a glorified webform to send me an email 😄
I can try to fiddle around and get a minimum reproducible example, but probably not for a few days, sorry.
So I'm pretty sure this is metrics, maybe in tandem with profiling since there was a report that profiling makes the issue worse.
Indeed for us just toggling the metrics back off seems to fix the breakage.
An example uwsgi ini is:
[uwsgi]
master = true
http-socket = [::]:40101
stats = socketdir/foobar_stats.sock
chmod-socket = 666
memory-report = true
procname-prefix = foobar:40101-
auto-procname = true
wsgi-file = foo/bar/app.py
force-cwd = foo
need-app = true
enable-proxy-protocol = true
reload-mercy=3
worker-reload-mercy=3
add-header=X-Version-Label: foobar-202402072102-zorp
cheaper-algo = busyness
cheaper-overload = 30
cheaper-busyness-backlog-nonzero = 5
cheaper-initial = 8
cheaper = 8
workers = 20
max-requests = 1000
harakiri = 60
logformat=%(addr) - %(user) %(time) "%(method) %(uri) %(proto)" %(status) %(size) "%(referer)" "%(uagent)" %(pid) %(host) %(micros)
disable-logging=true
die-on-term=true
ignore-sigpipe=true
ignore-write-errors=true
disable-write-exception=true
Thanks for verifying and for the config @fiendish!
@kberzinch I'm obviously running a rather stripped down version of your app since I don't have all the necessary env vars but still can't repro. Will continue trying though.
We're continuing to look into this -- in the meantime I've also opened a uwsgi issue to possibly get some insight there.
I know it's a focused issue here, but you might consider yanking the 1.40.x releases. Or at least reverting the part that changed the default enablement of the metrics experiment.
@fiendish does adding enable-threads=true to your uwsgi config make a difference?
We just spent several engineering hours investigating why our service was crashing, and traced it to sentry-sdk 1.40.2. Please yank the versions, this is a critical bug.
This issue just bit our services as well. Python (version 3.11.7) Django (version 5.0.1) app running on uWSGI (version 2.0.23) using sentry-sdk (version 1.40.2). Spent a couple hours trying to figure out why our app was instantly crashing. Haven't found a resolution on our end.
Unfortunately uwsgi has long-standing threading bugs and crashes, and something in the metrics flusher is triggering one of those bugs. It's almost equivalent to the flusher we have used for years. The only difference I can see is the event. However, it also relies heavily on the GIL functioning, and I wonder if uwsgi does not hold the GIL properly.
Since uwsgi is no longer fixing such crashers, I'm not sure what a good course of action here is. For now we can just turn it off. Ultimately the issue lies with uwsgi :(
Since uwsgi is no longer fixing such crashers
Can you be more specific if you know something that is not commonly known?
https://uwsgi-docs.readthedocs.io/en/latest/ says "The project is in maintenance mode (only bugfixes [...])"
We reported quite a few segfaults over the years. The lack of a test suite and the invasiveness of some of these fixes impede resolution.
Folks, we're rolling out 1.40.3 which disables metrics for uWSGI for now.
Hi all,
I know I'm a bit late to the party, but it seems that --lazy is a "workaround" (on this particular Flask 3.0.2 app).
uwsgi --processes 3 --master -w index:app --need-app --env mount=/api --http-socket 0.0.0.0:8080 --env SENTRY_DSN=https://xxx --enable-threads --die-on-term --reload-mercy 1 --threads 1 --worker-reload-mercy 1
reliably produces the segfault on 2.0.23, 2.0.22, 2.0.21, and with py3.11.6 I cannot build older ones.
With a WSGI stub function I can't reproduce it (even if everything else - sentry, pymongo, and a bunch of other stuff - is loaded), but with a minimal Flask app it can be reproduced.
(a.py)
import sentry_sdk

sentry_sdk.init(
    dsn='insert valid sentry dsn',
    # Set traces_sample_rate to 1.0 to capture 100%
    # of transactions for performance monitoring.
    traces_sample_rate=1.0,
    # Set profiles_sample_rate to 1.0 to profile 100%
    # of sampled transactions.
    # We recommend adjusting this value in production.
    profiles_sample_rate=1.0,
    environment='local-dev',
    debug=True,
)

from flask import Flask

flask_app = Flask(__name__)

@flask_app.route('/')
def index():
    return 'k'
using
uwsgi -w a:flask_app --need-app --http-socket 0.0.0.0:8080 -p 4
With -p 1 it does not occur, nor when the request gets assigned to the first worker (and this kind of explains why lazy helps). Probably the issue is that something gets initialized before forking (you don't say!? :))
*** Starting uWSGI 2.0.23 (64bit) on [Sat Feb 10 19:57:45 2024] ***
compiled with version: 13.2.0 on 06 February 2024 18:47:26
os: Linux-6.5.0-15-generic #15-Ubuntu SMP PREEMPT_DYNAMIC Tue Jan 9 22:39:36 UTC 2024
nodename: lima-default
machine: aarch64
clock source: unix
detected number of CPU cores: 4
current working directory: /opt/lima/xxx/xxx/wsgi-hack
detected binary path: /opt/lima/xxx/backend-venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
*** WARNING: you are running uWSGI without its master process manager ***
your processes number limit is 31425
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address 0.0.0.0:8080 fd 3
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
Python version: 3.11.6 (main, Oct 8 2023, 05:06:43) [GCC 13.2.0]
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0xffff8d5207b8
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 291616 bytes (284 KB) for 4 cores
*** Operational MODE: preforking ***
/opt/lima/xxx/backend-venv/lib/python3.11/site-packages/sentry_sdk/_compat.py:161: Warning: We detected the use of uwsgi with disabled threads. This will cause issues with the transport you are trying to use. Please enable threading for uwsgi. (Add the "enable-threads" flag).
warn(
[sentry] DEBUG: Setting up integrations (with default = True)
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.aiohttp.AioHttpIntegration: AIOHTTP not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.boto3.Boto3Integration: botocore is not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.bottle.BottleIntegration: Bottle not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.celery.CeleryIntegration: Celery not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.django.DjangoIntegration: Django not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.falcon.FalconIntegration: Falcon not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.fastapi.FastApiIntegration: Starlette is not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.httpx.HttpxIntegration: httpx is not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.pyramid.PyramidIntegration: Pyramid not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.rq.RqIntegration: RQ not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.sanic.SanicIntegration: Sanic not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.sqlalchemy.SqlalchemyIntegration: SQLAlchemy not installed.
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.starlette.StarletteIntegration: Starlette is not installed
[sentry] DEBUG: Did not import default integration sentry_sdk.integrations.tornado.TornadoIntegration: Tornado not installed
[sentry] DEBUG: Setting up previously not enabled integration argv
[sentry] DEBUG: Setting up previously not enabled integration atexit
[sentry] DEBUG: Setting up previously not enabled integration dedupe
[sentry] DEBUG: Setting up previously not enabled integration excepthook
[sentry] DEBUG: Setting up previously not enabled integration logging
[sentry] DEBUG: Setting up previously not enabled integration modules
[sentry] DEBUG: Setting up previously not enabled integration stdlib
[sentry] DEBUG: Setting up previously not enabled integration threading
[sentry] DEBUG: Setting up previously not enabled integration flask
[sentry] DEBUG: Setting up previously not enabled integration redis
[sentry] DEBUG: Did not enable default integration redis: Redis client not installed
[sentry] DEBUG: Enabling integration argv
[sentry] DEBUG: Enabling integration atexit
[sentry] DEBUG: Enabling integration dedupe
[sentry] DEBUG: Enabling integration excepthook
[sentry] DEBUG: Enabling integration logging
[sentry] DEBUG: Enabling integration modules
[sentry] DEBUG: Enabling integration stdlib
[sentry] DEBUG: Enabling integration threading
[sentry] DEBUG: Enabling integration flask
[sentry] DEBUG: Setting SDK name to 'sentry.python.flask'
[sentry] DEBUG: [Profiling] Setting up profiler in thread mode
WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0xffff8d5207b8 pid: 826678 (default app)
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI worker 1 (pid: 826678, cores: 1)
spawned uWSGI worker 2 (pid: 826681, cores: 1)
spawned uWSGI worker 3 (pid: 826682, cores: 1)
spawned uWSGI worker 4 (pid: 826683, cores: 1)
[sentry] DEBUG: [Tracing] Starting <http.server> transaction <generic WSGI request>
[sentry] DEBUG: [Profiling] Initializing profile
[sentry] DEBUG: [Profiling] Starting profile
[sentry] DEBUG: [Profiling] Stopping profile
[sentry] DEBUG: [Profiling] Discarding profile because insufficient samples.
!!! uWSGI process 826683 got Segmentation Fault !!!
*** backtrace of 826683 ***
uwsgi(uwsgi_backtrace+0x30) [0xaaaaab7d6a30]
uwsgi(uwsgi_segfault+0x34) [0xaaaaab7d6e14]
linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffff8d6e38f8]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(_PyThread_CurrentFrames+0x94) [0xffff8d06bc54]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(+0x1c4ac4) [0xffff8cf84ac4]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(PyObject_Vectorcall+0x4c) [0xffff8cf35d4c]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x38d4) [0xffff8cec9e14]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(+0x262970) [0xffff8d022970]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(+0x177ffc) [0xffff8cf37ffc]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x79f4) [0xffff8cecdf34]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(+0x262970) [0xffff8d022970]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x79f4) [0xffff8cecdf34]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(+0x262970) [0xffff8d022970]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(+0x177ffc) [0xffff8cf37ffc]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(+0x399194) [0xffff8d159194]
/usr/lib/aarch64-linux-gnu/libpython3.11.so.1.0(+0x2ba0e0) [0xffff8d07a0e0]
/usr/lib/aarch64-linux-gnu/libc.so.6(+0x837d0) [0xffff8cc837d0]
/usr/lib/aarch64-linux-gnu/libc.so.6(+0xef5cc) [0xffff8ccef5cc]
*** end of backtrace ***
(--enable-threads doesn't make any difference)
@PAStheLoD thanks for spending time on this. While you are at it, does lazy-apps suffice or do you really need lazy?
Thanks @PAStheLoD, this is very helpful. I'd also be interested in whether --lazy-apps has the same effect on your app or if you need --lazy.
My current thoughts on this are that there has to be a way to make this work, since the transport worker and profiler work largely the same way and we're not seeing them explode people's apps. One difference (other than the use of threading.Event) that I see in the metrics code vs. other places where we have background threads is that in metrics we start the thread right off the bat, whereas elsewhere the thread is only started on demand. I can imagine this making a difference in prefork mode. I'll align this.
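As an illustrative sketch only (the LazyFlusher name and the pid check are assumptions, not the SDK's actual implementation), starting the thread on demand under a preforking server could look roughly like this: the thread is created in whichever process first needs it, rather than once in the master before fork().

import os
import threading

class LazyFlusher:
    def __init__(self):
        self._lock = threading.Lock()
        self._thread = None
        self._thread_pid = None

    def _ensure_thread(self):
        # Spawn the flusher thread lazily, and re-spawn it if the current
        # process differs from the one that originally started it (i.e. we
        # are now in a forked worker).
        if self._thread is not None and self._thread_pid == os.getpid():
            return
        with self._lock:
            if self._thread is None or self._thread_pid != os.getpid():
                self._thread = threading.Thread(target=self._run, daemon=True)
                self._thread.start()
                self._thread_pid = os.getpid()

    def _run(self):
        pass  # flush loop would live here

    def record(self, value):
        self._ensure_thread()
        # ...buffer the value for the flusher thread...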
I'm also wondering if upgrading to the recent uWSGI 2.0.24 might make a difference?
We were also hit by uWSGI workers getting stuck, presumably due to changes in sentry-sdk, although we did try to revert upgrades without much luck. In the end we ended up migrating from uWSGI to Unit and it's been rock solid for us. Just in case anybody else also gets frustrated with uWSGI...
Everyone, we've just released 1.40.4 with a tentative fix.
It'd greatly help us out if those of you affected by this issue could try it out and report back. What you need to do is install sentry-sdk 1.40.4, add the following to your init, and run with your usual uWSGI config.
sentry_sdk.init(
    ...  # your usual stuff
    # Metrics collection is still off by default in 1.40.4 if you're under uWSGI,
    # but it can be force-enabled with:
    _experiments={
        "force_enable_metrics": True,
    },
)
Caveat: The fix in 1.40.4 should work as long as you're not manually emitting any metrics (with sentry_sdk.metrics.<metric_type>(...)) on startup (emitting on demand, e.g. on request, is fine). In that case the only known fix so far is to run uWSGI with --lazy-apps to disable preforking mode. We're looking into addressing this as well without requiring --lazy-apps.
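To illustrate the caveat, here is a minimal sketch (assuming the experimental sentry_sdk.metrics.incr helper and the force_enable_metrics flag from above; the app and metric names are made up): a metric emitted at import time runs in the master before the workers are forked, while one emitted inside a request handler only runs in a worker after forking.

import sentry_sdk
import sentry_sdk.metrics
from flask import Flask

sentry_sdk.init(
    dsn="insert valid sentry dsn",
    _experiments={"force_enable_metrics": True},
)

app = Flask(__name__)

# Problematic under preforking uWSGI: this would run at import time in the
# master, starting the metrics flusher thread before fork().
# sentry_sdk.metrics.incr("app.boot")

@app.route("/")
def index():
    # Fine: this runs inside a worker after fork, so the flusher thread
    # is spawned on demand in that worker.
    sentry_sdk.metrics.incr("index.hit")
    return "ok"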
Also please note that, for the SDK to not break in other unexpected ways, uWSGI needs to be run with --enable-threads.
I can reproduce the issue.
uwsgi invocation with a.py from @PAStheLoD:
uwsgi -w a:flask_app --need-app --http-socket 127.0.0.1:8080 -p 4 --enable-threads
The segfault is caused by uwsgi not handling threads correctly with default arguments.
When forking Python with multiple threads running, the fork hooks PyOS_BeforeFork, PyOS_AfterFork_Child and PyOS_AfterFork_Parent have to be run before/after fork. A correct fork sequence looks something like this. uwsgi runs none of those by default. It seems like this is by design, as uwsgi assumes that wsgi apps are single threaded (source).
To make the segfault disappear, uwsgi has to be called with --enable-threads (to initialize the GIL) and --py-call-uwsgi-fork-hooks (to run PyOS_AfterFork_Child in the worker after fork).
uwsgi -w a:flask_app --need-app --http-socket 127.0.0.1:8080 -p 4 --enable-threads --py-call-uwsgi-fork-hooks
Interestingly, this behaviour is still not fully correct: only PyOS_AfterFork_Child is called. PyOS_BeforeFork and PyOS_AfterFork_Parent are not called. I'm not sure if that can lead to segfaults too, but in any case not all hooks registered by os.register_at_fork are called when the worker is forked. (The before and after_in_parent hooks are not called.)
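For reference, a small standalone example of the os.register_at_fork hooks mentioned above (the print messages are just for illustration): under a plain os.fork() in CPython all three hooks fire, whereas per the analysis above uwsgi's default fork path runs none of them, and --py-call-uwsgi-fork-hooks only restores the child-side hook.

import os

os.register_at_fork(
    before=lambda: print(f"[{os.getpid()}] before fork"),
    after_in_parent=lambda: print(f"[{os.getpid()}] after fork, in parent"),
    after_in_child=lambda: print(f"[{os.getpid()}] after fork, in child"),
)

pid = os.fork()
if pid == 0:
    # Child: with CPython's own fork handling, the after_in_child hook has
    # already run by this point (this is also where the interpreter fixes up
    # its threading state).
    os._exit(0)
else:
    os.waitpid(pid, 0)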
Not sure if there is any viable workaround for the sentry SDK, as the corruption happens outside of its control.
IMHO this is not a sentry issue, but caused by unsafe defaults in uwsgi. I wish uwsgi would pick safer defaults.
Used software versions:
Linux 6.1.71-1-lts (Arch Linux kernel build)
Python 3.11.8 (Arch Linux PKGBUILD, but rebuilt with debug symbols)
uwsgi 2.0.23 and 2.0.24
sentry_sdk 1.40.2
Thanks a lot for looking into this @natano!
I think the best we can do on the SDK side moving forward is to issue a warning on startup if we detect we're in non-lazy mode and --py-call-uwsgi-fork-hooks is not on, just like we currently do if you're running uWSGI without --enable-threads.
The fix in 1.40.4 should work even without --py-call-uwsgi-fork-hooks in most cases, because with that release we spawn all threads on demand. BUT if you're unlucky and a thread is spawned at startup before forking, you might run into this again.
So to sum up for future reference, if you run into this issue, run uWSGI with either:
- --enable-threads and --py-call-uwsgi-fork-hooks, or
- --lazy-apps (or --lazy, but this is discouraged) and --enable-threads
To add some extra seasoning to this issue: we saw failures in our AWS Lambda executors after moving to 1.40.3:
Traceback (most recent call last):
< ... omitted ... >
File "/var/task/sentry_sdk/integrations/threading.py", line 56, in sentry_start
return old_start(self, *a, **kw)
File "/var/lang/lib/python3.9/threading.py", line 899, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
@thedanfields I'd wager a guess that's an AWS-specific issue - reminds me of https://github.com/getsentry/sentry-python/issues/2632 without the upgrading-to-Python-3.12 part. I wonder if we've hit some kind of AWS Lambda limit on the number of threads, since the SDK is now spawning one additional thread.
Could you open a new issue for this please?
@sentrivana, happy to oblige.
I created a new issue as requested.
How do you use Sentry?
Sentry SaaS (sentry.io)
Version
1.40.0
Steps to Reproduce
I don't have a reproduction yet due to the nature of the issue, for which I apologize, but with 1.40.0 my server processes are silently dying after about 90 minutes of run time. This does not happen with 1.39.2. I just want to get it on your radar that the latest update may be causing issues for some. I wish I had more information, but I don't at this time.
Expected Result
^
Actual Result
^