getsentry / sentry-python

The official Python SDK for Sentry.io
https://sentry.io/for/python/
MIT License

Sentry background worker is chronically blocking async event loop when many exceptions are raised #2824

Open cpvandehey opened 6 months ago

cpvandehey commented 6 months ago

How do you use Sentry?

Self-hosted/on-premise

Version

1.40.6

Steps to Reproduce

Hello! And thanks for reading my ticket :)

The Python Sentry client is a synchronous client library retrofitted to the async model: it spins off separate threads so it doesn't block the event loop thread directly (see the background worker (1) for thread usage).

Under healthy conditions, the Sentry client doesn't need to make many web requests. However, if conditions become rocky and exceptions are frequently raised (caught or uncaught), the Sentry client can become a serious drag on the app's event loop (assuming a high sample rate). This is due to the OS thread context switching that effectively pauses/blocks the event loop while other threads (i.e. the background worker (1)) do their work. This pattern is not recommended (obviously) because of the cost of switching threads, but it can be useful for quickly/lazily retrofitting sync code.

Relevant flow, in short: every time an exception is raised (caught or uncaught) in my code and sampled, a web request is immediately made to send the data to Sentry. Since Sentry's background worker is thread based (1), this triggers a thread context switch followed by a synchronous web request to send the data. When an application hits many exceptions in a short period of time, this becomes a context-switching nightmare.
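For illustration, the thread-based pattern described above can be sketched as a toy model (this is not the actual SDK code; `send_event` is a stand-in for the synchronous HTTP call the real transport makes):

```python
import queue
import threading
import time

sent = []

def send_event(event):
    # Stand-in for the synchronous HTTP POST the real transport performs.
    time.sleep(0.001)  # simulate a little network latency
    sent.append(event)

class BackgroundWorker:
    """Toy model of a thread-based background worker: the app thread
    enqueues events, a daemon thread dequeues and sends them."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            event = self._queue.get()
            if event is None:  # sentinel pushed by flush() to stop
                break
            send_event(event)  # blocking call on the worker thread

    def submit(self, event):
        # Called from the application thread; every submit wakes the
        # worker thread, which costs an OS context switch.
        self._queue.put(event)

    def flush(self):
        self._queue.put(None)
        self._thread.join()

worker = BackgroundWorker()
for i in range(5):
    worker.submit({"exception": i})
worker.flush()
```

The point of the sketch is the per-event wakeup in `submit`: under a burst of exceptions, every event forces a switch between the event loop thread and the worker thread.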

Suggestion: in an ideal world, Sentry would asyncify its background worker to use an asyncio task (1), and its transport layer (2) would use aiohttp. I don't think this is of super high complexity, but I could be wrong.
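The suggested shape might look roughly like this (a dependency-free sketch, not a proposed SDK implementation; `send_event` is a placeholder coroutine where a real transport would make an `aiohttp` request):

```python
import asyncio

sent = []

async def send_event(event):
    # Placeholder for an aiohttp POST; kept as a plain coroutine so the
    # sketch has no third-party dependencies.
    await asyncio.sleep(0)
    sent.append(event)

class AsyncBackgroundWorker:
    """Toy model of an asyncio-native worker: events are queued and
    drained by a task on the same event loop, so no extra OS threads
    and no thread context switches are involved."""

    def __init__(self):
        self._queue = asyncio.Queue()
        self._task = None

    def start(self):
        self._task = asyncio.ensure_future(self._run())

    async def _run(self):
        while True:
            event = await self._queue.get()
            if event is None:  # sentinel pushed by flush() to stop
                break
            await send_event(event)

    def submit(self, event):
        # Non-blocking from the event loop's perspective.
        self._queue.put_nowait(event)

    async def flush(self):
        self._queue.put_nowait(None)
        await self._task

async def main():
    worker = AsyncBackgroundWorker()
    worker.start()
    for i in range(5):
        worker.submit({"exception": i})
    await worker.flush()

asyncio.run(main())
```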

An immediate workaround could be more control over the background worker. If Sentry's background worker sent data at configurable intervals, it would behave far more efficiently for event loop apps. At the moment, the background worker always sends exception data immediately. In my opinion, since Sentry already flushes data at app exit, a 60-second flush timer would alleviate most of the symptoms I described above without ever losing data (albeit up to 60 seconds slower).
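The interval-based workaround can be sketched like this (again a toy model with made-up names, using a short interval for demonstration; the idea is that `submit` only appends to a buffer, and the worker wakes once per interval instead of once per event):

```python
import threading

class BatchingWorker:
    """Toy sketch of interval-based flushing: events are buffered and
    sent in one batch per interval, drastically reducing how often the
    worker thread wakes up."""

    def __init__(self, interval, send_batch):
        self._interval = interval
        self._send_batch = send_batch
        self._buffer = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, event):
        # Cheap: just appends under a lock; does not wake the worker.
        with self._lock:
            self._buffer.append(event)

    def _drain(self):
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._send_batch(batch)

    def _run(self):
        # wait() returns False on timeout (drain and loop again) and
        # True once flush() sets the stop event.
        while not self._stop.wait(self._interval):
            self._drain()

    def flush(self):
        # Final drain at shutdown so no events are lost.
        self._stop.set()
        self._thread.join()
        self._drain()

batches = []
worker = BatchingWorker(interval=0.05, send_batch=batches.append)
for i in range(10):
    worker.submit({"exception": i})
worker.flush()
```

The shutdown drain in `flush` mirrors the point in the text: as long as data is flushed at app exit, delaying sends never loses events, it only delays them.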

(1) - https://github.com/getsentry/sentry-python/blob/1b0e932c3f827c681cdd20abfee9afc55e5d141c/sentry_sdk/worker.py#L20

(2) - https://github.com/getsentry/sentry-python/blob/1b0e932c3f827c681cdd20abfee9afc55e5d141c/sentry_sdk/transport.py#L244

Expected Result

I expect to have less thread context switching when using sentry.

Actual Result

I see a lot of thread context switching when there are high exception rates.

antonpirker commented 6 months ago

Hey @cpvandehey ! Thanks for the great ticket!

sentrivana commented 6 months ago

Hey @cpvandehey, thanks for writing in. I definitely agree that our async support could use some improvements (see e.g. https://github.com/getsentry/sentry-python/issues/1735, https://github.com/getsentry/sentry-python/issues/2007, https://github.com/getsentry/sentry-python/issues/2184, and multiple other issues).

Using an aiohttp client and an asyncio task both sound doable and would go a long way toward making the SDK more async friendly.

antonpirker commented 6 months ago

We could detect whether aiohttp is in the project and, based on that, enable the new async support automatically. (I haven't thought long about whether this could lead to problems, though.)
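The detection itself is straightforward with the standard library (a minimal sketch; `async_transport_available` and `transport_kind` are hypothetical names, not SDK API):

```python
import importlib.util

def async_transport_available():
    # True if aiohttp is importable in the current environment, without
    # actually importing it. An SDK could use a check like this to opt
    # into an async transport automatically.
    return importlib.util.find_spec("aiohttp") is not None

# Fall back to the threaded transport when aiohttp is missing.
transport_kind = "async" if async_transport_available() else "threaded"
```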

cpvandehey commented 4 months ago

Hey @sentrivana / @antonpirker , any update on the progress for this? Happy to help

sentrivana commented 4 months ago

Hey @cpvandehey, no news on this, but PRs are always welcome if you feel like giving this a shot.

cpvandehey commented 2 months ago

I see the milestone for this task was removed. @antonpirker, should we still consider writing our own attempt?

sentrivana commented 2 months ago

Hey @cpvandehey, sorry for the confusion regarding the milestone. Previously we were (mis)using milestones to group issues together, but we have now decided to abandon that system. Nothing has changed priority-wise.

cpvandehey commented 1 month ago

Alright, I think I'm going to start implementing this. Stay tuned.

cpvandehey commented 1 month ago

Coming up for air after a few hours of tinkering. I realized a few things that I should discuss before proceeding:

Exhales

Like most async integrations, this seems easy on the surface but ends up touching a lot of the code. I'm wondering if I am on the right track with what the Python Sentry folks want for this design. I would love for this to be collaborative and iterative. Let me know your thoughts on the approach above :)

antonpirker commented 1 month ago

Hey @cpvandehey !

Thanks for this great issue and your motivation! You are right, our async game is currently not the best, and we should, and will, improve on it.

To your deep dive:

Currently we are in the middle of a big refactor, where we are trying to use OpenTelemetry (OTel) under the hood for our performance instrumentation.

We should not do the OTel and async refactors at the same time; that would lead to a lot of complexity and headaches.

So my proposal is that we first finish the OTel refactor and then tackle the async refactor. The OTel refactor will probably still take a couple of months (like 2-3, not 10-12). Do you think you can wait a while until we get started with this?

As this is a huge task, we should then create a milestone and split it up into smaller chunks that can be tackled by multiple people at the same time.

cpvandehey commented 1 month ago

Do you think you can wait a while until we get started with this?

yes

As this is a huge task, we should then create a milestone and split it up into smaller chunks that can be tackled by multiple people at the same time.

sounds good!