getsentry / sentry-python

The official Python SDK for Sentry.io
https://sentry.io/for/python/
MIT License

Infinite Recursion Issue in Sentry SDK Module #3339

Open Parthib opened 1 month ago

Parthib commented 1 month ago

How do you use Sentry?

Sentry Saas (sentry.io)

Version

2.11.0

Steps to Reproduce

I have been noticing an infinite recursion issue within the Sentry SDK popping up in our staging/production logs, but I am still in the process of trying to reproduce it locally. It seems to happen only once per fresh deployment.

Some background that I think is relevant context:

  1. We use FlaskIntegration()
  2. We started seeing the issue after upgrading to Sentry 2.5.1. It still happens on 2.11.0.
  3. Flask==3.0.3
  4. We use Gunicorn version 22.0.0 with the -k uvicorn.workers.UvicornWorker parameter
  5. Our Flask app is a WSGI middleware with an ASGI wrapper around it (see the sketch after this list).
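
For context, here is a rough, hypothetical sketch of that deployment shape; the DSN, module names, and the asgiref wrapper are assumptions for illustration, not the actual service code:

    # app.py -- illustrative sketch only, not the real service
    import sentry_sdk
    from sentry_sdk.integrations.flask import FlaskIntegration
    from flask import Flask
    from asgiref.wsgi import WsgiToAsgi  # one possible WSGI -> ASGI wrapper

    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
        integrations=[FlaskIntegration()],
        traces_sample_rate=1.0,
    )

    flask_app = Flask(__name__)        # the plain WSGI application
    asgi_app = WsgiToAsgi(flask_app)   # ASGI wrapper served by the uvicorn worker

Something like gunicorn app:asgi_app -k uvicorn.workers.UvicornWorker would then serve the ASGI wrapper.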

Here is the relevant part of a sample stacktrace I am seeing (I redacted it a bit to cut to the chase):


message
[2024-07-23 22:07:40,209] ERROR [Request ID: 3c7c2cf9-e453-4cca-813c-f5e57d102f05] in errors: Unhandled exception: maximum recursion depth exceeded while calling a Python object
File "/usr/local/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
...Skipping ahead...
return flask.render_template(
File "/usr/local/lib/python3.9/site-packages/flask/templating.py", line 150, in render_template
return _render(app, template, context)
File "/usr/local/lib/python3.9/site-packages/flask/templating.py", line 128, in _render
before_render_template.send(
File "/usr/local/lib/python3.9/site-packages/blinker/base.py", line 279, in send
result = receiver(sender, **kwargs)
File "/usr/local/lib/python3.9/site-packages/sentry_sdk/integrations/flask.py", line 97, in _add_sentry_trace
trace_meta = Markup(scope.trace_propagation_meta())
File "/usr/local/lib/python3.9/site-packages/sentry_sdk/scope.py", line 572, in trace_propagation_meta
sentry_trace = self.get_traceparent()
File "/usr/local/lib/python3.9/site-packages/sentry_sdk/scope.py", line 514, in get_traceparent
return Scope.get_isolation_scope().get_traceparent()
File "/usr/local/lib/python3.9/site-packages/sentry_sdk/scope.py", line 514, in get_traceparent
return Scope.get_isolation_scope().get_traceparent()
File "/usr/local/lib/python3.9/site-packages/sentry_sdk/scope.py", line 514, in get_traceparent
return Scope.get_isolation_scope().get_traceparent()
[Previous line repeated 2967 more times]
File "/usr/local/lib/python3.9/site-packages/sentry_sdk/scope.py", line 499, in get_traceparent
client = Scope.get_client()
File "/usr/local/lib/python3.9/site-packages/sentry_sdk/scope.py", line 375, in get_client
current_scope = _current_scope.get()

I did some research into what is going on here. Flask defines a signal called before_render_template that notifies all registered receivers right before a template is rendered via flask.render_template, and the Sentry SDK registers the function _add_sentry_trace as a receiver of this signal.
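
To make the mechanism concrete, here is a tiny, self-contained illustration of how a receiver hooks into before_render_template; the receiver name is made up and this is not the SDK's code:

    from flask import Flask, render_template_string
    from flask.signals import before_render_template

    app = Flask(__name__)

    def my_receiver(sender, template, context, **extra):
        # Runs right before any template render triggered by this app,
        # the same hook point _add_sentry_trace uses.
        print("about to render:", template.name or "<inline template>")

    before_render_template.connect(my_receiver, app)

    with app.app_context():
        render_template_string("hello {{ name }}", name="world")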

When the signal is called, _add_sentry_trace is called, which

  1. Calls scope.trace_propagation_meta()
  2. Calls self.get_traceparent()
  3. Calls Scope.get_isolation_scope().get_traceparent()

where it finally gets stuck in an infinite recursion of get_traceparent() calls (a toy model of this loop follows below).
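
Here is a self-contained toy model of the fallback shape visible in the stacktrace; the names are simplified and assumed, and this is not sentry_sdk's actual implementation:

    _isolation_scope = None  # created lazily, roughly like the SDK's ContextVar default

    class ToyScope:
        def __init__(self, propagation_context=None):
            self.span = None
            self.propagation_context = propagation_context

        @classmethod
        def get_isolation_scope(cls):
            global _isolation_scope
            if _isolation_scope is None:
                _isolation_scope = cls()  # note: created WITHOUT a propagation context
            return _isolation_scope

        def get_traceparent(self):
            if self.span is not None:
                return "traceparent-from-span"
            if self.propagation_context is not None:
                return "traceparent-from-propagation-context"
            # Fall back to the isolation scope. If that scope also lacks a span
            # and a propagation context, this call recurses until RecursionError.
            return ToyScope.get_isolation_scope().get_traceparent()

    try:
        ToyScope().get_traceparent()  # no span, no propagation context anywhere
    except RecursionError:
        print("maximum recursion depth exceeded, as in the stacktrace")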

My attempts to reproduce this locally are ongoing. Locally, I can't seem to get it to hit that return Scope.get_isolation_scope().get_traceparent() case; usually, self.span is defined and get_traceparent exits here.

When running with gunicorn, it looks like self.span doesn't get defined initially, but the function still manages to safely return within this conditional.

Even in our staging/production environment, we only see the issue occur once per deployment and never reoccur.

I'll update this Issue if I manage to reproduce it locally, but I would appreciate any insights others have.

Expected Result

It should not be possible to get stuck in an infinite recursion within the Sentry SDK modules!

Actual Result

Gets stuck in an infinite recursion in get_traceparent() calls.


(Same stacktrace as shown above.)

antonpirker commented 1 month ago

Hey @Parthib,

thanks for bringing this up, I will have a look today and try to reproduce this.

antonpirker commented 1 month ago

Some questions:

antonpirker commented 1 month ago

BTW: really great issue description, thank you so much for taking that time! 💯

antonpirker commented 1 month ago

There are two flavors of scopes: the current scope (holding data for a span, which is basically a block of code) and the isolation scope (holding data for one request-response cycle). We try to read the trace info from the current scope's span; if that does not work, we read it from the _trace_propagation_context on the isolation scope.
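
In SDK 2.x these two scopes can be fetched with top-level helpers; a brief illustration (assuming a default init, not tied to this issue's setup):

    import sentry_sdk

    current_scope = sentry_sdk.get_current_scope()      # data for the active span / block of code
    isolation_scope = sentry_sdk.get_isolation_scope()  # data for one request-response cycle
    print(current_scope, isolation_scope)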

It seems that when there is an isolation scope without a _trace_propagation_context, you enter the infinite loop. But the _trace_propagation_context is set when the isolation scope is created.

So my guess now is that, since you are using ASGI, this is some kind of race condition: we try to access the _trace_propagation_context on the isolation scope before the scope is fully created, or something like that. Will investigate further.

antonpirker commented 1 month ago

One more question @Parthib, because you said it occurs "once per deployment and never repeats": do you know if it occurs when the first user makes a request to Flask, or does it happen earlier, during startup?

Parthib commented 1 month ago

It seems to occur when our automated testing logic hits one of our Flask service's endpoints, which eventually triggers the flask.render_template call! So yes, it happens when the first user makes a request to Flask to that particular endpoint.

In our logs, I essentially see the server standing up, our health check endpoint being hit, and then the endpoint for triggering the flask.render_template being hit.

antonpirker commented 1 month ago

Ok, I am trying to reproduce this. (If you have a reproduction, this would be great too!)

Parthib commented 1 month ago

I gave it my best to reproduce locally today, but I simply could not. So I tried the next best thing - I forked the sentry sdk repo and added my own log lines: https://github.com/openlayer-ai/sentry-python-debug/commit/bcd6086bdbfc58b7d36eda291364695092accef1

I deployed it to our staging environment where the issue is for sure happening consistently with new deployments. It spammed our logs quite a bit and it turned out to be less helpful than I thought, so I am planning to refine it tomorrow.

Here's some logs around the infinite recursion:

7035  SENTRY SCOPE DEBUG: getting traceparent
7036  SENTRY SCOPE DEBUG: Is self.span None? True
7037  SENTRY SCOPE DEBUG: getting traceparent from isolation scope
7038  SENTRY SCOPE DEBUG: Retrieved isolation scope from context
7039  SENTRY SCOPE DEBUG: getting traceparent
7040  SENTRY SCOPE DEBUG: Is self.span None? True
7041  SENTRY SCOPE DEBUG: getting traceparent from isolation scope
7042  SENTRY SCOPE DEBUG: Retrieved isolation scope from context
7043  SENTRY SCOPE DEBUG: getting traceparent
7044  SENTRY SCOPE DEBUG: Is self.span None? True

Then we eventually recurse too deep

7120  [2024-07-25 23:57:59,472] ERROR [Request ID: 691c60fa-da81-4058-a0d2-7f8cafa7e0a5] in errors: Unhandled exception: maximum recursion depth exceeded while calling a Python object

Interestingly I'm seeing that the request signal is being received AFTER all of these lines...

7135  SENTRY FLASK DEBUG: Request for request started signal received

which is bad, because I believe that signal is what sets up the isolation scope in the first place. Personally, I'm sensing a multithreading issue where the thread that calls the function associated with the signal is blocked until we exit the infinite recursion with an error. Perhaps the limited cores or resources available on our ECS cluster are what cause the issue to surface there but not on my personal machine. But again, that does not necessarily explain why it happens only once after a new deployment...

Also oddly sometimes the logs print on the same line...

7139  SENTRY SCOPE DEBUG: Retrieved isolation scope from contextSENTRY SCOPE DEBUG: Retrieved isolation scope from context

Unsure what to make of this yet, but if there is anything you want to try, feel free to submit your own debugging branch to that repo and I can try it out in our staging environment tomorrow.

antonpirker commented 1 month ago

That's very cool!

Some explanation: the first time get_traceparent is called, it is called on the so-called current scope.

And back from the top:

Reasons why this could happen:

Somehow get_traceparent() is called from the scope's __init__(). But this does not happen (at least from looking at the code).

Question:

Does your Flask project also use the request_started signal? I am now thinking maybe some code tries to run some of our SDK stuff before the SDK is fully initialized...

(I will look into this more next week, need to go into the weekend right now :-) ) Thanks for all the info!

Parthib commented 1 month ago

Thanks for looking into this!

Hm...I searched our codebase and I don't see anything in our Flask project that uses the request_started signal...

szokeasaurusrex commented 1 month ago

@Parthib thanks for that information. It would be super helpful if you could get us a minimal reproduction; otherwise, it will probably be difficult to help much further.

Parthib commented 1 month ago

I think I've gotten down to a root cause after managing to reproduce it locally, but I would appreciate any input you might have. We may have been unnecessarily calling

        @app.after_request
        def clear_sentry_scope(response):
            with sentry_sdk.configure_scope() as scope:
                scope.clear()
            return response

after each one of our requests via that Flask decorator. This would clear the propagation context associated with the default isolation scope after the first Flask render request, and subsequent calls would infinitely recurse as a result. I can provide a minimal example of the issue if that doesn't make sense.
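
For illustration, a rough sketch of that pattern (hypothetical; the DSN, route, and template are placeholders, not our actual repro):

    import sentry_sdk
    from sentry_sdk.integrations.flask import FlaskIntegration
    from flask import Flask, render_template_string

    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
        integrations=[FlaskIntegration()],
        traces_sample_rate=1.0,
    )

    app = Flask(__name__)

    @app.after_request
    def clear_sentry_scope(response):
        # Clears the scope, including its propagation context, after every request.
        with sentry_sdk.configure_scope() as scope:
            scope.clear()
        return response

    @app.route("/render")
    def render():
        # Per the description above, once the scope has been cleared, a later
        # request that renders a template can recurse in get_traceparent().
        return render_template_string("<p>{{ 1 + 1 }}</p>")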

I think the bigger question though is - is my usage of scope.clear() simply incorrect?

szokeasaurusrex commented 1 month ago

@Parthib What are you trying to achieve with this scope.clear() call? I think you can remove it; the SDK should automatically take care of correctly isolating relevant information into the correct scope.

While you likely should simply remove this call since it seems unnecessary, I think it is probably a bug that calling it this way causes infinite recursion. This code does not seem wrong to me, just unnecessary, but calling it should probably not break anything.
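
For what it's worth, a minimal sketch of the intended usage, assuming a typical Flask handler (names are made up): per-request data can simply be set on the scope inside the request, and the SDK isolates it per request-response cycle without any manual clearing.

    import sentry_sdk
    from flask import Flask

    app = Flask(__name__)

    @app.route("/orders/<order_id>")
    def get_order(order_id):
        # Attached to this request's scope only; the FlaskIntegration's
        # isolation keeps it from leaking into other requests.
        sentry_sdk.set_tag("order_id", order_id)
        return {"order_id": order_id}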

Parthib commented 1 month ago

Cool, thanks for confirming - we were just trying to ensure that tagging-related information was being properly reset between requests. I think in prior versions of sentry-python this was a harmless action, so we never noticed that it was unnecessary.

At the same time, you're right I think that calling it this way should not cause an infinite recursion. Do you think you need that minimal reproduction of the issue?

szokeasaurusrex commented 1 month ago

Do you think you need that minimal reproduction of the issue?

Yes, it would be extremely helpful if you could provide this 🙂