Parthib opened this issue 3 months ago
Hey @Parthib,
thanks for bringing this up! I will have a look today and try to reproduce this.
Some questions: do you have enable_tracing=True set, or traces_sample_rate set to something other than 0 or None?
BTW: really great issue description, thank you so much for taking the time! 💯
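For context, by "tracing enabled" I mean an init along these lines (a sketch with a placeholder DSN, not your actual config):

import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    integrations=[FlaskIntegration()],
    traces_sample_rate=1.0,   # any non-zero rate turns performance tracing on
    # enable_tracing=True,    # alternative switch available in recent SDK versions
)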
There are two flavors of scopes: the current scope (holding data for a span, which basically covers a block of code) and the isolation scope (holding data for one request-response cycle).
We try to read the trace info from the current scope's span; if this does not work, we read it from the _trace_propagation_context on the isolation scope.
It seems that when there is an isolation scope without a _trace_propagation_context, you enter the infinite loop. But the _trace_propagation_context is set when the isolation scope is created.
So my guess is now, as you are using ASGI, that this is some kind of race condition: we try to access the _trace_propagation_context on the isolation scope before the scope is fully created, or something like that. Will investigate further.
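If it helps while debugging, in SDK 2.x you can inspect both scope flavors directly (a small sketch, assuming sentry-sdk >= 2.0):

import sentry_sdk

# Current scope: data for the active span / block of code
current_scope = sentry_sdk.get_current_scope()

# Isolation scope: data for the whole request-response cycle;
# this is where the propagation context normally lives
isolation_scope = sentry_sdk.get_isolation_scope()

print(current_scope.span)  # None when there is no active span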
One more question @Parthib, because you said it occurs "once per deployment and never repeats": do you know if it occurs when the first user makes a request to Flask, or does it happen before that, during startup?
It seems to occur when our automated testing logic hits one of our Flask service's endpoints, which eventually triggers the flask.render_template call! So yes, it happens when the first user makes a request to Flask, to that particular endpoint.
In our logs, I essentially see the server standing up, our health check endpoint being hit, and then the endpoint that triggers the flask.render_template call being hit.
Ok, I am trying to reproduce this. (If you have a reproduction, this would be great too!)
I gave it my best shot at reproducing this locally today, but I simply could not. So I tried the next best thing: I forked the sentry-sdk repo and added my own log lines: https://github.com/openlayer-ai/sentry-python-debug/commit/bcd6086bdbfc58b7d36eda291364695092accef1
I deployed it to our staging environment, where the issue is for sure happening consistently with new deployments. It spammed our logs quite a bit and turned out to be less helpful than I thought, so I am planning to refine it tomorrow.
Here's some logs around the infinite recursion:
7035  SENTRY SCOPE DEBUG: getting traceparent
7036  SENTRY SCOPE DEBUG: Is self.span None? True
7037  SENTRY SCOPE DEBUG: getting traceparent from isolation scope
7038  SENTRY SCOPE DEBUG: Retrieved isolation scope from context
7039  SENTRY SCOPE DEBUG: getting traceparent
7040  SENTRY SCOPE DEBUG: Is self.span None? True
7041  SENTRY SCOPE DEBUG: getting traceparent from isolation scope
7042  SENTRY SCOPE DEBUG: Retrieved isolation scope from context
7043  SENTRY SCOPE DEBUG: getting traceparent
7044  SENTRY SCOPE DEBUG: Is self.span None? True
Then we eventually recurse too deep:
7120  [2024-07-25 23:57:59,472] ERROR [Request ID: 691c60fa-da81-4058-a0d2-7f8cafa7e0a5] in errors: Unhandled exception: maximum recursion depth exceeded while calling a Python object
Interestingly I'm seeing that the request signal is being received AFTER all of these lines...
7135  SENTRY FLASK DEBUG: Request for request started signal received
which is bad, because I believe that is what sets up the isolation scope in the first place. Personally, I'm sensing a multithreading issue where the thread that calls the function associated with the signal is being blocked until we exit the infinite recursion with an error. Perhaps the limited cores or resources available on our ECS cluster are what cause the issue to surface there but not on my personal machine. But again, that does not necessarily explain why it happens only once after a new deployment...
Also oddly sometimes the logs print on the same line...
7139  SENTRY SCOPE DEBUG: Retrieved isolation scope from contextSENTRY SCOPE DEBUG: Retrieved isolation scope from context
Unsure what to make of this yet, but if there is anything you want to try, feel free to submit your own debugging branch to that repo and I can try it out in our staging environment tomorrow.
That's very cool!
Some explanation. The first the get_traceparent
is called it is on the so called current scope.
First we try to read the trace information from the span (this line: https://github.com/openlayer-ai/sentry-python-debug/commit/bcd6086bdbfc58b7d36eda291364695092accef1#diff-c935006d78056bc518dd6e898d31d972c130c63119df74edf122e2cf4a5733bcR511)
As there is no span, we try to read it from the propagation context (which should always fail, because the propagation context lives on the isolation scope) (this line https://github.com/openlayer-ai/sentry-python-debug/commit/bcd6086bdbfc58b7d36eda291364695092accef1#diff-c935006d78056bc518dd6e898d31d972c130c63119df74edf122e2cf4a5733bcR516)
Then we call the same function in the isolation scope (this line: https://github.com/openlayer-ai/sentry-python-debug/commit/bcd6086bdbfc58b7d36eda291364695092accef1#diff-c935006d78056bc518dd6e898d31d972c130c63119df74edf122e2cf4a5733bcR526)
And back from the top:
- trying to read from the span (this will always fail, because we are now on the isolation scope, but the span lives on the current scope)
- trying to read from the propagation context (this should now always work, because we are on the isolation scope and the propagation context is set in the constructor of the isolation scope); strangely, it is not there, so...
- we call get_traceparent() on the isolation scope again and we are in an infinite recursion (see the sketch after this list).
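To make the cycle concrete, here is a tiny self-contained model of the failure mode (just an illustration with made-up classes, not the SDK's actual code):

# Toy model of the lookup: span first, then propagation context, then fall
# back to the isolation scope. If the isolation scope's propagation context
# is missing, the fallback re-enters itself with the same state forever.
class ToyScope:
    def __init__(self, propagation_context=None):
        self.span = None
        self.propagation_context = propagation_context

    def get_traceparent(self, isolation_scope):
        if self.span is not None:
            return "traceparent-from-span"
        if self.propagation_context is not None:
            return "traceparent-from-propagation-context"
        return isolation_scope.get_traceparent(isolation_scope)

isolation = ToyScope(propagation_context=None)  # context missing / cleared
current = ToyScope()

try:
    current.get_traceparent(isolation)
except RecursionError:
    print("maximum recursion depth exceeded, as in the logs above")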
Reasons why this could happen:
- Somehow get_traceparent() is called from the scope's __init__(). But this does not happen (at least when looking at the code).
Question: does your Flask project also use the request_started signal? I am now thinking that maybe some code tries to run some of our SDK stuff before the SDK is fully initialized...
(I will look into this more next week, I need to head into the weekend right now :-) ) Thanks for all the info!
Thanks for looking into this!
Hm... I searched our codebase and I don't see anything in our Flask project that uses the request_started signal...
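For reference, this is roughly the kind of subscription I searched for and did not find (a sketch with a made-up handler, not code from our project):

from flask import Flask, request_started

app = Flask(__name__)

# A subscriber to Flask's request_started signal; nothing like this exists
# in our codebase.
def on_request_started(sender, **extra):
    print("request started on app:", sender.name)

request_started.connect(on_request_started, app)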
@Parthib thanks for that information. It would be super helpful if you could get us a minimal reproduction; otherwise, it will probably be difficult to help much further.
I think I've gotten down to a root cause after managing to reproduce it locally, but I would appreciate any input you might have. We may have been unnecessarily calling
@app.after_request
def clear_sentry_scope(response):
    with sentry_sdk.configure_scope() as scope:
        scope.clear()
    return response
after each one of our requests via that Flask decorator. This would clear the propagation context associated with the default isolation scope after the first Flask render request, and subsequent calls would infinitely recurse as a result. I can provide a minimal example if that doesn't make sense.
I think the bigger question, though, is: is my usage of scope.clear() simply incorrect?
@Parthib What are you trying to achieve with this scope.clear() call? I think you can remove it; the SDK should automatically take care of correctly isolating relevant information into the correct scope.
While you should likely simply remove this call, since it seems unnecessary, I think it is probably a bug that calling it this way causes infinite recursion. The code does not seem wrong to me, just unnecessary, and calling it should not break anything.
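For example, per-request tags like the following should not need any manual cleanup; they land on that request's isolation scope and go away with it (a sketch with a made-up route):

import sentry_sdk
from flask import Flask

app = Flask(__name__)

@app.route("/orders/<order_id>")   # hypothetical endpoint for illustration
def get_order(order_id):
    # Set on the current request's isolation scope; the next request starts
    # with a fresh scope, so no explicit scope.clear() is required.
    sentry_sdk.set_tag("order_id", order_id)
    return "ok"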
Cool, thanks for confirming. We were just trying to ensure that tagging-related information was being properly reset between requests. I think in prior versions of sentry-python this was a harmless action, so we never noticed that it was unnecessary.
At the same time, you're right that calling it this way should not cause an infinite recursion. Do you think you need that minimal reproduction of the issue?
Do you think you need that minimal reproduction of the issue?
Yes, it would be extremely helpful if you could provide this 🙂
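In the meantime, a sketch of roughly what that minimal reproduction could look like, based on the pieces discussed above (placeholder DSN; whether it actually reproduces may depend on the SDK version and how the app is served):

import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration
from flask import Flask, render_template_string

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    integrations=[FlaskIntegration()],
    traces_sample_rate=1.0,
)

app = Flask(__name__)

@app.after_request
def clear_sentry_scope(response):
    # The call under suspicion: clears the scope (and with it the
    # propagation context) after every request.
    with sentry_sdk.configure_scope() as scope:
        scope.clear()
    return response

@app.route("/render")
def render():
    # Rendering fires Flask's before_render_template signal, which makes
    # the SDK look up the traceparent.
    return render_template_string("hello {{ name }}", name="world")

if __name__ == "__main__":
    app.run()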
How do you use Sentry?
Sentry Saas (sentry.io)
Version
2.11.0
Steps to Reproduce
I have been noticing an infinite recursion issue within the Sentry SDK popping up in our staging/production logs, but I am still in the process of trying to reproduce it locally. It seems to happen only once per fresh deployment.
Some background that I think is relevant context: we use FlaskIntegration().
Here is the relevant part of a sample stacktrace I am seeing (I redacted it a bit to cut to the chase):
I did some research as to what is going on here, and it looks like Flask defines a signal called before_render_template that calls all registered functions right before a Flask template is rendered via a call to flask.render_template. It looks like Sentry registers the function _add_sentry_trace with this signal (see the sketch below). When the signal fires, _add_sentry_trace is called, which calls:
- scope.trace_propagation_meta()
- self.get_traceparent()
- Scope.get_isolation_scope().get_traceparent()
where it finally gets stuck in an infinite recursion of calling get_traceparent().
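For anyone unfamiliar with this mechanism, subscribing to the signal looks roughly like this (a sketch of the general pattern, not the SDK's actual _add_sentry_trace):

from flask import Flask, before_render_template, render_template_string

app = Flask(__name__)

# A before_render_template subscriber receives the template and its context
# right before rendering; the SDK uses this hook to inject trace metadata.
def add_trace_meta(sender, template, context, **extra):
    context.setdefault("example_trace_meta", "<meta name='trace' ...>")

before_render_template.connect(add_trace_meta, app)

@app.route("/")
def index():
    return render_template_string("{{ example_trace_meta }}")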
My attempts to reproduce locally are ongoing. Locally, I can't seem to get it to hit that return Scope.get_isolation_scope().get_traceparent() case. Usually, self.span is defined and get_traceparent exits here. When running with gunicorn, it looks like self.span doesn't get defined initially, but the function still manages to safely return within this conditional. Even in our staging/production environment, we only see the issue occur once per deployment and it never reoccurs.
I'll update this Issue if I manage to reproduce it locally, but I would appreciate any insights others have.
Expected Result
It should not be possible to get stuck in an infinite recursion within the Sentry SDK modules!
Actual Result
The app gets stuck in an infinite recursion of get_traceparent() calls.