Closed dannosaur closed 1 month ago
The log messages you are seeing are warnings that we introduced some time ago to handle the cases where the runtime would give us seemingly random objects when we expect frame objects. This happens with CPython >= 3.11. We believe this is caused by the new way that the frame stack is maintained by the CPython VM, and by the way that Cython transpiles our thread stack unwinding logic. We've been working towards a more robust solution ever since, and we are now quite close to a new sampling mechanism that will allow us to bypass these problems.
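As a rough illustration (hypothetical sketch only, not ddtrace's actual code), the kind of guard described above, which tolerates non-frame objects showing up during stack unwinding on CPython >= 3.11, might look like:

```python
import sys
import types

def looks_like_frame(obj):
    # Hypothetical guard: the sampler expects a frame object while
    # unwinding thread stacks, but on CPython >= 3.11 it can
    # occasionally receive something else entirely, which should be
    # skipped (with a warning) rather than crash the profiler.
    return isinstance(obj, types.FrameType)

print(looks_like_frame(sys._getframe()))  # a real frame passes
print(looks_like_frame("oops"))           # anything else is rejected
```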
Can I clarify one thing, though, please?
What's interesting is that all containers that emit this error fail to boot. In gunicorn's case, the worker is killed with a "Worker timeout" error (then a SIGKILL), and in Celery's case the process just gets shut down.
I gather from this that your application is not working as expected. Is this the case only when it is instrumented with `ddtrace`? Because we're handling these "bad object" cases, we do not expect them to cause any issues to the application itself. If the logs feel too noisy, we could perhaps downgrade the log messages to debug level.
Also, apologies if you have answered this question before, but does your application use `gevent`? Just thinking of other causes for the behaviour that you're seeing, in case it's not `ddtrace` that is at fault here.
Once again, many thanks for your cooperation and patience! 🙏
I haven't dug too much into it, but when I spot-checked yesterday, the warning messages emitted by the package always corresponded with a container failing to start up correctly. Our application works fine, but during scale-up events, when ECS tries to start a new task, that task sometimes fails to start, and it takes a bit longer to reach desired capacity. It's random, and containers start correctly more often than they don't; what makes this even weirder is that it's not reliably repeatable.
Our stack is gunicorn with an async uvicorn worker (so yes, gevent).
This issue has been automatically closed after a period of inactivity. If it's a feature request, it has been added to the maintainers' internal backlog and will be included in an upcoming round of feature prioritization. Please comment or reopen if you think this issue was closed in error.
@P403n1x87 We are also experiencing this issue at Source.ag, causing our lambdas invocations to fail and alarms to ring 🚨 without a clear error in cloudwatch or dd logs.
:wave: hi @j-krose, can you confirm that your problems go away when you disable Profiling (e.g., by setting `DD_PROFILING_ENABLED=false`)? Thanks!
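For reference, a minimal sketch of how that toggle might be applied before starting a service (the `gunicorn` invocation and app module here are placeholders, not from this thread):

```shell
# Disable the Datadog profiler while keeping tracing enabled,
# then start the application as usual.
export DD_PROFILING_ENABLED=false
gunicorn myapp.wsgi:application
```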
Hi @sanchda, I can indeed confirm this for you tomorrow. Right now I am confirming that the issue starts popping up at the boundary where we switch to Python 3.11 :+1:
We have confirmed that this issue does not happen when using Python 3.10 and does happen when using 3.11.
I will experiment with using 3.11 with `DD_PROFILING_ENABLED=false` and post an update.
@sanchda FYI I moved our lambda to one of the older 3.11 versions on which we have frequently seen this error and switched to `DD_PROFILING_ENABLED=false`. I left it under operational load like this for ~16 hours and did not see any instances of this error (usually we would see somewhere between 8 and 16 instances in 16 hours).
So it does look like disabling profiling makes the issue go away, as does downgrading to Python 3.10.
For extra caution, I changed back to `DD_PROFILING_ENABLED=true` just to make sure that, all else equal, we do see this failure again. Within two hours another instance of the failure popped up.
Summary of problem
Which version of dd-trace-py are you using?
Currently running a branch made by @P403n1x87
Which version of pip are you using?
24.0
Which libraries and their versions are you using?
See #8132
This error is a bit different from the others, as it emits a warning log message;
..the list goes on.
The metadata from the log messages suggests that celery might have something to do with this. What's interesting is that all containers that emit this error fail to boot. In gunicorn's case, the worker is killed with a "Worker timeout" error (then a SIGKILL), and in Celery's case the process just gets shut down. This means (I think) that no single request or edge case is causing this particular error; rather, it's something that our project might be doing on startup, but not always. As we're running in containers, there should be no variation between two containers starting up that I can think of.
Unfortunately Sentry isn't expanding the context on the stack trace to show me what `frame` really is beyond its type, which is sent in the log message. `self` in `ddtrace/profiling/collector/__init__.py:42` is reported as;
This is all the context I have.