DataDog / dd-trace-py

Datadog Python APM Client
https://ddtrace.readthedocs.io/

AttributeError: 'dict' object has no attribute '__name__' #8427

Closed: dannosaur closed this issue 1 month ago

dannosaur commented 7 months ago

Summary of problem

  File "ddtrace/internal/periodic.py", line 56, in run
    self._target()
  File "ddtrace/profiling/collector/__init__.py", line 42, in periodic
    for events in self.collect():
  File "ddtrace/profiling/collector/stack.pyx", line 511, in ddtrace.profiling.collector.stack.StackCollector.collect
    all_events = stack_collect(
  File "ddtrace/profiling/collector/stack.pyx", line 351, in ddtrace.profiling.collector.stack.stack_collect
    frames, nframes = _traceback.pyframe_to_frames(thread_pyframes, max_nframes)
  File "ddtrace/profiling/collector/_traceback.pyx", line 55, in ddtrace.profiling.collector._traceback.pyframe_to_frames
    cpdef pyframe_to_frames(frame, max_nframes):
  File "ddtrace/profiling/collector/_traceback.pyx", line 79, in ddtrace.profiling.collector._traceback.pyframe_to_frames
    "Got object of type '%s' instead of a frame object during stack unwinding", type(frame).__name__

Which version of dd-trace-py are you using?

Currently running a branch made by @P403n1x87

Which version of pip are you using?

24.0

Which libraries and their versions are you using?

See #8132

This error is a bit different from the others, as it is emitted as a warning log message:

Feb 15 10:27:12.427 Got object of type 'cell' instead of a frame object during stack unwinding
Feb 15 10:25:23.357 Got object of type 'dict' instead of a frame object during stack unwinding
Feb 15 10:25:12.461 Got object of type 'str' instead of a frame object during stack unwinding
Feb 15 10:25:12.000 Got object of type 'str' instead of a frame object during stack unwinding
Feb 15 10:19:32.271 Got object of type 'tuple' instead of a frame object during stack unwinding
Feb 15 10:14:58.789 Got object of type 'str' instead of a frame object during stack unwinding
Feb 15 10:14:58.000 Got object of type 'str' instead of a frame object during stack unwinding
Feb 15 10:12:46.032 Got object of type 'dict' instead of a frame object during stack unwinding
Feb 15 10:12:46.000 Got object of type 'dict' instead of a frame object during stack unwinding

...the list goes on.

The metadata on the log messages suggests that Celery might have something to do with this. What's interesting is that every container that emits this error fails to boot. In gunicorn's case the worker is killed with a "Worker timeout" error (then a SIGKILL), and in Celery's case the process just gets shut down. This suggests (I think) that no single request or edge case is triggering this particular error; it's something our project does on startup, but not every time. Since we're running in containers, I can't think of any variation between two containers starting up.

Unfortunately Sentry isn't expanding the context on the stack trace to show me what frame really is beyond its type, which is included in the log message.

self in ddtrace/profiling/collector/__init__.py:42 is reported as:

StackCollector(status=<ServiceStatus.RUNNING: 'running'>, recorder=Recorder(default_max_events=16384, max_events={<class 'ddtrace.profiling.collector.stack_event.StackSampleEvent'>: 30000, <class 'ddtrace.profiling.collector.stack_event.StackExceptionSampleEvent'>: 15000, <class 'ddtrace.profiling.collector.memalloc.MemoryAllocSampleEvent'>: 1920, <class 'ddtrace.profiling.collector.memalloc.MemoryHeapSampleEvent'>: None}), min_interval_time=0.01, max_time_usage_pct=1.0, nframes=64, ignore_profiler=False, endpoint_collection_enabled=True, tracer=<ddtrace.tracer.Tracer object at 0x7fb977159590>)

This is all the context I have.

P403n1x87 commented 7 months ago

The log messages you are seeing are warnings that we introduced some time ago to handle the cases where the runtime would give us seemingly random objects when we expect frame objects. This happens with CPython >= 3.11. We believe this is caused by the new way that the frame stack is maintained by the CPython VM, and by the way that Cython transpiles our thread stack unwinding logic. We've been working towards a more robust solution ever since, and we are now quite close to a new sampling mechanism that will allow us to bypass these problems.
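To make that concrete, here is a rough, pure-Python sketch of the pattern the warning comes from; the real collector is Cython (stack.pyx and _traceback.pyx), so this is only an illustration of the defensive check, not the actual implementation:

import logging
import sys
import types

log = logging.getLogger(__name__)

def pyframe_to_frames(frame, max_nframes=64):
    # Walk one thread's f_back chain, collecting (filename, lineno, funcname)
    # tuples, and bail out with a warning if the object handed to us is not a
    # real frame -- the same defensive check behind the messages quoted above.
    frames = []
    while frame is not None and len(frames) < max_nframes:
        if not isinstance(frame, types.FrameType):
            log.warning(
                "Got object of type '%s' instead of a frame object during stack unwinding",
                type(frame).__name__,
            )
            break
        frames.append((frame.f_code.co_filename, frame.f_lineno, frame.f_code.co_name))
        frame = frame.f_back
    return frames, len(frames)

def stack_collect():
    # sys._current_frames() maps thread id -> that thread's topmost frame.
    return {tid: pyframe_to_frames(top) for tid, top in sys._current_frames().items()}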

Could I clarify one thing, though, please?

What's interesting is that all containers that emit this error fail to boot. In gunicorn's case, the worker is killed with a "Worker timeout" error (then a SIGKILL), and in Celery's case the process just gets shut down.

I gather from this that your application is not working as expected. Is this the case only when it is instrumented with ddtrace? Because we're handling these "bad object" cases, we do not expect them to cause any issues for the application itself. If the logs feel too noisy, we could perhaps downgrade the messages to debug level.
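(In the meantime, if the messages are too noisy on your side, something like the following should quiet them, assuming they go through a standard-library logger under the ddtrace.profiling namespace; please check the logger name in your own log records first:)

import logging

# Assumption: the warning is emitted via a module-level logger inside
# ddtrace.profiling.collector, so raising the parent logger's level hides it.
logging.getLogger("ddtrace.profiling").setLevel(logging.ERROR)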

Also apologies if you have answered this question before, but does your application use gevent? Just thinking of other causes for the behaviour that you're seeing, in case ddtrace is at fault here.

Once again, many thanks for your cooperation and patience! 🙏

dannosaur commented 7 months ago

I haven't dug too deeply into it, but in yesterday's spot checks every warning message emitted by the package corresponded with a container failing to start up correctly. Our application works fine, but during scale-up events, when ECS tries to start a new task, that task sometimes fails to start and it takes a bit longer to reach desired capacity. It's random, and tasks start correctly more often than not, which makes this even weirder: it's not reliably repeatable.

Our stack is gunicorn with an async uvicorn worker (so yes, gevent).

github-actions[bot] commented 1 month ago

This issue has been automatically closed after a period of inactivity. If it's a feature request, it has been added to the maintainers' internal backlog and will be included in an upcoming round of feature prioritization. Please comment or reopen if you think this issue was closed in error.

j-krose commented 2 weeks ago

@P403n1x87 We are also experiencing this issue at Source.ag; it is causing our Lambda invocations to fail and alarms to ring 🚨 without a clear error in CloudWatch or DD logs.

sanchda commented 2 weeks ago

:wave: hi @j-krose, can you confirm that your problems go away when you disable Profiling (e.g., by setting DD_PROFILING_ENABLED=false)? Thanks!
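(DD_PROFILING_ENABLED is read from the process environment when ddtrace starts, so the cleanest place to set it is the deployment configuration, e.g. the Lambda function's environment variables. A minimal in-code sketch, which only works if it runs before ddtrace is imported, would be:)

import os

# Must run before ddtrace is imported / bootstrapped; setting it after the
# profiler has started typically has no effect.
os.environ["DD_PROFILING_ENABLED"] = "false"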

j-krose commented 2 weeks ago

Hi @sanchda, I can indeed confirm that for you tomorrow. Right now I am confirming that the issue starts popping up exactly at the point where we switch to Python 3.11 :+1:

j-krose commented 2 weeks ago

We have confirmed that this issue does not happen when using Python 3.10 and does happen when using 3.11.

I will experiment with using 3.11 with DD_PROFILING_ENABLED=false and post an update.

j-krose commented 2 weeks ago

@sanchda FYI, I moved our lambda to one of the older 3.11 versions on which we had frequently seen this error and switched to DD_PROFILING_ENABLED=false. I left it under operational load like this for ~16 hours and did not see any instances of this error (usually we would see somewhere between 8 and 16 instances in 16 hours).

So it does look like disabling profiling makes the issue go away, as does downgrading to Python 3.10.

j-krose commented 2 weeks ago

For extra caution, I changed back to DD_PROFILING_ENABLED=true just to make sure that, all else being equal, we do see this failure again. Within two hours another instance of the failure popped up.