DataDog / dd-trace-py

Datadog Python APM Client
https://ddtrace.readthedocs.io/

Segmentation Fault in dd-trace-py on Python 3.12 #9205

Open sampritipanda opened 3 months ago

sampritipanda commented 3 months ago

Summary of problem

I've been getting frequent segmentation faults in my application since I started using Python 3.12.

Which version of dd-trace-py are you using?

ddtrace = "^2.8.1"

How can we reproduce your problem?

Not very reproducible so far.

What is the result that you get?

The Docker image we use is thehale/python-poetry:1.8.2-py3.12-slim.

Here's a stack trace of the segfault from one of the core dumps generated:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007f2d1c47ae8f in __pthread_kill_internal (signo=11, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007f2d1c42bfb2 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  <signal handler called>
#4  0x00007f2d1c7f656c in ?? () from /usr/local/bin/../lib/libpython3.12.so.1.0
#5  0x00007f2ceccfec71 in memalloc_malloc () from /.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/_memalloc.cpython-312-x86_64-linux-gnu.so
#6  0x00007f2d1c7fdfd1 in PyUnicode_New () from /usr/local/bin/../lib/libpython3.12.so.1.0
#7  0x00007f2d1c7fd907 in ?? () from /usr/local/bin/../lib/libpython3.12.so.1.0
#8  0x00007f2d1c8900a2 in _PyErr_SetString () from /usr/local/bin/../lib/libpython3.12.so.1.0
#9  0x00007f2d1c82f53b in PyLong_AsLong () from /usr/local/bin/../lib/libpython3.12.so.1.0
#10 0x00007f2cec02c457 in ?? () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#11 0x00007f2cec02ca67 in Frame::Frame(PyCodeObject*, int) () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#12 0x00007f2cec02cb6a in Frame::get(PyCodeObject*, int) () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#13 0x00007f2cec02cd6b in Frame::read(_object*, _object**) () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#14 0x00007f2cec02ce2c in ?? () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#15 0x00007f2cec02cf1a in ?? () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#16 0x00007f2cec02f37a in ThreadInfo::unwind(_ts*) () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#17 0x00007f2cec02fcc1 in ThreadInfo::sample(long, _ts*, unsigned long) () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#18 0x00007f2cec02e186 in ?? () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#19 0x00007f2cec02e2bc in ?? () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#20 0x00007f2cec02ba3d in ?? () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#21 0x00007f2cec02bac0 in Datadog::Sampler::sampling_thread(unsigned long) () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#22 0x00007f2cec034ac0 in ?? () from /.venv/lib/python3.12/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-312-x86_64-linux-gnu.so
#23 0x00007f2d1c479134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#24 0x00007f2d1c4f97dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

It's somewhat unclear what causes this segfault, but all of our segfaults have this exact same stack trace. Here are code pointers for some of the stack frames, which I found while trying to root-cause this:

4 - (I believe this is the Python allocator's malloc implementation)
5 - https://github.com/DataDog/dd-trace-py/blob/main/ddtrace/profiling/collector/_memalloc.c#L116
9 - https://github.com/python/cpython/blob/3.12/Objects/longobject.c#L542-L543
10 - https://github.com/P403n1x87/echion/blob/main/echion/strings.h#L104

What is the result that you expected?

No segfaults pls 😄 Really like the product otherwise.

sanchda commented 3 months ago

:wave: Thank you for the report! Unfortunately, this is a known problem in the "stack v2" implementation of the profiler on Python 3.12 (it does not occur in 3.11 or earlier). If you haven't tried the "legacy" stack collector (just omit the DD_PROFILING_STACK_V2_ENABLED environment variable, or use DD_PROFILING_STACK_V2_ENABLED=false), then please give it a shot and see if it offers you some relief.

If you're using "stack v2" for a reason (such as avoiding the even greater number of segfaults originating from the cpython runtime for the legacy stack collector), then please ignore that advice. 😄
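For anyone who wants to try the legacy collector, here is a minimal sketch of one way to force the setting when launching the app from Python. Only the DD_PROFILING_STACK_V2_ENABLED variable comes from this thread; the helper names `legacy_stack_env` and `run_with_legacy_collector` and the `app.py` entry point are made up for illustration, and `ddtrace-run` is assumed to be how the app is normally started.

```python
import os
import subprocess


def legacy_stack_env(base_env=None):
    """Return a copy of the environment with the stack v2 collector turned off.

    Hypothetical helper: it only edits environment variables and does not
    call into ddtrace itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Setting "false" explicitly should be equivalent to omitting the
    # variable, but it keeps the choice visible in deployment config.
    env["DD_PROFILING_STACK_V2_ENABLED"] = "false"
    return env


def run_with_legacy_collector(argv):
    """Run a command with the modified environment and return its exit code."""
    return subprocess.run(argv, env=legacy_stack_env(), check=False).returncode


if __name__ == "__main__":
    # Assumption: the service is started via ddtrace-run and lives in app.py.
    raise SystemExit(run_with_legacy_collector(["ddtrace-run", "python", "app.py"]))
```

Setting the variable in the container or deployment config works just as well; the wrapper is only handy if you want to flip the setting per process while comparing crash rates.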

This may actually have been fixed in ddtrace 2.8.4. I'll be testing and working on a fix this upcoming week, and I'll check back in on May 15th or so to confirm whether that version actually has the fix. If you'd like to try the new release to see if it helps, please let me know how that goes!

sampritipanda commented 3 months ago

Thanks, I indeed had DD_PROFILING_STACK_V2_ENABLED and DD_PROFILING_EXPORT_LIBDD_ENABLED enabled to mitigate segfaults from Python 3.11. Should I disable both of them or just STACK_V2?

sanchda commented 2 months ago

:wave: sorry, I'm not sure why I lost track of this thread.

If you're using stack v2 in order to mitigate segfaults, then unfortunately there's not much relief available right now. I'm working on a fix.