Status: Closed (kravnaut closed this issue 4 months ago)
Thank you for the detailed writeup, @kravnaut.
My first instinct is that this could be related to https://github.com/DataDog/dd-trace-py/issues/5993.
Hello 👋 We are experiencing similar issues after upgrading to Python 3.11. On Python 3.10 everything works without issues, but once we upgrade to Python 3.11, we experience regular SIGSEGVs on websocket servers based on the `daphne` library. I tried to debug it further with faulthandler, inspecting core dumps, and toggling `ddtrace`'s debug mode, but unfortunately didn't find anything useful there.
We believe the issue is caused by `ddtrace` because the crashes go away when `ddtrace` is disabled. I'm happy to share more context around our specific setup, so please let me know if there's any information that you'll find useful. Below you can find some generic information about versions, etc.:
OS: Debian GNU/Linux 11 (bullseye)
Kernel: 5.4.249-163.359.amzn2.aarch64
Python version: 3.11.4
ddtrace version: 1.17.3
Also, let me know if I should turn this into a separate issue, as this one references `psycopg` specifically in the title.
I can confirm we're experiencing the same issue since migrating our Lambda functions to Python 3.11.
What we also noticed is that the segfault only happens if there was an unhandled exception during function execution and the Lambda handler function has a decorator. It can be any decorator, even a simple no-op one. Our debugging efforts only led us to find that the issue occurs somewhere deep within ddtrace, and setting `DD_TRACE_ENABLED` to `false` is the only "solution" we've found so far.
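To make the reported trigger concrete, here is a minimal sketch of the combination described above: a Lambda-style handler wrapped by a no-op decorator that raises an unhandled exception. The names (`noop`, `handler`, the event/context arguments) are illustrative assumptions, not the reporter's actual code; without ddtrace, the exception simply propagates.

```python
# Sketch of the reported crash trigger: any decorator (even a no-op) on a
# Lambda handler that raises an unhandled exception. All names here are
# hypothetical; under ddtrace on Python 3.11 this combination reportedly
# segfaulted, while without ddtrace the exception propagates normally.
import functools

def noop(func):
    """A no-op decorator; the report says any decorator reproduces the issue."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@noop
def handler(event, context):
    raise RuntimeError("unhandled exception during function execution")

try:
    handler({}, None)
except RuntimeError as exc:
    print(f"caught: {exc}")
```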
We're seeing the same thing in multiple services across multiple versions of python 3.11 and dd-trace. We don't use psycopg in any of those services.
@kravnaut @mLupine @pgrzesik @WarpRat May I check with you whether the profiler is also enabled? We are aware of similar issues that might be caused by the profiler on CPython 3.11. So if you have the profiler enabled, would disabling it solve your problem? We expect this to be the case, but would love your confirmation 🙏.
Hey @P403n1x87 - yes, we came to the same conclusion when talking with support directly. Unfortunately, turning off a useful feature is not the way to go in the long run. Recently, we've been checking whether the issue was solved in Python 3.11.5, but that didn't help. Our current hypothesis is that the bug might actually be in the CPython implementation, but it is not easy to reproduce.
> Unfortunately, turning off a useful feature is not the way to go in the long run.
Absolutely. We're actively working on our side to get this resolved ASAP.
> Our current hypothesis is that the bug might actually be in the CPython implementation, but it is not easy to reproduce.
Indeed this is also what our investigation is leaning towards, but we are committed to getting to the bottom of this so that we can provide a product that works. Please bear with us 🙏.
@P403n1x87 We've been seeing frequent segfaults with Python 3.11 and ddtrace in our Django application. It took about a week of configuration testing to determine it was the profiler. Since disabling the profiler we've had no issues.
Please disclose these things in the known issues in the future so we can minimize the time we need to troubleshoot configuration issues.
@nat45928 I'm working with @P403n1x87 on this issue--I acknowledge that diagnosing the original cause for a segmentation fault can be a time-intensive experience. I think it's doubly frustrating because an observability tool managed by an organization like Datadog isn't (yet!) everyone's first suspicion for a segfault. Sorry for that. This is a new and fast-moving issue; in our other investigation, the understanding was that it was highly circumstantial to the customer's setup. I think this thread is demonstrating this is not the case. I'll raise the idea of marking this as a known issue--appreciate the thought!
:wave: Another dev from Datadog's Profiling team here.
Several people in this thread have identified some extremely useful circumstantial evidence for the defect, but a little more context into the source of the crash would help us remediate this issue more quickly. If anyone has the time/flexibility to do so, could you perform a diagnostic operation?
Instructions are here: https://github.com/DataDog/dd-trace-py/issues/6701#issuecomment-1717659333. Posting similar information as the responder in that other thread would be glorious. Thank you!
@sanchda Before discovering this thread we were working towards being able to capture core dumps for further troubleshooting. Would privately sharing one of those be sufficient, or is there additional information you'd get from adding that binary to our containers?
@WarpRat, a corefile is better than my suggestion. That'd be wonderful.
Typically, submitting corefiles etc is something we do through our Support organization. You can submit a ticket at https://help.datadoghq.com/hc/en-us/requests/new. This is slightly annoying--I apologize for the extra step, but it would really, really help. If you mention Profiling (and possibly link this conversation), it should ping my team. This will give you a confidential channel to discuss the particular defect at your organization, plus it'll allow me to look at your organization's Datadog installation directly, which can have a lot of useful circumstantial information.
In order for me to make full use of a corefile, it would help to have some additional information: the output of `pip freeze` or similar. Many, many thanks! If this is just too much extra work for you to do, I understand--let me know and we'll try to figure out the next-best way of iterating.
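A stdlib-only way to gather the diagnostics requested above (Python/OS versions plus an installed-package list in `pip freeze` style) could look like the sketch below. The helper name `collect_debug_info` is my own, not something from ddtrace or this thread.

```python
# Hypothetical helper for collecting the environment info requested in a
# support ticket: interpreter version, platform, and installed packages.
import platform
import sys
from importlib import metadata

def collect_debug_info():
    # `name==version` lines, similar to `pip freeze` output
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip entries with broken metadata
    )
    return {
        "python_version": sys.version,
        "platform": platform.platform(),
        "packages": packages,
    }

info = collect_debug_info()
print(info["python_version"])
print(f"{len(info['packages'])} packages installed")
```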
Updating the issue to point to myself and to indicate the faulty component as Profiling. We're still not 100% sure that Profiling is a fundamental cause, but we're the side that will own coordination on this end.
Thank you for your patience, everyone!
Hi everyone,
Still working on this one. We're investigating a few angles related to wrapping. Is anyone able to test setting the `WRAPT_DISABLE_EXTENSIONS=true` environment variable? I'm maybe ~35% confident this will help anything, but it should be a quick and easy test.
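One detail worth noting when trying this: the variable has to be present before `wrapt` (a ddtrace dependency) is first imported, so it belongs in the process environment or at the very top of the entrypoint. A minimal sketch, assuming an in-code approach is acceptable:

```python
# WRAPT_DISABLE_EXTENSIONS must be set before wrapt is first imported, so do
# this at the very start of the entrypoint (or, better, in the deployment's
# environment configuration) rather than mid-application.
import os

os.environ["WRAPT_DISABLE_EXTENSIONS"] = "true"

# Any import of ddtrace/wrapt after this point should fall back to wrapt's
# pure-Python implementation instead of its C extension.
```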
Thanks
I just tried the `WRAPT_DISABLE_EXTENSIONS` setting in my environment and still saw segmentation faults (at ~30-minute intervals).
I'm on K8s (Microk8s, Ubuntu 22.04.1, Linux 5.15.0-58-generic) with a Python Django workload running via Daphne.
@nat45928 many thanks for trying that out! That eliminates one potential angle.
Hi again, everyone.
I have an x86_64+glibc release of ddtrace available, if anyone is interested in trying it out here. I had to host it out of a personal repo (sorry!), since we don't have a great workflow for sharing out-of-band custom builds.
Here's the URL. You'll have to consult with the docs of your particular package management system to see how to consume this. https://github.com/sanchda/test_gcr/releases/download/v1.16.1_innerapi/ddtrace-2.1.0.dev34+g9bd1fb9bb-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Another thing that would be useful is setting the `PYTHONFAULTHANDLER` environment variable to `true`. This will surface the Python context at time-of-crash, which will give us some hints to go by.
I'm not very optimistic that this will fix the problem for anyone, but it might yield some useful diagnostic information. Thank you for your patience.
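If setting `PYTHONFAULTHANDLER` in the environment is awkward (for example, because of how a container runtime injects variables), the stdlib `faulthandler` module can be enabled in code early at startup to get the same crash tracebacks. A minimal sketch:

```python
# In-code equivalent of PYTHONFAULTHANDLER: enable the stdlib fault handler
# early at startup so SIGSEGV/SIGABRT dump Python tracebacks to stderr.
import faulthandler
import sys

faulthandler.enable(file=sys.stderr, all_threads=True)
print(faulthandler.is_enabled())  # → True
```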
> @kravnaut @mLupine @pgrzesik @WarpRat May I check with you whether the profiler is also enabled? We are aware of similar issues that might be caused by the profiler on CPython 3.11. So if you have the profiler enabled, would disabling it solve your problem? We expect this to be the case, but would love your confirmation 🙏.
We are not using the profiler. And as an update, after the Python 3.11 upgrade we also started to get sporadic segmentation faults...
Thanks for confirming @kravnaut. Your original issue is likely related to tracing. Anybody else using OpenAI and tracing is affected by an issue that should be resolved by #7044. There is still a chance that the same PR fixes the original issue, since the fix is for a core component also used in other tracing integrations. So it would be worth trying the latest release with the fix, which should come out soon. If that doesn't fix it, the tracing team will keep investigating this.
We are using ddtrace 2.0.2 with profiling enabled, and see segfaults. With profiling disabled they go away.
Not using OpenAI; we are using `psycopg2`. We'll try to get `PYTHONFAULTHANDLER` set to see if we can get y'all some logs.
Hi - chiming in here to note that we're also seeing segfaults across both our instances running celery and those serving web requests via uvicorn (ddtrace 2.0.2). Turning off profiling appears to resolve the issue.
Still occurring after 2.1.1. Having trouble getting `PYTHONFAULTHANDLER` output; the way it's logged doesn't seem to play well with how dd-agent collects logs from Docker containers.
@jlucas91 @shughes-uk Thanks for the extra reports. Would you be able to test whether turning the tracer off (or just the psycopg integration) while keeping the profiler on also resolves the issue?
I can try that, our production environment runs with profiling off and does not encounter these segfaults though.
I managed to get the output from one of the segfaults, and it looks like the relevant part is:
File "/opt/coiled/env/lib/python3.11/site-packages/ddtrace/profiling/collector/__init__.py", line 42 in periodic
The only other thread not just sitting on `threading.wait` was our code that checks memory usage; it was on a line that opens a file handle to `/sys/fs/cgroup/memory.current` so we can have a quick look at what cgroup says our memory usage is. Not sure if y'all interact with that or not.
> @kravnaut @mLupine @pgrzesik @WarpRat May I check with you whether the profiler is also enabled? We are aware of similar issues that might be caused by the profiler on CPython 3.11. So if you have the profiler enabled, would disabling it solve your problem? We expect this to be the case, but would love your confirmation 🙏.
Sorry for the late reply here. We are (were) using profiling and can confirm that disabling it resolves the segfault issue. It has been an acceptable workaround, but we're definitely eager to re-enable profiling when we can. We're also using FastAPI/uvicorn for all the services where we see this. This looked promising, but we're on 3.11, not 3.12 - will that fix potentially address this issue on 3.11 too?
@WarpRat many thanks for your feedback and patience. The PR you mentioned, which we recently merged, affects CPython 3.11 too, so it is definitely worth a shot. We currently have work in progress to further improve our support of the latest CPython releases (3.11 and 3.12), so we expect to reach a position where this and similar issues are resolved with confidence in the coming months.
This issue has been automatically closed after a period of inactivity. If it's a feature request, it has been added to the maintainers' internal backlog and will be included in an upcoming round of feature prioritization. Please comment or reopen if you think this issue was closed in error.
Summary of problem
A few weeks ago we started seeing sporadic SIGSEGV/SIGABRT in our celery deployment as well as in a gunicorn app that uses SQLAlchemy over psycopg to talk to Postgres. We kept trying to narrow down the related change in our infrastructure/code. We added faulthandler and noticed that most of the crashes occurred within the tracing of the psycopg call. That itself is not super surprising, as we are heavy users of Postgres through SQLAlchemy+psycopg. However, after we decided to disable psycopg tracing specifically by setting
DD_PATCH_MODULES="psycopg:false"
the errors vanished in both the celery and gunicorn deployments.
Which version of dd-trace-py are you using?
We were on version ddtrace==1.4.4 when we saw the issue. We've upgraded to ddtrace==1.14.0 and there was no difference - same errors.
Which version of pip are you using?
23.1.2
Which libraries and their versions are you using?
Quite a few; we can share them through the customer support channel if needed.
How can we reproduce your problem?
Probably not easy to reproduce - the issue occurs with some frequency under consistent load. It only happens in our production environment, where there is enough volume and variety of requests.
What is the result that you get?
Ultimately we get SIGSEGV/SIGABRT with various tracebacks that tend to be somewhere in ddtrace. Some are generic errors implying malloc failures; some clearly imply memory corruption in random places: a few instances of
double free or corruption (fasttop)
and
malloc(): unsorted double linked list corrupted
etc. Garden-variety memory corruption.
What is the result that you expected?
Not this 🙂