DataDog / dd-trace-py

Datadog Python APM Client
https://ddtrace.readthedocs.io/

Memory corruption and SIGSEGV/SIGABRT with patching of psycopg #6061

Closed: kravnaut closed this issue 4 months ago

kravnaut commented 1 year ago

Summary of problem

For a few weeks we had been seeing sporadic SIGSEGV/SIGABRT in our celery deployment as well as in a gunicorn app that uses sqlalchemy over psycopg to talk to postgres. We kept trying to narrow down the related change in our infrastructure/code. We added faulthandler and noticed that most of the crashes occurred within the tracing of the psycopg call. That by itself is not very surprising, since we use postgres heavily through sqlalchemy+psycopg. However, after we disabled tracing specifically for psycopg by setting DD_PATCH_MODULES="psycopg:false", the errors vanished in both the celery and the gunicorn deployments.
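
For reference, a minimal sketch of the two knobs mentioned above: enabling faulthandler so that fatal signals dump a Python traceback, and disabling only the psycopg integration via DD_PATCH_MODULES. The gunicorn entrypoint in the comment below is hypothetical, not our real service.

```python
# Minimal sketch, not our exact setup: faulthandler gives a Python-level
# traceback on SIGSEGV/SIGABRT, and DD_PATCH_MODULES turns off a single
# integration while leaving the rest of ddtrace on. The env var is normally
# set on the deployment, e.g.:
#   DD_PATCH_MODULES="psycopg:false" ddtrace-run gunicorn myapp.wsgi   # "myapp" is hypothetical
import faulthandler

faulthandler.enable()  # write tracebacks for fatal signals to stderr
```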

Which version of dd-trace-py are you using?

We were on version ddtrace==1.4.4 when we saw the issue. We've upgraded to ddtrace==1.14.0 and there was no difference - same errors.

Which version of pip are you using?

23.1.2

Which libraries and their versions are you using?

Quite a few; we can share them through the customer support channel if needed.

How can we reproduce your problem?

Probably not easy to reproduce. The issue occurs with some frequency under consistent load, and only in our production environment, where there is enough volume and variety of requests.

What is the result that you get?

Ultimately we get SIGSEGV/SIGABRT with various tracebacks that typically end up somewhere in ddtrace, plus some generic errors implying malloc failures. For example:

File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2773 in all
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2916 in _iter
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1717 in execute
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1710 in _execute_20
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 334 in _execute_on_connection
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1577 in _execute_clauseelement
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1905 in _execute_context
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736 in do_execute
File "/usr/local/lib/python3.10/site-packages/ddtrace/contrib/dbapi/__init__.py", line 155 in execute
File "/usr/local/lib/python3.10/site-packages/ddtrace/contrib/psycopg/cursor.py", line 14 in _trace_method
File "/usr/local/lib/python3.10/site-packages/ddtrace/contrib/dbapi/__init__.py", line 91 in _trace_method
File "/usr/local/lib/python3.10/site-packages/ddtrace/tracer.py", line 842 in trace
File "/usr/local/lib/python3.10/site-packages/ddtrace/tracer.py", line 724 in _start_span
File "/usr/local/lib/python3.10/site-packages/ddtrace/provider.py", line 129 in activate

Some errors clearly implying memory corruption in random places:

...
  File "/usr/local/lib/python3.10/uuid.py", line 268, in __hash__
    return hash(self.int)
AttributeError: 'UUID' object has no attribute 'int'

A few instances of double free or corruption (fasttop), malloc(): unsorted double linked list corrupted, and other garden-variety heap corruption errors.

What is the result that you expected?

Not this 😆

emmettbutler commented 1 year ago

Thank you for the detailed writeup, @kravnaut.

My first instinct is that this could be related to https://github.com/DataDog/dd-trace-py/issues/5993.

pgrzesik commented 1 year ago

Hello 👋 We are experiencing somewhat similar issues after upgrading to Python 3.11. On Python 3.10 everything works without issues, but once we upgrade to Python 3.11 we experience regular SIGSEGVs on websocket servers based on the daphne library. I tried to debug it further with faulthandler, by inspecting a core dump, and by toggling ddtrace's debug mode, but unfortunately didn't find anything useful there.

We believe the issue is caused by ddtrace because:

  1. Turning off ddtrace resolves the problem. Turning it back on results in regular segfaults (roughly at least one segfault per hour in our case).
  2. We have never encountered the issue, and could not successfully reproduce it, in local/development environments where ddtrace is disabled.

I'm happy to share more context around our specific setup, so please let me know if there's any information you'd find useful. Below is some generic information about versions, etc.

OS: Debian GNU/Linux 11 (bullseye)
Kernel: 5.4.249-163.359.amzn2.aarch64
Python version: 3.11.4
ddtrace version: 1.17.3

Also, let me know if I should turn it into a separate issue as this one is referencing psycopg specifically in the title.

mLupine commented 1 year ago

I can confirm we're experiencing the same issue since migrating our Lambda functions to Python 3.11.

What we also noticed is that the segfault only happens if there was an unhandled exception during function execution and the Lambda handler function has a decorator. It can be any decorator, even a simple no-op one. Our debugging efforts only led us to find that the issue occurs somewhere deep within ddtrace, and setting DD_TRACE_ENABLED to false is the only "solution" we've found so far.
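
For reference, the shape of handler that triggers it for us looks roughly like the sketch below; the decorator and handler names are hypothetical, and any decorator combined with an unhandled exception reproduced the crash.

```python
# Hypothetical repro shape (names are illustrative, not our real code):
# a no-op decorator around a Lambda handler that raises an unhandled exception.
import functools


def noop(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@noop
def handler(event, context):
    # The segfault shows up after this exception propagates out of the handler.
    raise RuntimeError("boom")
```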

WarpRat commented 1 year ago

We're seeing the same thing in multiple services across multiple versions of Python 3.11 and dd-trace. We don't use psycopg in any of those services.

P403n1x87 commented 1 year ago

@kravnaut @mLupine @pgrzesik @WarpRat May I check with you whether the profiler is also enabled? We are aware of similar issues that might be caused by the profiler on CPython 3.11. So if you have the profiler enabled, would disabling it solve your problem? We expect this to be the case, but would love your confirmation 🙏 .
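
For anyone checking their own setup: the profiler is usually enabled either through the DD_PROFILING_ENABLED environment variable when launching with ddtrace-run, or through an explicit import in application code, so testing this means making sure neither is active. A minimal sketch (the app entrypoint is illustrative):

```python
# Two usual ways the profiler gets enabled; to test the workaround above,
# make sure neither is active and restart the service.
#
#   DD_PROFILING_ENABLED=true ddtrace-run python app.py   # env-var route; "app.py" is illustrative
#
# or, in application code:
import ddtrace.profiling.auto  # noqa: F401  -- importing this module starts the profiler
```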

pgrzesik commented 1 year ago

Hey @P403n1x87 - yes, we came to the same conclusion when talking with support directly. Unfortunately, turning off a useful feature is not the way to go in the long run. Recently we checked whether the issue had been solved in Python 3.11.5, but that didn't help. Our current hypothesis is that the bug might actually be in the CPython implementation, but it is not easy to reproduce.

P403n1x87 commented 1 year ago

Unfortunately, turning off a useful feature is not the way to go in the long run.

Absolutely. We're actively working on our side to get this resolved ASAP.

Our current hypothesis is that the bug might be actually in CPython implementation, but the reproduction is not necessarily easy.

Indeed this is also what our investigation is leaning towards, but we are committed to getting to the bottom of this so that we can provide a product that works. Please bear with us 🙏 .

nat45928 commented 1 year ago

@P403n1x87 We've been seeing segfaults with Python 3.11 and ddtrace frequently in our Django application. It took about a week of configuration testing to determine it was the profiler. Since disabling the profiler we've had no issues.

Please disclose these things in the known issues in the future so we can minimize the time we need to spend troubleshooting configuration issues.

sanchda commented 1 year ago

@nat45928 I'm working with @P403n1x87 on this issue--I acknowledge that diagnosing the original cause of a segmentation fault can be a time-intensive experience. I think it's doubly frustrating because an observability tool managed by an organization like Datadog isn't (yet!) everyone's first suspect for a segfault. Sorry for that. This is a new and fast-moving issue; in our other investigation, the understanding was that it was highly specific to the customer's setup. I think this thread is demonstrating that this is not the case. I'll raise the idea of marking this as a known issue--appreciate the thought!

sanchda commented 1 year ago

👋 Another dev from Datadog's Profiling team here.

Several people in this thread have identified some extremely useful circumstantial evidence for the defect, but it would help to get a little more context on the source of the crash. If anyone has the time/flexibility, performing a diagnostic operation would help us remediate this issue more quickly.

Instructions are here: https://github.com/DataDog/dd-trace-py/issues/6701#issuecomment-1717659333. Posting similar information as the responder in that other thread would be glorious. Thank you!

WarpRat commented 1 year ago

@sanchda Before discovering this thread we were working towards being able to capture core dumps for further troubleshooting. Would privately sharing one of those be sufficient, or is there additional information you'd get from adding that binary to our containers?

sanchda commented 1 year ago

@WarpRat, a corefile is better than my suggestion. That'd be wonderful.

Typically, submitting corefiles etc is something we do through our Support organization. You can submit a ticket at https://help.datadoghq.com/hc/en-us/requests/new. This is slightly annoying--I apologize for the extra step, but it would really, really help. If you mention Profiling (and possibly link this conversation), it should ping my team. This will give you a confidential channel to discuss the particular defect at your organization, plus it'll allow me to look at your organization's Datadog installation directly, which can have a lot of useful circumstantial information.

In order for me to make full use of a corefile, it would help to have some additional information:

Many, many thanks! If this is just too much extra work for you to do, I understand--let me know and we'll try to figure out the next-best-way of iterating.

sanchda commented 1 year ago

Updating the issue to assign it to myself and to mark the faulty component as Profiling. We're still not 100% sure that Profiling is a fundamental cause, but we're the side that will own coordination on this end.

Thank you for your patience, everyone!

sanchda commented 1 year ago

Hi everyone,

Still working on this one. We're investigating a few angles related to wrapping. Is anyone able to test setting the WRAPT_DISABLE_EXTENSIONS=true environment variable? I'm maybe ~35% confident it will help, but it should be a quick and easy test.
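
For anyone trying this: the variable controls whether wrapt loads its C extension, so it has to be in the process environment before wrapt (and therefore ddtrace) is first imported. A minimal sketch, assuming tracing is initialized in code rather than via ddtrace-run's sitecustomize; the entrypoint is illustrative:

```python
# Minimal sketch: force wrapt to fall back to its pure-Python implementation.
# In most deployments the variable is simply set on the container/process, e.g.:
#   WRAPT_DISABLE_EXTENSIONS=true ddtrace-run python app.py   # "app.py" is illustrative
# If set from code instead, it must happen before wrapt/ddtrace are imported:
import os

os.environ["WRAPT_DISABLE_EXTENSIONS"] = "true"

from ddtrace import patch_all  # noqa: E402

patch_all()
```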

Thanks

nat45928 commented 1 year ago

I just tried the WRAPT_DISABLE_EXTENSIONS setting in my environment and still saw segmentation faults (~30min intervals).

I'm on K8s (Microk8s, Ubuntu 22.04.1, Linux 5.15.0-58-generic) with a Python Django workload running via Daphne.

sanchda commented 1 year ago

@nat45928 many thanks for trying that out! That eliminates one potential angle.

sanchda commented 1 year ago

Hi again, everyone.

I have an x86_64+glibc release of ddtrace available, if anyone is interested in trying it out here. I had to host it out of a personal repo (sorry!), since we don't have a great workflow for sharing out-of-band custom builds.

Here's the URL. You'll have to consult the docs of your particular package management system to see how to consume this. https://github.com/sanchda/test_gcr/releases/download/v1.16.1_innerapi/ddtrace-2.1.0.dev34+g9bd1fb9bb-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Another thing that would be useful is setting the PYTHONFAULTHANDLER environment variable to true. This will surface the Python context at time-of-crash, which will give us some hints to go by.
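
If stderr is awkward to capture in your container logging setup, faulthandler can also be enabled explicitly and pointed at a file; a minimal sketch with a purely illustrative path:

```python
# Alternative to PYTHONFAULTHANDLER=true when stderr is hard to collect:
# enable faulthandler in code and keep the crash traceback in a file.
import faulthandler

crash_log = open("/tmp/python-crash.log", "w")  # path is illustrative; keep the handle open
faulthandler.enable(file=crash_log, all_threads=True)
```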

I'm not very optimistic that this will fix the problem for anyone, but it might yield some useful diagnostic information. Thank you for your patience.

kravnaut commented 1 year ago

@kravnaut @mLupine @pgrzesik @WarpRat May I check with you whether the profiler is also enabled? We are aware of similar issues that might be caused by the profiler on CPython 3.11. So if you have the profiler enabled, would disabling it solve your problem? We expect this to be the case, but would love your confirmation 🙏 .

We are not using the profiler. And as an update: after the Python 3.11 upgrade we also started to get sporadic segmentation faults...

P403n1x87 commented 1 year ago

Thanks for confirming @kravnaut. Your original issue is likely related to tracing. Anybody else using OpenAI with tracing is affected by an issue that should be resolved by #7044. There is still a chance that the same PR fixes the original issue, since the fix is for a core component that is also used in other tracing integrations, so it would be worth trying the latest release with the fix, which should come out soon. If that doesn't fix it, the tracing team will keep investigating this.

shughes-uk commented 11 months ago

We are using ddtrace 2.0.2 with profiling enabled, and see segfaults. With profiling disabled they go away.

We are not using OpenAI; we are using psycopg2. We'll try to get PYTHONFAULTHANDLER set to see if we can get y'all some logs.

jlucas91 commented 11 months ago

Hi - chiming in here to note that we're also seeing segfaults across both our instances running celery and those serving web requests via uvicorn (ddtrace 2.0.2). Turning off profiling appears to resolve the issue.

shughes-uk commented 11 months ago

Still occurring on 2.1.1. I'm having trouble getting the PYTHONFAULTHANDLER output; the way it is logged doesn't seem to play well with how dd-agent collects logs from Docker containers.

P403n1x87 commented 11 months ago

@jlucas91 @shughes-uk Thanks for the extra reports. Would you be able to test whether turning the tracer off (or just the psycopg integration) while keeping the profiler on also resolves the issue?

shughes-uk commented 10 months ago

I can try that, though our production environment runs with profiling off and does not encounter these segfaults.

I managed to get the output from one of the segfaults, and it looks like the relevant part is:

File "/opt/coiled/env/lib/python3.11/site-packages/ddtrace/profiling/collector/__init__.py", line 42 in periodic

The only other thread not just sitting on threading.wait was our code that checks memory usage; it was on a line that opens a file handle to /sys/fs/cgroup/memory.current so we can take a quick look at what cgroup says our memory usage is. Not sure if y'all interact with that or not.

WarpRat commented 10 months ago

@kravnaut @mLupine @pgrzesik @WarpRat May I check with you whether the profiler is also enabled? We are aware of similar issues that might be caused by the profiler on CPython 3.11. So if you have the profiler enabled, would disabling it solve your problem? We expect this to be the case, but would love your confirmation 🙏 .

Sorry for the late reply here. We are (were) using profiling and can confirm that disabling it resolves the segfault issue. It has been an acceptable workaround, but we're definitely eager to re-enable profiling when we can. We're also using fastapi/uvicorn for all the services where we see this. This looked promising, but we're on 3.11, not 3.12 - will that fix potentially address this issue on 3.11 too?

P403n1x87 commented 10 months ago

@WarpRat many thanks for your feedback and patience. The PR you mentioned, which we recently merged, affects CPython 3.11 too, so it is definitely worth a shot. We currently have work in progress to further improve our support of the latest CPython releases, that is, 3.11 and 3.12, so we expect to get to a position where this and similar issues are resolved with confidence in the coming months.

github-actions[bot] commented 4 months ago

This issue has been automatically closed after a period of inactivity. If it's a feature request, it has been added to the maintainers' internal backlog and will be included in an upcoming round of feature prioritization. Please comment or reopen if you think this issue was closed in error.