GoogleCloudPlatform / cloud-profiler-python

Stackdriver Profiler Python agent is a tool that continuously gathers CPU usage information from Python applications
Apache License 2.0
27 stars 23 forks source link

segmentation fault under high load #142

Open velichkd opened 7 months ago

velichkd commented 7 months ago

We noticed that we get sporadic restarts in one of our apps, especially when under high load.

Our apps are running on gke: k8s version: 1.26.7 machine image: cos_containerd

python version: 3.11 fastapi & uvicorn

google-cloud-profiler version: 4.1.0

I've enabled debug logging in profiler, but I don't get much more details except:

DEBUG:googlecloudprofiler.client:Starting to create profile DEBUG:googlecloudprofiler.client:Successfully created a CPU profile Segmentation fault

some additional logs: kernel: [ 8685.460126] uvicorn[86674]: segfault at 48 ip 00007f78dc517b01 sp 00007f78dcff7020 error 4 in _profiler.cpython-311-x86_64-linux-gnu.so[7f78dc514000+a5000] likely on CPU 0 (core 0, socket 0) kernel: [ 8685.477270] Code: ff ff 0f 1f 80 00 00 00 00 41 54 55 48 89 fd 53 48 85 f6 0f 84 a8 00 00 00 48 8b 46 38 48 8b 58 08 48 85 db 0f 84 8c 00 00 00 <80> 7b 45 01 74 19 48 8b 43 20 48 63 90 a8 00 00 00 48 8d 84 50 b8

have you faced similar issues before and is there a possible solution/workaround?

aabmass commented 6 months ago

Hey @velichkd thanks for reporting this issue. I had to make some changes to the C++ code for Python 3.11 in https://github.com/GoogleCloudPlatform/cloud-profiler-python/pull/137 due to changes in CPython. This seems like a bug, although things were working OK in our internal integration test suite.

Are you able to provide a repro for the bug or any chance you got a backtrace from the segfault?

aabmass commented 6 months ago

Also, could you share the exact version of python interpreter you're using?

velichkd commented 6 months ago

Hi @aabmass, thanks for replying.

Unfortunately, I don't have a repro of the issue. We keep a small private python package that contains gcp profiler, tracing, logging etc set up and reuse it in all other apps. I forgot to mention that we have two other applications with the same python version that are using the same profiler config and they are running fine. Before migrating to python 3.11 we were at 3.8 and all other apps are working fine.

Also, I couldn't get a backtrace no matter what as locally it is not reproduceable, basically due to the entire application logic and the dependencies between the apps, locally we can't put much pressure on it.

I guess it is some kind of corner case, specific just to that application, for now we postponed the python migration to 3.11 due to observed behavior.

If I can help further, let me know.

Here is the exact python version:

python --version

Python 3.11.8
aabmass commented 6 months ago

Thanks for the details! I was actually able to figure out roughly where the segfault is happening with @quentinmit's help and the code dump you provided.

I forgot to mention that we have two other applications with the same python version that are using the same profiler config and they are running fine.

Are the other applications using asyncio/uvicorn?

Any way you could try it with Python v3.11.4 or earlier?

velichkd commented 6 months ago

Hey,

Are the other applications using asyncio/uvicorn?

yeap, more or less the same setup everywhere.

Any way you could try it with Python v3.11.4 or earlier?

Might take some time, but I think it should be doable. Tomorrow I'll give it a try

velichkd commented 6 months ago

Hi,

I've just tried to run it with Python 3.11.0, unfortunately, the result is still the same:

[518728.178483] uvicorn[1860797]: segfault at 102444c7f0 ip 00007f2514726268 sp 00007ffc371eb3b0 error 4 in _profiler.cpython-311-x86_64-linux-gnu.so[7f2514720000+99000]\r\n
aabmass commented 6 months ago

Thanks! I'll see if I can reproduce it. If you find out it's anything particular in this app, please update the issue.

aabmass commented 1 month ago

After another similar report, I made an example and was able to repro the issue with simple code:

import googlecloudprofiler
import asyncio

# Profiler initialization. It starts a daemon thread which continuously
# collects and uploads profiles. Best done as early as possible.
googlecloudprofiler.start(
    service="profiler-segfault-repro",
    service_version="1.0.1",
    # verbose is the logging level. 0-error, 1-warning, 2-info,
    # 3-debug. It defaults to 0 (error) if not set.
    verbose=3,
    # project_id must be set if not running on GCP.
)

async def main():
    while True:
        await asyncio.gather(*[dostuff() for _ in range(50)])
        print("Finished awaiting 50 dostuff()")

async def dostuff():
    await asyncio.gather(*[asyncio.sleep(0.001) for _ in range(1000)])

asyncio.run(main())

This segfaults on 3.11 and 3.12.

aabmass commented 1 month ago

Using python 3.12.3, I'm noticing the segfault is really only happening when the signal handler happens to interrupt this code https://github.com/python/cpython/blob/v3.12.3/Python/ceval.c#L704 during _PyEval_EvalFrameDefault image

It looks like the new cframe is set on the PyThreadState object before it is finished being initialized. When the signal handler tries to read it, it is not yet completely initialized and has invalid pointers. Specifically, I'm thinking the issue can happen if the signal handler interrupts the code anywhere in in these lines of codes after the cframe is created and set on the thread's PyThreadState, but before cframe->current_frame is set.

I'm not sure how we can protect against this case (and other similar ones in the same function) but I'm pretty sure it's the cause.