aws / aws-lambda-python-runtime-interface-client

Apache License 2.0
258 stars 73 forks source link

`awslambdaric` prevents graceful lambda shutdown #105

Open aclemons opened 1 year ago

aclemons commented 1 year ago

Overview

I've followed the guide described in graceful-shutdown-with-aws-lambda. Unfortunately, my signal handler is never invoked.

I spent time investigating and it looks like the native code in awslambdaric is not bubbling up the signal preventing the signal handler registered from ever running. I don't have experience with writing native code python extensions, so I don't know what needs to change in runtime_client.cpp so that this works.

The graceful shutdown guide actually has a note at the end of the readme which says:

Please be aware that this feature does not work on Python 3.8 and 3.9 runtimes.

I think this is exactly because of the issue I am describing here (and mentioned here https://github.com/aws-samples/graceful-shutdown-with-aws-lambda/issues/2).

Steps to reproduce.

I took a simple lambda function handler.py:

import signal
from types import FrameType

def shutdown(_signum: int, _frame: FrameType | None) -> None:
    print("Shutdown hook invoked")

signal.signal(signal.SIGTERM, shutdown)

def lambda_handler(event, context):
    print(f"Handled event {event}")

    return "SUCCESS"

and created a lambda in eu-central-1 with this as a zip, adding the layer arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38 and invoked it once. I waited until the instance was spun down (10-15 minutes), but there was no log from the shutdown hook.

I also deployed it as a docker image with this Dockerfile:

FROM public.ecr.aws/lambda/python:3.10.2023.07.13.16

# download manually before building the image:
# curl -f -s -o lambda-insights-amd64-38.zip "$(aws lambda get-layer-version-by-arn --arn "arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38" --query 'Content.Location' --output text)"

COPY lambda-insights-amd64-38.zip /tmp
# hadolint ignore=DL3033
RUN yum install -y unzip && unzip -q -d /opt /tmp/lambda-insights-amd64-38.zip && rm /tmp/*.zip && \
    yum history undo -y last && yum clean all && rm -rf /var/cache/yum

COPY handler.py "$LAMBDA_TASK_ROOT/handler.py"

CMD ["handler.lambda_handler"]

I pushed this into ECR and then created a container-based lambda with this image again in eu-central-1 and invoked it once. Again, I waited until the instance was spun down (10-15 minutes), but there was no log from the shutdown hook.

To prove that SIGTERM was being sent to the runtime, I altered the previous image to just be a simple shell script and updated the container-based lambda from the previous step:

startup.sh

#!/usr/bin/env bash

set -e
set -o pipefail

printf 'Starting lambda handler. Running as pid %s\n' "$BASHPID"

shutdown() {
  printf 'Shutting down gracefully\n'
  exit
}
trap 'shutdown' SIGTERM

while true ; do
  HEADERS="$(mktemp)"
  BODY="$(mktemp)"
  curl -f -sS -LD "$HEADERS" -X GET "http://$AWS_LAMBDA_RUNTIME_API/2018-06-01/runtime/invocation/next" -o "$BODY"
  REQUEST_ID="$(grep -Fi Lambda-Runtime-Aws-Request-Id "$HEADERS" | tr -d '[:space:]' | cut -d: -f2)"
  rm -rf "$HEADERS"

  printf 'Processing request %s\n' "$REQUEST_ID"

  curl -f -sS -X POST "http://$AWS_LAMBDA_RUNTIME_API/2018-06-01/runtime/invocation/$REQUEST_ID/response" -d '"SUCCESS"'
  rm -rf "$BODY"
done

with this Dockerfile:

FROM public.ecr.aws/lambda/python:3.10.2023.07.13.16

# download manually before building the image:
# curl -f -s -o lambda-insights-amd64-38.zip "$(aws lambda get-layer-version-by-arn --arn "arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38" --query 'Content.Location' --output text)"

COPY lambda-insights-amd64-38.zip /tmp
# hadolint ignore=DL3033
RUN yum install -y unzip && unzip -q -d /opt /tmp/lambda-insights-amd64-38.zip && rm /tmp/*.zip && \
    yum history undo -y last && yum clean all && rm -rf /var/cache/yum

COPY startup.sh "$LAMBDA_TASK_ROOT/startup.sh"

ENTRYPOINT ["/var/runtime/startup.sh"]

I invoked the lambda and got the "SUCCESS" response. I waited for it to spin down and checked the cloudwatch logs again:

LOGS    Name: cloudwatch_lambda_agent   State: Subscribed   Types: [Platform]
Starting lambda handler. Running as pid 12
Name: cloudwatch_lambda_agent   State: Ready    Events: [INVOKE, SHUTDOWN]
START RequestId: 1127e4c8-97e8-4ef9-9c52-e24673eb40a7 Version: $LATEST
Processing request 1127e4c8-97e8-4ef9-9c52-e24673eb40a7
{""status"":""OK""}
END RequestId: 1127e4c8-97e8-4ef9-9c52-e24673eb40a7
REPORT RequestId: 1127e4c8-97e8-4ef9-9c52-e24673eb40a7  Duration: 304.42 ms Billed Duration: 590 ms Memory Size: 128 MB Max Memory Used: 35 MB  Init Duration: 285.31 ms    
Terminated
Shutting down gracefully

Success, so I knew that SIGTERM was being sent. After this, I suspected that it had to be somewhere in the runtime handling for python. While reading the source code for awslambdaric I realised part of it was written in cpp and I wondered if this might be the issue.

To test whether this was the problem, I replaced the runtime_client written in cpp, with one written in python.

runtime_client_py.py

# Reimplementation of runtime_client.cpp in python

import http
import http.client
import os

lambda_runtime_address = os.environ["AWS_LAMBDA_RUNTIME_API"]
user_agent = "awslambdaric"

class RuntimeClient:
    @staticmethod
    def initialize_client(user_agent: str) -> None:
        user_agent = user_agent

    @staticmethod
    def next() -> tuple[bytes, dict]:
        endpoint = "/2018-06-01/runtime/invocation/next"
        runtime_connection = http.client.HTTPConnection(lambda_runtime_address)
        runtime_connection.connect()

        runtime_connection.request("GET", endpoint, headers={"User-Agent": user_agent})
        response = runtime_connection.getresponse()
        response_body = response.read()
        raw_headers = response.headers

        headers = {
            "Lambda-Runtime-Aws-Request-Id": raw_headers.get(
                "Lambda-Runtime-Aws-Request-Id"
            ),
            "Lambda-Runtime-Trace-Id": raw_headers.get("Lambda-Runtime-Trace-Id")
            or None,
            "Lambda-Runtime-Invoked-Function-Arn": raw_headers.get(
                "Lambda-Runtime-Invoked-Function-Arn"
            ),
            "Lambda-Runtime-Deadline-Ms": int(
                raw_headers.get("Lambda-Runtime-Deadline-Ms") or "0"
            ),
            "Lambda-Runtime-Client-Context": raw_headers.get(
                "Lambda-Runtime-Client-Context"
            )
            or None,
            "Lambda-Runtime-Cognito-Identity": raw_headers.get(
                "Lambda-Runtime-Cognito-Identity"
            )
            or None,
            "Content-Type": raw_headers.get("Content-Type") or None,
        }

        return response_body, headers

    @staticmethod
    def post_invocation_result(
        invoke_id: str, result_data: bytes, content_type: str
    ) -> None:
        from awslambdaric.lambda_runtime_client import LambdaRuntimeClientError

        endpoint = f"/2018-06-01/runtime/invocation/{invoke_id}/response"
        runtime_connection = http.client.HTTPConnection(lambda_runtime_address)
        runtime_connection.connect()

        runtime_connection.request(
            "POST",
            endpoint,
            body=result_data,
            headers={"User-Agent": user_agent, "Content-Type": content_type},
        )

        response = runtime_connection.getresponse()

        if response.status != http.HTTPStatus.ACCEPTED:
            raise LambdaRuntimeClientError(endpoint, response.status, None)

    @staticmethod
    def post_error(invoke_id: str, error_response_data: bytes, xray_fault: str) -> None:
        endpoint = f"/2018-06-01/runtime/invocation/{invoke_id}/error"
        runtime_connection = http.client.HTTPConnection(lambda_runtime_address)
        runtime_connection.connect()

        runtime_connection.request(
            "POST",
            endpoint,
            body=error_response_data,
            headers={
                "User-Agent": user_agent,
                "Content-Type": "application/json",
                "Lambda-Runtime-Function-Error-Type": "UnhandledContent",
                "Lambda-Runtime-Function-xray-error-cause": xray_fault,
            },
        )

runtime_client = RuntimeClient()

and I put this in an image with this Dockerfile:

FROM public.ecr.aws/lambda/python:3.10.2023.07.13.16

# download manually before building the image:
# curl -f -s -o lambda-insights-amd64-38.zip "$(aws lambda get-layer-version-by-arn --arn "arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38" --query 'Content.Location' --output text)"

COPY lambda-insights-amd64-38.zip /tmp
# hadolint ignore=DL3033
RUN yum install -y unzip && unzip -q -d /opt /tmp/lambda-insights-amd64-38.zip && rm /tmp/*.zip && \
    yum history undo -y last && yum clean all && rm -rf /var/cache/yum

RUN rm "$LAMBDA_RUNTIME_DIR/runtime_client.cpython"*.so && sed -i 's/import runtime_client/from awslambdaric.runtime_client import runtime_client/' "$LAMBDA_RUNTIME_DIR/awslambdaric/lambda_runtime_client.py"
COPY runtime_client_py.py "$LAMBDA_RUNTIME_DIR/awslambdaric/runtime_client.py"
COPY handler.py "$LAMBDA_TASK_ROOT/handler.py"

CMD ["handler.lambda_handler"]

I updated the image used by the lambda again and invoked it and waited for it to spin back down. I checked the cloudwatch logs again:

LOGS    Name: cloudwatch_lambda_agent   State: Subscribed   Types: [Platform]
EXTENSION   Name: cloudwatch_lambda_agent   State: Ready    Events: [INVOKE, SHUTDOWN]
START RequestId: faab07eb-60f6-47c6-b5ba-0dced4097ec9 Version: $LATEST
Handled event {'test': 'value'}
END RequestId: faab07eb-60f6-47c6-b5ba-0dced4097ec9
REPORT RequestId: faab07eb-60f6-47c6-b5ba-0dced4097ec9  Duration: 48.67 ms  Billed Duration: 383 ms Memory Size: 128 MB Max Memory Used: 45 MB  Init Duration: 333.96 ms    
Shutdown hook invoked

Success! So indeed it looks like the native code needs some adjustment to handle the signal.

Unfortunately, I can't offer a pull request to fix this since I don't know enough about how this works. Some searching points out that perhaps the code needs to be using PyErr_CheckSignals. awslambdaric is also using aws-lambda-runtime internally, so there might be some interaction there too that needs even more handling.

Anyway, I hope these reproduction steps can help someone with the skills find the correct solution.

Thanks.

tgsong commented 11 months ago

I just ran into the same problem and found this GitHub issue.

Thanks for all the details @aclemons provide, it helped me to locate the problem. The issue is that, awslambdaric uses aws-lambda-cpp under the hood, which uses libcurl to poll for lambda event, and this is a blocking call.

In more details, what happens is that this line would call into the C extension, https://github.com/aws/aws-lambda-python-runtime-interface-client/blob/2d0c421ec588a05e6efd8773fa1416fe5433bae0/awslambdaric/lambda_runtime_client.py#L68

which then calls this line and it blocks the main thread, https://github.com/aws/aws-lambda-python-runtime-interface-client/blob/2d0c421ec588a05e6efd8773fa1416fe5433bae0/awslambdaric/runtime_client.cpp#L35

According to the Python signal document, even if SIGTERM is received, the registered signal handler will not be called until the current instruction finishes. In our case, the Python interpreter is waiting for runtime_client.next() to finish. This also explains why the pure Python reimplementation of runtime_client will work.

Fortunately, the solution is simple, we can just move the blocking call into a thread to unblock the main thread. Since the C extension will release the GIL (#33) before polling the lambda API, main thread will be able to call the signal handler now.

with ThreadPoolExecutor() as e:
    fut = e.submit(runtime_client.next)
response_body, headers = fut.result()
smirnoal commented 11 months ago

@tgsong do you mind creating a PR with that change?

tgsong commented 11 months ago

@smirnoal done, #115