Talent-Catalog / talentcatalog

https://tctalent.org
GNU Affero General Public License v3.0
12 stars 4 forks source link

Restart services on Out of Memory #1063

Closed sadatmalik closed 1 month ago

sadatmalik commented 4 months ago

As a DevOps I want to ensure that the AWS task restarts upon Out of Memory errors Because we can no longer rely on the service after this state.

sadatmalik commented 3 months ago

Proposed Solution for Handling OutOfMemoryError

Simple but possibly problematic approach: One potential approach is to use a JVM flag to force the JVM to stop on an OutOfMemoryError. However, this method results in an abrupt stoppage, preventing Fargate from executing a blue-green failover. Consequently, users may experience a few minutes of downtime, during which they could encounter timeout or connection refused errors.

Maintenance Page Suggestion: It would be a good idea to set up a maintenance page in Fargate that can be displayed during complete service downtime. This way, users see a maintenance message instead of encountering timeouts or connection errors if the service fails for any reason - opening a new ticket for this.

Recommended Solution: A better solution involves using AWS CloudWatch capabilities to monitor logs for OutOfMemoryError occurrences. CloudWatch can be set up to detect these errors and generate an alarm that triggers a Fargate service restart. This method allows for blue-green failovers and minimises perceived user downtime.

Needs further investigation, but seems to be a good way to handle the OutOfMemoryError scenario without causing an inadvertent full-service disruption.

sadatmalik commented 3 months ago

Steps required:

Note - the initial solution was to trigger the redeploy via event bridge - but proved much easier (and a neater end to end solution) using lambda functions. I didn't initially go this way due to cost uncertainty - but we have 1 million free lambda calls a month so not a problem for our usage.

Keeping the below task just for documentation - it is not required and should not be reproduced in production. I have deleted the even rule in staging.

sadatmalik commented 3 months ago
  1. Create a log metric filter in CloudWatch to detect OutOfMemoryError in TC service logs.

Should now have a log metric filter with the following parameters:

Log Filter: Filter Pattern: java.lang.OutOfMemoryError Filter Name: OOMErrorFilter

Metric: Metric namespace: ECS/ContainerInsights Metric name: OOMErrors Metric value: 1

Thus:

Screenshot 2024-07-17 at 12.54.41.png
sadatmalik commented 3 months ago
  1. Create a CloudWatch Alarm based on the metric filter (to trigger when an OOM error is detected)
Screenshot 2024-07-17 at 13.20.29.png

Thus:

Screenshot 2024-07-17 at 13.41.04.png
sadatmalik commented 3 months ago

Manually triggered alarm for testing using filter pattern "uid: 141481"

Screenshot 2024-07-17 at 13.52.15.png
sadatmalik commented 3 months ago

The final solution does not use event rules - using lambda's instead - therefore, this step is not required

  1. Set up a CloudWatch Fargate Event Rule to restart the ECS service when the CloudWatch alarm triggers:

Create a CloudWatch Event Rule:

Create an EventBridge Rule

Build event pattern:

{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "state": {
      "value": ["ALARM"]
    },
    "alarmName": ["OOMErrorAlarm"]
  }
}

Select target(s):

{
  "cluster": "$.detail.clusterArn",
  "service": "$.detail.serviceArn"
}
{
  "cluster": "tctalent-test",
  "service": "tctalent-test",
  "forceNewDeployment": true
}

Tags:

Thus:

Screenshot 2024-07-17 at 15.27.06.png
sadatmalik commented 2 months ago

The final solution does not use event rules - using lambda's instead - therefore, this step is not required

At this stage, the log metric filter is working. The alarm is triggered. The event rule is being invoked, but the invocation is failing. There is no further insight in the AWS console as to why the event invocation is failing.

This step will configure a DLQ (dead letter queue) to capture the failed event invocations to see more detailed information for the failure reason:

Then:

Screenshot 2024-07-22 at 09.15.33.png

To view messages in the DLQ:

sadatmalik commented 2 months ago

The final solution does not use event rules - using lambda's instead - therefore, this step is not required

The event rule was failing due to the use of aliases for subnets and security groups. This has been resolved in the event rule definition (with comments updated further above in the appropriate set up section).

The rule is not failing now but the service is not redeploying.

I suspect that this is now due to IAM setup, which needs updating with the relevant permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecs:RunTask",
                "ecs:UpdateService",
                "ecs:DescribeServices",
                "ecs:DescribeTaskDefinition"
            ],
            "Resource": [
                "arn:aws:ecs:*:231168606641:task-definition/tctalent-test:*",
                "arn:aws:ecs:*:231168606641:task-definition/tctalent-test",
                "arn:aws:ecs:*:231168606641:service/tctalent-test/*",
                "arn:aws:ecs:*:231168606641:cluster/tctalent-test"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*",
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": "ecs-tasks.amazonaws.com"
                }
            }
        }
    ]
}
sadatmalik commented 2 months ago

At this stage decided to use AWS Lambda functions instead of persevering with event rules

Still needs work - the event rule starts a new task:

Screenshot 2024-07-22 at 16.08.03.png

But what we want is to redeploy the service, which is not happening:

Screenshot 2024-07-22 at 16.08.54.png

Needs further investigation - may need to use AWS Lambda.

sadatmalik commented 2 months ago

Creating a Lambda function to restart the ECS Fargate service:

Nb: 1 million requests / month - free

Add Python code to redeploy ECS service:

import boto3
import json
from datetime import datetime

def lambda_handler(event, context):
    ecs_client = boto3.client('ecs')

    cluster_name = 'tctalent-test'
    service_name = 'tctalent-test'

    response = ecs_client.update_service(
        cluster=cluster_name,
        service=service_name,
        forceNewDeployment=True
    )

    # Convert any datetime objects in the response to strings
    def convert_datetime(obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        raise TypeError("Type not serializable")

    # Return the response as a JSON serializable object
    return {
        'statusCode': 200,
        'body': json.loads(json.dumps(response, default=convert_datetime))
    }

Check IAM Permissions:

    {
            "Effect": "Allow",
            "Action": [
                "ecs:UpdateService",
                "ecs:DescribeServices",
                "ecs:DescribeTaskDefinition"
            ],
            "Resource": [
                "arn:aws:ecs:us-east-1:231168606641:service/tctalent-test/tctalent-test",
                "arn:aws:ecs:us-east-1:231168606641:cluster/tctalent-test"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*",
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": "ecs-tasks.amazonaws.com"
                }
            }
        }

Add a resource based policy to the lambda function to allow it to be invoked from cloud watch alarms

See: https://medium.com/@dithya512m/trigger-aws-lambda-directly-from-cloudwatch-alarm-d9844a410e8c

Notes:

sadatmalik commented 2 months ago

Lambda Testing:

Screenshot 2024-07-24 at 15.07.54.png
sadatmalik commented 2 months ago

Tested and working:

The log filter picks up the OOM exception. Triggering the configured cloud watch alarm, which calls the service deployment lambda function:

Screenshot 2024-07-24 at 14.50.04.png

And this redeploys the ECS Fargate service:

Screenshot 2024-07-24 at 11.50.11.png
sadatmalik commented 2 months ago

Prod configuration:

Screenshot 2024-07-29 at 10.03.24.png Screenshot 2024-07-29 at 12.35.59.png

Test:

Triggers the lambda:

Screenshot 2024-07-29 at 14.40.01.png

And redeploys the service:

Screenshot 2024-07-29 at 14.39.55.png
sadatmalik commented 2 months ago

Tettra Doc: https://app.tettra.co/teams/talentbeyondboundaries/pages/aws-processes