Restart services on Out of Memory

sadatmalik commented 4 months ago

As a DevOps engineer, I want the AWS task to restart on Out of Memory errors, because we can no longer rely on the service once it is in this state.

sadatmalik commented 3 months ago

Proposed Solution for Handling OutOfMemoryError

Simple but possibly problematic approach: One option is to use a JVM flag (for example -XX:+ExitOnOutOfMemoryError) to force the JVM to stop on an OutOfMemoryError. However, this results in an abrupt stoppage, preventing Fargate from executing a blue-green failover. Consequently, users may experience a few minutes of downtime, during which they could encounter timeout or connection refused errors.

Maintenance Page Suggestion: It would be a good idea to set up a maintenance page in Fargate that can be displayed during complete service downtime. This way, users see a maintenance message instead of timeouts or connection errors if the service fails for any reason. I am opening a new ticket for this.

Recommended Solution: A better solution involves using AWS CloudWatch capabilities to monitor logs for OutOfMemoryError occurrences. CloudWatch can be set up to detect these errors and generate an alarm that triggers a Fargate service restart. This method allows for blue-green failovers and minimises perceived user downtime.

This needs further investigation, but it seems to be a good way to handle the OutOfMemoryError scenario without causing an inadvertent full-service disruption.

sadatmalik commented 3 months ago

Steps required:

[x] Configure ECS Task to send logs to CloudWatch (already done).
[x] Create a CloudWatch Log Metric Filter for OutOfMemoryError.
[x] Create a Lambda function to force a redeployment of the ECS Service on Fargate
[x] Create a CloudWatch Alarm based on the metric filter that will directly invoke the redeployment lambda

Note: the initial solution was to trigger the redeploy via EventBridge, but invoking a Lambda function directly from the alarm proved much easier and gives a neater end-to-end solution. I didn't go this way at first because of cost uncertainty, but we have 1 million free Lambda calls a month, so cost is not a problem for our usage.

Keeping the task below just for documentation. It is not required and should not be reproduced in production. I have deleted the event rule in staging.

[x] Create a CloudWatch (Fargate) Event Rule to invoke the redeployment Lambda when the alarm triggers.

sadatmalik commented 3 months ago

Create a log metric filter in CloudWatch to detect OutOfMemoryError in TC service logs.

[x] Navigate to the CloudWatch console.
[x] Choose "Logs" in the navigation pane.
[x] Select the log group associated with your ECS tasks (e.g. /fargate/service/tctalent-test-fargate-log in staging)
[x] Select "Metric Filter" tab and click "Create metric filter"
[x] In "Filter Pattern", enter a pattern to match "java.lang.OutOfMemoryError".
[x] Select "Next" to proceed to "Assign metric".
[x] Provide a filter name: "OOMErrorFilter".
[x] Under "Metric Details", provide the namespace and metric name: select the existing namespace "ECS/ContainerInsights" and enter the metric name "OOMErrors".
[x] Set Metric value to "1"
[x] Select "Next" then "Create filter".

Should now have a log metric filter with the following parameters:

Log Filter:
- Filter Pattern: java.lang.OutOfMemoryError
- Filter Name: OOMErrorFilter

Metric:
- Metric namespace: ECS/ContainerInsights
- Metric name: OOMErrors
- Metric value: 1
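
For reference, the same filter could probably be created from a script instead of the console. This is only a minimal sketch using the boto3 `put_metric_filter` call. The function names are hypothetical, and the double quotes around the pattern are an assumption on my part: as far as I understand the filter pattern syntax, a term containing "." has to be quoted.

```python
def build_metric_filter_request(log_group_name):
    """Build keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group_name,
        "filterName": "OOMErrorFilter",
        # Assumption: terms containing "." are wrapped in double quotes
        "filterPattern": '"java.lang.OutOfMemoryError"',
        "metricTransformations": [
            {
                "metricNamespace": "ECS/ContainerInsights",
                "metricName": "OOMErrors",
                "metricValue": "1",
            }
        ],
    }


def create_metric_filter(log_group_name):
    """Create (or overwrite) the filter; needs AWS credentials and boto3."""
    import boto3  # imported lazily so the request builder can be checked offline

    logs = boto3.client("logs")
    logs.put_metric_filter(**build_metric_filter_request(log_group_name))
```

In staging this would be called as `create_metric_filter("/fargate/service/tctalent-test-fargate-log")`.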


sadatmalik commented 3 months ago

Create a CloudWatch Alarm based on the metric filter (to trigger when an OOM error is detected)

[x] Navigate to the CloudWatch console.
[x] Choose "All alarms" in the navigation pane.
[x] Select "Create Alarm".
[x] Choose "Select metric".
[x] Navigate to the namespace and metric created in the previous step (namespace ECS/ContainerInsights, metric OOMErrors).
- [x] The metric is not listed at first. NB: it is not displayed until the filter has been triggered, so temporarily change the filter pattern to "uid: 141481" to force a match. After doing this, it is displayed in the metrics view.

[x] Select the metric and choose "Select metric".
[x] Configure the alarm details:
- [x] Set the threshold type to "Static".
- [x] Select Greater/Equal
- [x] Define the threshold value as 1.
- [x] Set the period to match monitoring needs (5 minutes chosen).
- [x] Choose "Next".
[x] In the "Configure actions" step, add a Lambda action
- [x] Alarm state trigger: In alarm
- [x] Select lambda from signed in account
- [x] Choose ECSServiceRedeployFunction
- [x] Choose "Next".
[x] Provide a name for the alarm ("OOMErrorAlarm").
[x] Review the settings and choose "Create alarm".
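
The console steps above could roughly be scripted with the boto3 `put_metric_alarm` call. This is a sketch only: the function names are hypothetical, the Lambda ARN has to be supplied by the caller, and the `Statistic` and `TreatMissingData` values are my assumptions, since the steps above do not record them.

```python
def build_alarm_request(lambda_arn, period_seconds=300):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": "OOMErrorAlarm",
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "OOMErrors",
        "Statistic": "Sum",  # assumption: count matches within the period
        "Period": period_seconds,  # 5 minutes, as chosen in the console
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no log match means OK
        "AlarmActions": [lambda_arn],  # invoked when the alarm enters ALARM
    }


def create_alarm(lambda_arn):
    """Create (or overwrite) the alarm; needs AWS credentials and boto3."""
    import boto3  # imported lazily so the request builder can be checked offline

    boto3.client("cloudwatch").put_metric_alarm(**build_alarm_request(lambda_arn))
```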


sadatmalik commented 3 months ago

Manually triggered alarm for testing using filter pattern "uid: 141481"

[x] Remember to revert the filter pattern back to java.lang.OutOfMemoryError after testing
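
A possible alternative for testing, which I did not use here, is the CloudWatch `set_alarm_state` call. It temporarily forces the alarm into the ALARM state without touching the filter pattern, so there is nothing to revert afterwards. A minimal sketch with hypothetical function names; I have not confirmed that it triggers the Lambda action in this setup.

```python
def build_set_alarm_state_request(alarm_name="OOMErrorAlarm"):
    """Build keyword arguments for the CloudWatch set_alarm_state call."""
    return {
        "AlarmName": alarm_name,
        "StateValue": "ALARM",
        "StateReason": "Manual test of the OOM redeploy path",
    }


def force_alarm(alarm_name="OOMErrorAlarm"):
    """Temporarily put the alarm into ALARM; needs AWS credentials and boto3."""
    import boto3  # imported lazily so the request builder can be checked offline

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.set_alarm_state(**build_set_alarm_state_request(alarm_name))
```

The alarm returns to its real state at the next evaluation of the metric.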

sadatmalik commented 3 months ago

The final solution does not use event rules (it uses Lambda functions instead), so this step is not required.

Set up a CloudWatch Fargate Event Rule to restart the ECS service when the CloudWatch alarm triggers:

Create a CloudWatch Event Rule:

[x] Go to the CloudWatch console.
[x] Choose "Rules" and then "Create rule" [This re-routes to Amazon EventBridge]

Create an EventBridge Rule

[x] Choose "Create rule".
[x] Enter a name and description for the rule: "RestartECSCrudServiceOnOOMError"
[x] Event bus: default
[x] Enable the rule on the selected event bus: Enabled (toggle on)
[x] Rule type: Rule with an event pattern
[x] Select "Next"

Build event pattern:

[x] For Event source select "AWS events or EventBridge partner events".
[x] Creation method: "Custom pattern (JSON editor)"
[x] Use the following event pattern:

{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "state": {
      "value": ["ALARM"]
    },
    "alarmName": ["OOMErrorAlarm"]
  }
}

[x] Click "Next"
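
To sanity-check a pattern like this before saving the rule, a very simplified local matcher can help. This sketch only handles the exact-value lists and nested objects used in the pattern above, not the full EventBridge pattern syntax, and the sample events are made up for illustration.

```python
PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "state": {"value": ["ALARM"]},
        "alarmName": ["OOMErrorAlarm"],
    },
}


def matches(pattern, event):
    """Return True if every key in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the corresponding part of the event
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must be one of the listed values
            return False
    return True


def sample_event(state):
    """Hypothetical, heavily trimmed alarm state change event."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "detail": {"alarmName": "OOMErrorAlarm", "state": {"value": state}},
    }
```

With this, `matches(PATTERN, sample_event("ALARM"))` is true and `matches(PATTERN, sample_event("OK"))` is false.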

Select target(s):

[x] Target types: select AWS service
[x] Select a target: Lambda function
[x] Function: ECSServiceRedeployFunction
[x] Select "Next", "Next", "Create Rule"
[x] Configure Target Details
- [x] Cluster: Select your ECS cluster. (e.g. tctalent-test)
- [x] Task Definition: Enter the ARN of your task definition (e.g. tctalent-test)
- [x] Select "Latest"
- [x] Task Count: Set to 2
[x] Compute options
- [x] Select Launch Type
- [x] Select FARGATE
[x] Network Configuration:
- [x] Configure the network settings to match the Fargate service - find this in ECS | Clusters | Configuration and networking.
- [x] Use the same subnets: copy and paste the IDs (do not use aliases, as the invocation then fails; possibly an AWS bug?)
- [x] Use the same security group: copy and paste the ID. Do not use aliases, or the event will not fire.
- [x] Allow auto created execution role
[x] Additional settings:
- [x] Configure target input: Select "Input transformer" from the dropdown
- [x] Click "Configure Input Transformer":
- [x] In Target Input Transformer enter this for "Input Path":

{
  "cluster": "$.detail.clusterArn",
  "service": "$.detail.serviceArn"
}

[x] And this for the "Input Template":

{
  "cluster": "tctalent-test",
  "service": "tctalent-test",
  "forceNewDeployment": true
}

[x] Click "Confirm"
[x] Select "Next"

Tags:

[x] Key: "tc-oom-restart-event-rule"
[x] Select "Next"
[x] Review and select "Create rule"


sadatmalik commented 2 months ago

The final solution does not use event rules (it uses Lambda functions instead), so this step is not required.

At this stage, the log metric filter is working. The alarm is triggered. The event rule is being invoked, but the invocation is failing. There is no further insight in the AWS console as to why the event invocation is failing.

This step will configure a DLQ (dead letter queue) to capture the failed event invocations to see more detailed information for the failure reason:

[x] Go to Amazon SQS console and click "Create queue"
[x] Type: "Standard" (I tried FIFO queues, but they do not work as a DLQ here)
[x] Name: "OOMErrorQueue"
[x] Use all default settings - click "Create queue"

Then:

[x] Edit the EventBridge rule
[x] In "Additional settings", select the SQS queue as the DLQ.

[x] Select Next and update rule

To view messages in the DLQ:

[x] Select the OOMErrorQueue in the SQS console
[x] Click on the "Send and receive messages" button
[x] In the "Receive messages" section, click on the "Poll for messages" to retrieve messages from the queue
[x] Once messages are retrieved, click on a message to view its details
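
The same messages could also be read from a script with the boto3 SQS `receive_message` call. A minimal sketch with hypothetical function names. The attribute names `ERROR_CODE`, `ERROR_MESSAGE` and `TARGET_ARN` are what I understand EventBridge attaches to dead-lettered events, so treat them as an assumption.

```python
def summarise_dlq_message(message):
    """Pull the failure details out of one SQS message dict."""
    attrs = message.get("MessageAttributes", {})

    def attr(name):
        # Each attribute is a dict such as {"DataType": ..., "StringValue": ...}
        return attrs.get(name, {}).get("StringValue")

    return {
        "error_code": attr("ERROR_CODE"),
        "error_message": attr("ERROR_MESSAGE"),
        "target_arn": attr("TARGET_ARN"),
        "body": message.get("Body"),
    }


def read_dlq(queue_url):
    """Poll the queue once; needs AWS credentials and boto3."""
    import boto3  # imported lazily so the summary helper can be checked offline

    sqs = boto3.client("sqs")
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=5,
        MessageAttributeNames=["All"],
    )
    return [summarise_dlq_message(m) for m in response.get("Messages", [])]
```

Messages are only read here, not deleted, so they reappear in the queue after the visibility timeout.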

sadatmalik commented 2 months ago

The final solution does not use event rules (it uses Lambda functions instead), so this step is not required.

The event rule was failing due to the use of aliases for subnets and security groups. This has been resolved in the event rule definition (with comments updated further above in the appropriate set up section).

The rule is not failing now but the service is not redeploying.

I suspect that this is now due to IAM setup, which needs updating with the relevant permissions.

[x] Go to IAM
[x] Find and select the EventBridge policy in Policies
[x] Edit the JSON as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecs:RunTask",
                "ecs:UpdateService",
                "ecs:DescribeServices",
                "ecs:DescribeTaskDefinition"
            ],
            "Resource": [
                "arn:aws:ecs:*:231168606641:task-definition/tctalent-test:*",
                "arn:aws:ecs:*:231168606641:task-definition/tctalent-test",
                "arn:aws:ecs:*:231168606641:service/tctalent-test/*",
                "arn:aws:ecs:*:231168606641:cluster/tctalent-test"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*",
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": "ecs-tasks.amazonaws.com"
                }
            }
        }
    ]
}

[x] Save changes

sadatmalik commented 2 months ago

At this stage I decided to use AWS Lambda functions instead of persevering with event rules.

The event rule approach still needs work: the rule starts a new task.

What we want is to redeploy the service, which is not happening.

This needs further investigation and may require AWS Lambda.

sadatmalik commented 2 months ago

Creating a Lambda function to restart the ECS Fargate service:

NB: 1 million requests per month are free.

[x] Navigate to AWS Lambda console
[x] Click "Create function"
[x] Function Name: ECSServiceRedeployFunction
[x] Runtime: Python 3.12
[x] Architecture: default - x86_64
[x] Default - Create a new role with basic Lambda permissions (should be checked)
[x] Click "Create function"

Add Python code to redeploy ECS service:

import boto3
import json
from datetime import datetime

def lambda_handler(event, context):
    ecs_client = boto3.client('ecs')

    cluster_name = 'tctalent-test'
    service_name = 'tctalent-test'

    response = ecs_client.update_service(
        cluster=cluster_name,
        service=service_name,
        forceNewDeployment=True
    )

    # Convert any datetime objects in the response to strings
    def convert_datetime(obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        raise TypeError("Type not serializable")

    # Return the response as a JSON serializable object
    return {
        'statusCode': 200,
        'body': json.loads(json.dumps(response, default=convert_datetime))
    }
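
To check the handler logic without touching AWS, the ECS call can be separated from the client creation and exercised with a stub. This is a sketch, not the deployed function: `redeploy` and `FakeEcsClient` are hypothetical names, and `default=str` is a shortcut that stringifies any non-JSON value, not only datetimes.

```python
import json
from datetime import datetime, timezone


def redeploy(ecs_client, cluster_name, service_name):
    """Force a new deployment and return a JSON-serialisable result."""
    response = ecs_client.update_service(
        cluster=cluster_name,
        service=service_name,
        forceNewDeployment=True,
    )
    # Round-trip through JSON so datetime values become plain strings
    return {
        "statusCode": 200,
        "body": json.loads(json.dumps(response, default=str)),
    }


class FakeEcsClient:
    """Stand-in for boto3.client('ecs') that records calls."""

    def __init__(self):
        self.calls = []

    def update_service(self, **kwargs):
        self.calls.append(kwargs)
        return {
            "service": {
                "serviceName": kwargs["service"],
                "createdAt": datetime(2024, 1, 1, tzinfo=timezone.utc),
            }
        }
```

In the real Lambda, `lambda_handler` would call `redeploy(boto3.client('ecs'), 'tctalent-test', 'tctalent-test')`.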

Check IAM Permissions:

[x] Go to the IAM console
[x] Select the IAM role that was automatically created when configuring the Lambda, e.g. ECSServiceRedeployFunction-role-3pc7v6aa
[x] Update the IAM policy to include the ecs:UpdateService action: select the policy, edit the permissions (JSON) and add the following statements:

{
    "Effect": "Allow",
    "Action": [
        "ecs:UpdateService",
        "ecs:DescribeServices",
        "ecs:DescribeTaskDefinition"
    ],
    "Resource": [
        "arn:aws:ecs:us-east-1:231168606641:service/tctalent-test/tctalent-test",
        "arn:aws:ecs:us-east-1:231168606641:cluster/tctalent-test"
    ]
},
{
    "Effect": "Allow",
    "Action": "iam:PassRole",
    "Resource": "*",
    "Condition": {
        "StringLike": {
            "iam:PassedToService": "ecs-tasks.amazonaws.com"
        }
    }
}

[x] Select Next then Save changes

Add a resource-based policy to the Lambda function to allow it to be invoked from CloudWatch alarms:

[x] Select the lambda function from the Lambda console
[x] In the Configuration tab, in the Permissions section, click "Add permissions"
[x] Grant permission to "AWS service"
[x] Service: Other
[x] Statement ID: AllowCloudWatchInvoke
[x] Principal: lambda.alarms.cloudwatch.amazonaws.com
[x] Source ARN: (cloud watch alarm arn) - e.g. arn:aws:cloudwatch:us-east-1:231168606641:alarm:OOMErrorAlarm
[x] Action: lambda:InvokeFunction

See: https://medium.com/@dithya512m/trigger-aws-lambda-directly-from-cloudwatch-alarm-d9844a410e8c
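
The same permission could probably be added with the boto3 Lambda `add_permission` call rather than through the console. A minimal sketch using the values from the steps above; the function names are hypothetical.

```python
def build_invoke_permission_request(alarm_arn):
    """Build keyword arguments for the Lambda add_permission call."""
    return {
        "FunctionName": "ECSServiceRedeployFunction",
        "StatementId": "AllowCloudWatchInvoke",
        "Action": "lambda:InvokeFunction",
        "Principal": "lambda.alarms.cloudwatch.amazonaws.com",
        # Restrict the permission to this one alarm
        "SourceArn": alarm_arn,
    }


def allow_alarm_to_invoke(alarm_arn):
    """Attach the resource-based policy; needs AWS credentials and boto3."""
    import boto3  # imported lazily so the request builder can be checked offline

    lambda_client = boto3.client("lambda")
    lambda_client.add_permission(**build_invoke_permission_request(alarm_arn))
```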

Notes:

UpdateService - permits a forced new deployment
DescribeServices - allows retrieval of ECS service status for CloudWatch logging
DescribeTaskDefinition - allows retrieval of task definition details
iam:PassRole - allows the caller to pass to ECS the IAM roles that the service and tasks have been configured to run with

sadatmalik commented 2 months ago

Lambda Testing:

[x] Test that the Lambda works and triggers a service redeployment
[x] What happens if the Lambda is triggered again mid-redeployment? This shouldn't happen, as the alarm is throttled by the 5 minute metric period, but check that it is safe just in case. Result: it works without issue and drains the previous 2 deployments. Note that it can take a few minutes longer to catch up.

[ ] Remember to revert the log filter pattern back to java.lang.OutOfMemoryError after testing
[ ] Switch the alarm period back to 5 minutes (I switched to 1 minute for testing)
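
To see from a script whether a redeployment is still rolling out (for example before triggering the Lambda a second time), the `deployments` list returned by the ECS `describe_services` call can be inspected. A rough sketch with hypothetical function names. Treating "more than one deployment" as "in progress" is my own simplification.

```python
def rollout_in_progress(service):
    """Return True if the service description shows an unfinished rollout."""
    deployments = service.get("deployments", [])
    # More than one deployment means an older one is still draining
    if len(deployments) > 1:
        return True
    return any(d.get("rolloutState") == "IN_PROGRESS" for d in deployments)


def service_is_redeploying(cluster_name, service_name):
    """Look up the live service; needs AWS credentials and boto3."""
    import boto3  # imported lazily so the check above can be tested offline

    ecs = boto3.client("ecs")
    response = ecs.describe_services(cluster=cluster_name, services=[service_name])
    services = response.get("services", [])
    return bool(services) and rollout_in_progress(services[0])
```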

sadatmalik commented 2 months ago

Tested and working:

The log filter picks up the OOM exception and triggers the configured CloudWatch alarm, which calls the service redeployment Lambda function.

This in turn redeploys the ECS Fargate service.

sadatmalik commented 2 months ago

Prod configuration:

[x] Set up metric filter

[x] Create lambda function
[x] Set up CloudWatch Alarm

Test:

[x] Confirm that the service redeploys, using an easily reproducible log pattern, e.g. "uid: 141481"

This triggers the Lambda and redeploys the service.

[ ] Change log metric filter back to "java.lang.OutOfMemoryError" after running tests.
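
Since the revert is easy to forget, a small check against the CloudWatch Logs `describe_metric_filters` call could flag a filter that is still on the test pattern. A sketch only, with hypothetical function names. It accepts the pattern with or without surrounding double quotes because I am not sure which form was saved.

```python
EXPECTED_TERM = "java.lang.OutOfMemoryError"


def filters_needing_revert(metric_filters):
    """Return names of filters whose pattern is not the OOM pattern."""
    stale = []
    for metric_filter in metric_filters:
        pattern = metric_filter.get("filterPattern", "").strip().strip('"')
        if pattern != EXPECTED_TERM:
            stale.append(metric_filter.get("filterName"))
    return stale


def check_log_group(log_group_name):
    """Fetch the OOM filter for a log group; needs AWS credentials and boto3."""
    import boto3  # imported lazily so the check above can be tested offline

    logs = boto3.client("logs")
    response = logs.describe_metric_filters(
        logGroupName=log_group_name,
        filterNamePrefix="OOMErrorFilter",
    )
    return filters_needing_revert(response.get("metricFilters", []))
```

An empty list means the filter is back on the production pattern.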

sadatmalik commented 2 months ago

Tettra Doc: https://app.tettra.co/teams/talentbeyondboundaries/pages/aws-processes
