Closed sadatmalik closed 1 month ago
Proposed Solution for Handling OutOfMemoryError
Simple but possibly problematic approach: One potential approach is to use a JVM flag (e.g. `-XX:+ExitOnOutOfMemoryError`) to force the JVM to exit on an OutOfMemoryError. However, this stops the container abruptly, preventing Fargate from executing a blue-green failover. Consequently, users may experience a few minutes of downtime, during which they could encounter timeout or connection-refused errors.
Maintenance Page Suggestion: It would be a good idea to set up a maintenance page in Fargate that can be displayed during complete service downtime. That way, users see a maintenance message instead of timeouts or connection errors if the service fails for any reason. A new ticket will be opened for this.
Recommended Solution: A better solution involves using AWS CloudWatch capabilities to monitor logs for OutOfMemoryError occurrences. CloudWatch can be set up to detect these errors and generate an alarm that triggers a Fargate service restart. This method allows for blue-green failovers and minimises perceived user downtime.
Needs further investigation, but seems to be a good way to handle the OutOfMemoryError scenario without causing an inadvertent full-service disruption.
Steps required:
Note - the initial solution was to trigger the redeploy via EventBridge - but it proved much easier (and a neater end-to-end solution) to use Lambda functions. I didn't initially go this way due to cost uncertainty - but we have 1 million free Lambda calls a month, so not a problem for our usage.
Keeping the below task just for documentation - it is not required and should not be reproduced in production. I have deleted the event rule in staging.
Should now have a log metric filter with the following parameters:

Log filter:
- Filter pattern: `java.lang.OutOfMemoryError`
- Filter name: `OOMErrorFilter`

Metric:
- Metric namespace: `ECS/ContainerInsights`
- Metric name: `OOMErrors`
- Metric value: `1`
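For reference, the same filter can be created programmatically with boto3's `put_metric_filter`. This sketch only builds the request parameters (so it runs without AWS credentials); the log group name is a placeholder assumption, not taken from the actual setup:

```python
# Sketch: request parameters for logs.put_metric_filter, mirroring the
# console configuration above. The log group name is a placeholder.
def metric_filter_params(log_group="/ecs/tctalent-test"):
    return {
        "logGroupName": log_group,
        "filterName": "OOMErrorFilter",
        "filterPattern": "java.lang.OutOfMemoryError",
        "metricTransformations": [{
            "metricName": "OOMErrors",
            "metricNamespace": "ECS/ContainerInsights",
            "metricValue": "1",
        }],
    }

# In real use: boto3.client("logs").put_metric_filter(**metric_filter_params())
```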
Thus:
[x] Select the metric and choose "Select metric".
[x] Configure the alarm details:
[x] In the "Configure actions" step, add a Lambda action
[x] Provide a name for the alarm ("OOMErrorAlarm").
[x] Review the settings and choose "Create alarm".
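The alarm steps above can likewise be sketched with boto3's `put_metric_alarm`. The period, threshold, and missing-data treatment below are assumptions (they match a "fire on any OOM within a minute" reading of the setup); again only the request parameters are built:

```python
# Sketch: parameters for cloudwatch.put_metric_alarm matching the console
# steps above. Period/threshold/missing-data values are assumptions.
def oom_alarm_params(lambda_arn):
    return {
        "AlarmName": "OOMErrorAlarm",
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "OOMErrors",
        "Statistic": "Sum",
        "Period": 60,                        # evaluate over 1-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no OOM log lines means healthy
        "AlarmActions": [lambda_arn],        # invoke the redeploy Lambda
    }

# In real use:
# boto3.client("cloudwatch").put_metric_alarm(**oom_alarm_params(arn))
```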
Thus:
Manually triggered alarm for testing using filter pattern "uid: 141481"
Create an EventBridge rule (formerly a CloudWatch Events rule):
Build event pattern:
```json
{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "state": {
      "value": ["ALARM"]
    },
    "alarmName": ["OOMErrorAlarm"]
  }
}
```
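For illustration, the pattern above can be checked against a sample alarm state-change event with a minimal matcher. This reimplements only what this specific pattern needs, not EventBridge's full pattern semantics:

```python
# Sketch: a minimal check of whether an event would match the
# OOMErrorAlarm event pattern above (illustrative only).
def matches_oom_alarm(event):
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.cloudwatch"
        and event.get("detail-type") == "CloudWatch Alarm State Change"
        and detail.get("alarmName") == "OOMErrorAlarm"
        and detail.get("state", {}).get("value") == "ALARM"
    )

sample = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "OOMErrorAlarm", "state": {"value": "ALARM"}},
}
# matches_oom_alarm(sample) -> True; an event for a different alarm
# name, or an ALARM->OK transition, would not match.
```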
Select target(s):
[x] Target types: select AWS service
[x] Select a target: Lambda function
[x] Function: ECSServiceReployFunction
[x] Select "Next", "Next", "Create Rule"
[x] Configure Target Details
[x] Compute options
[x] Network Configuration:
[x] Additional settings:
Input path (pulling values from the event):

```json
{
  "cluster": "$.detail.clusterArn",
  "service": "$.detail.serviceArn"
}
```

Target input (static values used for the test service):

```json
{
  "cluster": "tctalent-test",
  "service": "tctalent-test",
  "forceNewDeployment": true
}
```
Tags:
Thus:
At this stage, the log metric filter is working. The alarm is triggered. The event rule is being invoked, but the invocation is failing. There is no further insight in the AWS console as to why the event invocation is failing.
This step will configure a DLQ (dead letter queue) to capture the failed event invocations to see more detailed information for the failure reason:
Then:
To view messages in the DLQ:
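A sketch of pulling the failed invocations off the DLQ with boto3's `receive_message`; the queue URL is a placeholder, and the failure reason typically arrives in message attributes (e.g. `ERROR_CODE`, `ERROR_MESSAGE`). Only the request parameters are built here:

```python
# Sketch: parameters for sqs.receive_message to inspect failed event
# invocations on the DLQ. The queue URL is a placeholder assumption.
def dlq_receive_params(queue_url):
    return {
        "QueueUrl": queue_url,
        "MaxNumberOfMessages": 10,
        "WaitTimeSeconds": 5,              # long-poll briefly
        "MessageAttributeNames": ["All"],  # failure details live in attributes
    }

# In real use: boto3.client("sqs").receive_message(**dlq_receive_params(url))
```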
The event rule was failing due to the use of aliases for subnets and security groups. This has been resolved in the event rule definition (with comments updated further above in the appropriate set-up section).
The rule is not failing now but the service is not redeploying.
I suspect that this is now due to IAM setup, which needs updating with the relevant permissions.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:RunTask",
        "ecs:UpdateService",
        "ecs:DescribeServices",
        "ecs:DescribeTaskDefinition"
      ],
      "Resource": [
        "arn:aws:ecs:*:231168606641:task-definition/tctalent-test:*",
        "arn:aws:ecs:*:231168606641:task-definition/tctalent-test",
        "arn:aws:ecs:*:231168606641:service/tctalent-test/*",
        "arn:aws:ecs:*:231168606641:cluster/tctalent-test"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": "ecs-tasks.amazonaws.com"
        }
      }
    }
  ]
}
```
Still needs work - the event rule starts a new task:
But what we want is to redeploy the service, which is not happening:
Needs further investigation - may need to use AWS Lambda.
Creating a Lambda function to restart the ECS Fargate service:
Nb: 1 million requests / month - free
Add Python code to redeploy ECS service:
```python
import boto3
import json
from datetime import datetime


def lambda_handler(event, context):
    ecs_client = boto3.client('ecs')

    cluster_name = 'tctalent-test'
    service_name = 'tctalent-test'

    # Force a new deployment of the service (blue-green redeploy)
    response = ecs_client.update_service(
        cluster=cluster_name,
        service=service_name,
        forceNewDeployment=True
    )

    # Convert any datetime objects in the response to strings
    def convert_datetime(obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        raise TypeError("Type not serializable")

    # Return the response as a JSON serializable object
    return {
        'statusCode': 200,
        'body': json.loads(json.dumps(response, default=convert_datetime))
    }
```
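The datetime-serialization step in the handler can be exercised locally without AWS credentials; a minimal sketch with a fake ECS response:

```python
import json
from datetime import datetime

# Sketch: the same serialization trick used in the handler -- datetime
# values in the ECS response are converted to ISO strings so the Lambda
# return value is JSON-serializable.
def convert_datetime(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError("Type not serializable")

fake_response = {"service": {"createdAt": datetime(2024, 1, 2, 3, 4, 5)}}
body = json.loads(json.dumps(fake_response, default=convert_datetime))
# body["service"]["createdAt"] == "2024-01-02T03:04:05"
```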
Check IAM Permissions:
```json
{
  "Effect": "Allow",
  "Action": [
    "ecs:UpdateService",
    "ecs:DescribeServices",
    "ecs:DescribeTaskDefinition"
  ],
  "Resource": [
    "arn:aws:ecs:us-east-1:231168606641:service/tctalent-test/tctalent-test",
    "arn:aws:ecs:us-east-1:231168606641:cluster/tctalent-test"
  ]
},
{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": "*",
  "Condition": {
    "StringLike": {
      "iam:PassedToService": "ecs-tasks.amazonaws.com"
    }
  }
}
```
Add a resource-based policy to the Lambda function to allow it to be invoked from CloudWatch alarms.
See: https://medium.com/@dithya512m/trigger-aws-lambda-directly-from-cloudwatch-alarm-d9844a410e8c
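Following the linked article, the resource-based policy can be added with boto3's `add_permission`. The statement id and alarm ARN below are placeholder assumptions; the principal shown is the one used for direct alarm-to-Lambda invocation. As above, only the request parameters are built:

```python
# Sketch: parameters for lambda.add_permission so CloudWatch alarms may
# invoke the redeploy function. Statement id and ARNs are placeholders.
def alarm_invoke_permission(function_name, alarm_arn):
    return {
        "FunctionName": function_name,
        "StatementId": "AllowCloudWatchAlarmInvoke",
        "Action": "lambda:InvokeFunction",
        "Principal": "lambda.alarms.cloudwatch.amazonaws.com",
        "SourceArn": alarm_arn,  # restrict to the OOMErrorAlarm
    }

# In real use: boto3.client("lambda").add_permission(**alarm_invoke_permission(...))
```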
Notes:
Lambda Testing:
Tested and working:
The log filter picks up the OOM exception, triggering the configured CloudWatch alarm, which calls the service deployment Lambda function:
And this redeploys the ECS Fargate service:
Prod configuration:
Test:
Triggers the lambda:
And redeploys the service:
As a DevOps engineer, I want to ensure that the AWS task restarts upon OutOfMemory errors, because we can no longer rely on the service after it enters this state.