aws / amazon-ecs-ami

Packer recipes for building the official ECS-optimized Amazon Linux AMIs
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html
Apache License 2.0

[ECS] [Container OOM]: Containers OOM with Amazon Linux 2023 ECS AMI #240

Open rixwan-sharif opened 2 months ago

rixwan-sharif commented 2 months ago

Community Note

Tell us about your request
ECS containers are getting killed due to out-of-memory with the new Amazon Linux 2023 ECS AMI.

Which service(s) is this request for?
ECS with EC2 (Auto Scaling and Capacity Provider setup)

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We are deploying our workloads on AWS ECS (EC2-based). We recently migrated our cluster instances' underlying AMI to Amazon Linux 2023 (previously Amazon Linux 2). After the migration, we are seeing a lot of "OOM Container Killed" events for our services, without any change on the service side.

Are you currently working around this issue?

rgoltz commented 2 months ago

Hey @rixwan-sharif + all others with 👍

Thanks for the info and for sharing your experience! We are a regular ECS-on-EC2 user as well; we are still in the planning phase for the switch from Amazon Linux 2 to Amazon Linux 2023, and we also run a large number of containers. Would it be possible for you to add further details about your setup here?

- ECS: limits set at the task definition and/or container level
- Images: container OS used (e.g. Alpine, Distroless, ...), application framework in the container (e.g. Java/Spring Boot, Node.js, ...)
- Metrics comparing AL2 vs. AL2023

We asked the AWS ECS team via a support case whether there is a known error or behaviour like the one you described here; they said no:

> Please note that the issue raised is not a known issue internally. Also, there are no known issues related to Amazon Linux 2023 for out-of-memory behaviour.

Did you also forward the problem as a support case?

Thanks, Robert

sparrc commented 1 month ago

Hello, I have transferred this issue to the ECS/EC2 AMI repo from containers-roadmap, since this sounds more like it could be a bug or change in behavior in the AMI, rather than a feature request.

@rixwan-sharif could you let us know which AL2023 AMI version you used? Was it the latest available? Could you also provide the task and container limits that you have in your task definition(s)?

Two differences that come to mind that may be relevant: the latest AL2023 AMI uses Docker 25.0 and cgroups v2, whereas the latest AL2 AMI is currently on Docker 20.10 and cgroups v1.
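
For anyone comparing hosts, here is a minimal sketch (purely illustrative, not an official tool) of how to confirm which cgroup version and Docker server version a container instance is actually running; it assumes it is executed directly on the instance with the Docker CLI available, which is the case on the ECS-optimized AMIs:

```python
# Minimal check: cgroup version and Docker server version on this instance.
import os
import subprocess


def cgroup_version() -> str:
    # On the cgroups v2 unified hierarchy, cgroup.controllers exists at the root.
    return "v2" if os.path.exists("/sys/fs/cgroup/cgroup.controllers") else "v1"


def docker_server_version() -> str:
    # Requires the Docker CLI, which the ECS-optimized AMIs ship with.
    out = subprocess.run(
        ["docker", "version", "--format", "{{.Server.Version}}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


if __name__ == "__main__":
    print(f"cgroups: {cgroup_version()}, docker: {docker_server_version()}")
```

Running this on an AL2 and an AL2023 instance side by side should make the Docker/cgroups difference above concrete.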

sparrc commented 1 month ago

If you were not using the latest AL2023 AMI, one thing to note is that the Amazon Linux team released a systemd fix in late-September 2023 for a bug in the cgroup OOM-kill behavior. (Search "systemd" in release notes here: https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.2.20230920.html)
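
For anyone unsure which AMI release their instances were actually launched from, a rough sketch along these lines could help; it is illustrative only and assumes IMDSv2 is reachable from the instance and that the instance profile allows ec2:DescribeImages:

```python
# Illustrative helper: resolve the AMI name of the running instance so it can
# be compared against the 2023.2.20230920 release mentioned above.
import urllib.request

import boto3  # assumes credentials with ec2:DescribeImages

IMDS = "http://169.254.169.254/latest"


def imds_get(path: str) -> str:
    # IMDSv2: fetch a session token first, then use it for metadata reads.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    token = urllib.request.urlopen(token_req).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/{path}", headers={"X-aws-ec2-metadata-token": token}
    )
    return urllib.request.urlopen(req).read().decode()


def ami_release_name() -> str:
    ami_id = imds_get("meta-data/ami-id")
    region = imds_get("meta-data/placement/region")
    ec2 = boto3.client("ec2", region_name=region)
    # e.g. "al2023-ami-ecs-hvm-2023.0.20240409-kernel-6.1-x86_64"
    return ec2.describe_images(ImageIds=[ami_id])["Images"][0]["Name"]


if __name__ == "__main__":
    print(ami_release_name())
```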

sparrc commented 1 month ago

If anyone has data to provide to the engineering team that can't be shared here, please feel free to email it to ecs-agent-external (at) amazon (dot) com, thank you :)

rixwan-sharif commented 1 month ago

Hi, this is the AMI version we are using:

AMI Version: al2023-ami-ecs-hvm-2023.0.20240409-kernel-6.1-x86_64

Task/container details:

- Base Docker Image: adoptopenjdk/openjdk14:x86_64-debian-jdk-14.0.2_12
- Application Framework: Java/Spring Boot

Resources:

- CPU: 0.125
- Memory (hard limit): 1 GB

Docker stats on the Amazon Linux 2 AMI (screenshot attached).

Docker stats on the Amazon Linux 2023 AMI (screenshot attached); the memory hard limit was increased to 3 GB because the container was OOMing with 1 GB.
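
For anyone who wants to gather the same comparison, a small sketch like the following (illustrative only, not what produced the screenshots above) takes a one-shot snapshot of per-container memory usage that can be diffed between an AL2 and an AL2023 instance:

```python
# Illustrative: single snapshot of per-container memory usage via `docker stats`.
import subprocess


def memory_snapshot() -> dict:
    # --no-stream takes one sample; the Go template selects name and memory usage.
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.Name}}\t{{.MemUsage}}"],
        capture_output=True, text=True, check=True,
    )
    snapshot = {}
    for line in out.stdout.splitlines():
        name, mem = line.split("\t", 1)
        snapshot[name] = mem  # e.g. "812.3MiB / 1GiB"
    return snapshot


if __name__ == "__main__":
    for name, mem in memory_snapshot().items():
        print(f"{name}: {mem}")
```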

And yes, we already opened a support case too (Case ID 171387184800518). This is what we got from support:

> [+] Upon further troubleshooting, we found that there seems to be an issue with the AL2023 AMI which our internal team is already working on, and below is the wording shared by them:
>
> We are following up on your inquiry regarding increased container OOM (out-of-memory) kills when using the ECS-Optimized AL2023 AMI, compared to the AL2 AMI. The team is investigating to identify what has caused the OOM-kill behavior change. We suspect a combination of cgroups v2 and container runtime updates. As a workaround, we recommend that customers adjust container and task memory hard limits to have a larger buffer, based on their container memory usage patterns, using ECS Container Insights metrics.
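
For reference, a minimal boto3 sketch of applying that workaround (raising the container hard limit while keeping a lower reservation) might look like the following; it is loosely based on the numbers shared earlier in this thread, but the family name and exact values are placeholders, not the reporter's actual task definition:

```python
# Illustrative only: register a new task definition revision with a larger
# memory buffer, as suggested in the support response above.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="my-java-service",  # placeholder family name
    requiresCompatibilities=["EC2"],
    networkMode="bridge",
    containerDefinitions=[
        {
            "name": "app",
            "image": "adoptopenjdk/openjdk14:x86_64-debian-jdk-14.0.2_12",
            "cpu": 128,  # 0.125 vCPU expressed in CPU units
            # Hard limit: the container is OOM-killed above this value.
            "memory": 3072,  # raised from 1024 after the AL2023 migration
            # Soft limit: what the scheduler reserves on the instance.
            "memoryReservation": 1024,
            "essential": True,
        }
    ],
)
```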

egemen-touchstay commented 2 weeks ago

Hi @sparrc, after switching to AL2023 from AL2 we faced a similar issue as well. We haven't had OOM kills yet, but memory consumption has nearly doubled, and it keeps increasing as the application runs, which looks like a memory leak.

- AMI Version: al2023-ami-ecs-hvm-2023.0.20240515-kernel-6.1-x86_64
- Base Docker Image: python:3.8-slim
- Application Framework: Python / Django / uWSGI