aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

Tasks fail to start with version 1.68.1 due to failure to release IP addresses from IPAM database #3562

Closed: shanet closed this 1 year ago

shanet commented 1 year ago

Summary

Tasks fail to start with version 1.68.1 due to failure to release IP addresses from IPAM database.

Description

Last week we ran into an issue with ECS where tasks would fail to start. There was no information in the ECS console other than that the task failed to start, but upon checking the ECS agent logs on the EC2 host under /var/log/ecs/ecs-agent.log I saw the following:

level=warn time=2023-02-03T19:13:12Z msg="Error starting/provisioning container[%s (Runtime ID: %s)];" runtimeID="cd706915fe88145c0fcc8f9f512514930970beb6dd1aa8c58ebda062151f16cb" error="container resource provisioning: failed to setup network namespace: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB commands: failed to get available ip from the db: getAvailableIP ipstore: failed to find available ip addresses in the subnet\nadd network failed

This was despite IP addresses being available in the subnet. A reboot of the server temporarily fixed the problem.
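If you want to check a host for the same failure, grepping the agent log for the message above works well enough (the pattern is just an example):

grep "failed to find available ip" /var/log/ecs/ecs-agent.log*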

Looking more into the logs I also saw the following:

level=info time=2023-02-04T00:41:43Z msg="Managed task has reached stopped; waiting for container cleanup" task="fc50c79e949945a8855d63d2f3c27524"
level=info time=2023-02-04T00:41:43Z msg="Cleaning up the network namespace" task="fc50c79e949945a8855d63d2f3c27524" container="~internal~ecs~pause"
level=error time=2023-02-04T00:41:43Z msg="Delete network failed: failed to Statfs \"/host/proc/0/ns/net\": no such file or directory" module=plugin.go
level=error time=2023-02-04T00:41:43Z msg="Unable to cleanup pause container network namespace" task="fc50c79e949945a8855d63d2f3c27524" error="delete network failed: failed to Statfs \"/hosfile or directory"
level=info time=2023-02-04T00:41:43Z msg="IPAM releasing ip for task eni" task="fc50c79e949945a8855d63d2f3c27524"
level=error time=2023-02-04T00:41:43Z msg="Failed to release ip; IPAM error" task="fc50c79e949945a8855d63d2f3c27524" error="required env variables [CNI_CONTAINERID] missing"

This appears to be the cause of the IP address error above. Namely, the ECS agent is failing to release IPs from its IPAM database and is not deleting stopped containers' network namespaces.

I tried reading through the code to determine why this became a problem last week, but nothing stood out to me except that the ContainerID seems to be null or 0, leading to the missing CNI_CONTAINERID error and the odd /host/proc/0/ns/net path (why else would it be checking PID 0?). It seems like this newer version has a problem getting the container ID and then fails to clean up the pause container's network resources? Or something like that; this is where my knowledge of the ECS agent becomes too hazy.

I also noticed that the ECS agent was updated from version 1.68.0 to 1.68.1 and the changelog specifically noted that the CNI plugins were updated, which is the problematic area. Downgrading the ECS agent we are running to v1.68.0 made the above errors in the log disappear, so I concluded that the issue must be with the CNI plugin update in v1.68.1, which was published to Docker Hub last week.

Expected Behavior

IP addresses should be released and tasks should start

Observed Behavior

IP addresses are not released and tasks will eventually fail to start.

Environment Details

Docker version: 23.0.0
OS: Ubuntu (kernel 5.15.0-1028-aws)

Supporting Log Snippets

level=warn time=2023-02-03T19:13:12Z msg="Error starting/provisioning container[%s (Runtime ID: %s)];" runtimeID="cd706915fe88145c0fcc8f9f512514930970beb6dd1aa8c58ebda062151f16cb" error="container resource provisioning: failed to setup network namespace: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB commands: failed to get available ip from the db: getAvailableIP ipstore: failed to find available ip addresses in the subnet\nadd network failed
level=info time=2023-02-04T00:41:43Z msg="Managed task has reached stopped; waiting for container cleanup" task="fc50c79e949945a8855d63d2f3c27524"
level=info time=2023-02-04T00:41:43Z msg="Cleaning up the network namespace" task="fc50c79e949945a8855d63d2f3c27524" container="~internal~ecs~pause"
level=error time=2023-02-04T00:41:43Z msg="Delete network failed: failed to Statfs \"/host/proc/0/ns/net\": no such file or directory" module=plugin.go
level=error time=2023-02-04T00:41:43Z msg="Unable to cleanup pause container network namespace" task="fc50c79e949945a8855d63d2f3c27524" error="delete network failed: failed to Statfs \"/hosfile or directory"
level=info time=2023-02-04T00:41:43Z msg="IPAM releasing ip for task eni" task="fc50c79e949945a8855d63d2f3c27524"
level=error time=2023-02-04T00:41:43Z msg="Failed to release ip; IPAM error" task="fc50c79e949945a8855d63d2f3c27524" error="required env variables [CNI_CONTAINERID] missing"
fierlion commented 1 year ago

Hello -- we recently pulled in updates to the cni-plugins which may be causing this issue. I'm looking into it now.

Can you provide steps to repro? I'm also working on reproducing this error on my end.

fierlion commented 1 year ago

re: CNI_CONTAINERID missing I see where this became required in the CNI: https://github.com/containernetworking/cni/commit/a0a74bbcd34edc39f0cff3ed0cfb171974dfbefa

fierlion commented 1 year ago

https://github.com/containernetworking/cni/tree/03e5b5ff9e3bec384cd3dba889d59d2fd5015da7 <- this is the last commit before the behavior change.

shanet commented 1 year ago

@fierlion For sure, the steps to reproduce were fairly straightforward for me:

  1. Add an EC2 instance to a cluster running v1.68.1 of the ECS agent
  2. Start some tasks on it
  3. Stop those tasks from the ECS console
  4. Observe the errors from above in /var/log/ecs/ecs-agent.log on the EC2 host about failing to clean up the network resources & IP addresses.

The issue with tasks failing to start was more difficult to reproduce. In our case it took a few days between deployment of the new ECS agent and it throwing the errors about running out of IP addresses. I'd imagine that stopping/starting enough tasks would trigger it but I'm unsure what that threshold is exactly.
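If it helps with reproducing the exhaustion itself, here is a rough sketch of how the stop/start cycling could be scripted (cluster, task definition, and container instance ID are placeholders, and the iteration count is a guess):

# Repeatedly start and stop a task pinned to the affected instance, then check
# /var/log/ecs/ecs-agent.log on that host for the cleanup/IPAM errors above.
for i in $(seq 1 50); do
  TASK_ARN=$(aws ecs start-task --cluster my-cluster --task-definition my-taskdef \
    --container-instances <container-instance-id> \
    --query 'tasks[0].taskArn' --output text)
  aws ecs stop-task --cluster my-cluster --task "$TASK_ARN" > /dev/null
  sleep 30
done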

fierlion commented 1 year ago

The error log is from here: https://github.com/aws/amazon-ecs-agent/blob/master/agent/engine/task_manager.go#L768-L770, after calling this method: https://github.com/aws/amazon-ecs-agent/blob/master/agent/ecscni/plugin_linux.go#L102, but inside the method we always explicitly set the ContainerID.

shanet commented 1 year ago

That's where I was getting lost too. ReleaseIPResource sets the container ID with ContainerID: cfg.ContainerID, but I didn't see where cfg.ContainerID was set initially. I assumed it should be somewhere in BuildCNIConfigAwsvpc but I didn't see anything there.

fierlion commented 1 year ago

https://github.com/aws/amazon-ecs-agent/blob/13d1e854decd15a8839ea76cc1cd1b73b9f9d03b/agent/engine/task_manager.go#L758-L760 It looks like we're not setting it. As it previously wasn't a required field this makes sense. We still aren't using it, so we should be able to set a static value and get it to work. I'll have a commit shortly and will build an agent to share via public ECR that we can all test with.

fierlion commented 1 year ago

Pushed an image to public ECR which sets the container ID to "container-id". This is based on the changes here: https://github.com/aws/amazon-ecs-agent/pull/3563. You can try it out:

docker pull public.ecr.aws/fierlion-aws/agent-cni-test:latest

(This is still WIP)

okossuth commented 1 year ago

Hello, is it safe to use ecs-agent version 1.68.1 in production in the meantime? We were going to update our ECS production cluster, currently on version 1.67.2, but we found out about this new open issue. Thanks

scottx611x commented 1 year ago

For what it's worth I've experienced the same issue this past Monday and have an actively open case w/ AWS Support

I can corroborate that cycling all instances in our ECS cluster (with them still on agent version 1.68.1, AMI amzn2-ami-ecs-hvm-2.0.20230127-x86_64-ebs) seems to have knocked something into alignment, ~but it's unclear whether or not the TaskFailedToStart errors will be coming back to haunt us. Things have remained stable for us for the past couple of days on that AMI.~

EDIT: We observed these errors again, and have since reverted to an AMI housing an ecs-agent<1.68.1 with success 👍

I'm happy to provide more context if requested 👍

sparrc commented 1 year ago

We have reverted the CNI plugins update in this PR: https://github.com/aws/amazon-ecs-agent/pull/3565 and are currently working to release it.

Agent 1.68.2 will have a fix for this once it's ready.

fierlion commented 1 year ago

As a mitigation, you can downgrade your ECS Agent in place. First, stop the ecs service:

sudo systemctl stop ecs

Then you can pull the previous agent image from either Docker Hub or Public ECR:

docker pull amazon/amazon-ecs-agent:v1.68.0

or

docker pull public.ecr.aws/ecs/amazon-ecs-agent:v1.68.0

Tag that image as latest:

docker tag amazon/amazon-ecs-agent:v1.68.0 amazon/amazon-ecs-agent:latest
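(If you pulled from Public ECR instead, tag that image the same way, e.g.:)

docker tag public.ecr.aws/ecs/amazon-ecs-agent:v1.68.0 amazon/amazon-ecs-agent:latest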

Now start ecs again and check the running agent version (via the agent introspection API):

sudo systemctl start ecs
curl -s 127.0.0.1:51678/v1/metadata | python -mjson.tool

This should now show:

 "Version": "Amazon ECS Agent - v1.68.0"
fierlion commented 1 year ago

While the above release is still pending, you can update multiple instances in your cluster at once (or just one) with the SSM send-command API. This assumes you have the SSM Agent installed and running on your hosts; it is installed by default in the ECS Optimized AMI. You'll want to find all the instance IDs in your ECS cluster. These will be your targets:

aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --targets '[{"Key":"InstanceIds","Values":["instance-id-1", "instance-id-2"]}]' \
    --parameters '{"commands":["#!/bin/bash","systemctl stop ecs","docker pull amazon/amazon-ecs-agent:v1.68.0","docker tag amazon/amazon-ecs-agent:v1.68.0 amazon/amazon-ecs-agent:latest","systemctl start ecs","curl -s 127.0.0.1:51678/v1/metadata | python -mjson.tool"]}' \
    --output text

Check the output of the above invocation to be sure it has succeeded:

aws ssm list-command-invocations \
    --command-id "<command id from the above send-command>" \
    --details \
    --output text
fierlion commented 1 year ago

1.68.2 Agent and AMIs have been published with the CNI rollback. I confirmed by pulling the latest parameters from us-east-1:

aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended --region us-east-1
...
"Value": "{\"ecs_agent_version\":\"1.68.2\",\"ecs_runtime_version\":\"Docker version 20.10.17\"
sparrc commented 1 year ago

Thanks for your patience and the issue report; closing this out as 1.68.2 is now released.

rmccarthy-ellevation commented 1 year ago

So just as a heads-up, this isn't fixed. We upgraded to 1.68.2 via the ECS console and still have an issue with containers launching and grabbing IPs when using AWS VPC. We have confirmed with one of the ECS support persons that this is still an issue.

Error from our log:

level=warn time=2023-02-03T19:13:12Z msg="Error starting/provisioning container[%s (Runtime ID: %s)];" runtimeID="cd706915fe88145c0fcc8f9f512514930970beb6dd1aa8c58ebda062151f16cb" error="container resource provisioning: failed to setup network namespace: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB comman…
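For anyone else hitting this, a quick way to confirm which agent version the container instances are actually running (cluster name is a placeholder):

aws ecs describe-container-instances --cluster my-cluster \
    --container-instances $(aws ecs list-container-instances --cluster my-cluster --query 'containerInstanceArns[]' --output text) \
    --query 'containerInstances[].[ec2InstanceId,versionInfo.agentVersion]' --output table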

fierlion commented 1 year ago

Our CNI plugin's IPAM allocate/remove functions depend on Docker's libkv key-value store.

To start you'll want to transition any affected container instances to DRAINING so new tasks aren't scheduled on the instance. You'll then want to stop all tasks/containers on the instance.
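For example, draining an instance and stopping its tasks via the AWS CLI might look like this (cluster name, container instance ARN, and task ARN are placeholders):

aws ecs update-container-instances-state --cluster my-cluster --container-instances <container-instance-arn> --status DRAINING
aws ecs list-tasks --cluster my-cluster --container-instance <container-instance-arn>
aws ecs stop-task --cluster my-cluster --task <task-arn>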

At this point your best option might be to just replace your container instance with an instance of the latest ECS Optimized AMI (if that's an option).

otherwise:

Stop both ecs-init and docker:

sudo systemctl stop docker
sudo systemctl stop ecs

Now clear the existing ipam key value store. Note that this will clear out all ecs-managed IP addresses which may lead to unexpected consequences if you have any tasks or containers on the instance, so be sure that all existing task containers are stopped.

sudo rm /var/lib/ecs/data/eni-ipam.db
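Then start Docker and the ecs service back up:

sudo systemctl start docker
sudo systemctl start ecs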
SreeeS commented 1 year ago

Please follow the instructions provided above and report back if you have any further issues; closing this out.