Closed: shanet closed this issue 1 year ago.
Hello -- we recently pulled in updates to the cni-plugins which may be causing this issue. I'm looking into it now.
Can you provide steps to repro? I'm also working on reproducing this error on my end.
re: CNI_CONTAINERID missing
I see where this became required in the CNI: https://github.com/containernetworking/cni/commit/a0a74bbcd34edc39f0cff3ed0cfb171974dfbefa
https://github.com/containernetworking/cni/tree/03e5b5ff9e3bec384cd3dba889d59d2fd5015da7 <- this is the last commit before the behavior change.
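For anyone following along, here's a minimal paraphrase of what that behavior change means in practice (this is not the upstream skel code, just an illustration of the new contract): a plugin invocation without CNI_CONTAINERID in its environment is now rejected, which is exactly the "CNI_CONTAINERID missing" error surfacing in the agent logs.
// Illustration only: the upstream CNI commit above makes CNI_CONTAINERID a
// required env variable, so a delete/add call made without it now fails
// instead of silently proceeding.
package main

import (
	"fmt"
	"os"
)

func requireContainerID() error {
	if os.Getenv("CNI_CONTAINERID") == "" {
		// Roughly the class of error that shows up in ecs-agent.log as
		// "CNI_CONTAINERID missing".
		return fmt.Errorf("required env variables [CNI_CONTAINERID] missing")
	}
	return nil
}

func main() {
	if err := requireContainerID(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("CNI_CONTAINERID =", os.Getenv("CNI_CONTAINERID"))
}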
@fierlion For sure, the steps to reproduce were fairly straightforward for me: start and stop tasks on an instance running agent 1.68.1, then check /var/log/ecs/ecs-agent.log on the EC2 host for errors about failing to clean up the network resources & IP addresses.
The issue with tasks failing to start was more difficult to reproduce. In our case it took a few days between deployment of the new ECS agent and it throwing the errors about running out of IP addresses. I'd imagine that stopping/starting enough tasks would trigger it, but I'm unsure what that threshold is exactly.
The error log is from here https://github.com/aws/amazon-ecs-agent/blob/master/agent/engine/task_manager.go#L768-L770 after calling this method https://github.com/aws/amazon-ecs-agent/blob/master/agent/ecscni/plugin_linux.go#L102, but inside the method we always explicitly set the ContainerID
That's where I was getting lost too. ReleaseIPResource sets the container ID with ContainerID: cfg.ContainerID, but I didn't see where cfg.ContainerID was set initially. I assumed it would be somewhere in BuildCNIConfigAwsvpc but I didn't see anything there.
https://github.com/aws/amazon-ecs-agent/blob/13d1e854decd15a8839ea76cc1cd1b73b9f9d03b/agent/engine/task_manager.go#L758-L760 It looks like we're not setting it. As it previously wasn't a required field this makes sense. We still aren't using it, so we should be able to set a static value and get it to work. I'll have a commit shortly and will build an agent to share via public ECR that we can all test with.
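To make the shape of that fix concrete, here is a minimal sketch assuming the agent ultimately hands the plugins a libcni RuntimeConf. The wrapper struct, the plugin path, and the "container-id" placeholder are illustrative assumptions based on the discussion above and PR #3563, not the agent's actual code.
// Sketch of the fix being discussed: make sure ContainerID is never empty
// when the CNI plugins are invoked for cleanup, since newer plugins require
// CNI_CONTAINERID. Illustrative only, not the agent's real implementation.
package main

import (
	"context"
	"fmt"

	"github.com/containernetworking/cni/libcni"
)

// cniCleanupConfig stands in for the agent-side cfg discussed above; only the
// fields relevant to this sketch are included.
type cniCleanupConfig struct {
	ContainerID   string
	NetNSPath     string
	InterfaceName string
}

func releaseResources(ctx context.Context, cni *libcni.CNIConfig, list *libcni.NetworkConfigList, cfg cniCleanupConfig) error {
	if cfg.ContainerID == "" {
		// Previously this was left unset, producing "CNI_CONTAINERID missing".
		// The test image above simply uses a static value ("container-id").
		cfg.ContainerID = "container-id"
	}
	rt := &libcni.RuntimeConf{
		ContainerID: cfg.ContainerID,
		NetNS:       cfg.NetNSPath,
		IfName:      cfg.InterfaceName,
	}
	if err := cni.DelNetworkList(ctx, list, rt); err != nil {
		return fmt.Errorf("releasing CNI resources for %s: %w", cfg.ContainerID, err)
	}
	return nil
}

func main() {
	// Plugin binaries are resolved from this search path; the directory here
	// is a placeholder.
	cni := libcni.NewCNIConfig([]string{"/path/to/cni/plugins"}, nil)
	fmt.Printf("constructed %T; releaseResources would run during task cleanup\n", cni)
}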
Pushed an image to public ECR which sets the container ID to container-id
This is based on the changes here: https://github.com/aws/amazon-ecs-agent/pull/3563
You can try it out:
docker pull public.ecr.aws/fierlion-aws/agent-cni-test:latest
(This is still WIP)
Hello! Is it safe to use ecs-agent version 1.68.1 in production in the meantime? We were going to update our ECS production cluster, currently on version 1.67.2, but we found out about this new open issue. Thanks
For what it's worth, I experienced the same issue this past Monday and have an open case with AWS Support.
I can corroborate that cycling all instances in our ECS cluster (with them still being on agent version 1.68.1, AMI amzn2-ami-ecs-hvm-2.0.20230127-x86_64-ebs) seems to have knocked something into alignment, ~but it's unclear whether or not the TaskFailedToStart errors will be coming back to haunt us. Things have remained stable for us for the past couple of days on that AMI.~
EDIT: We observed these errors again, and have since reverted to an AMI housing an ecs-agent < 1.68.1, with success 👍
I'm happy to provide more context if requested 👍
We have reverted the CNI plugins update in this PR: https://github.com/aws/amazon-ecs-agent/pull/3565 and are currently working to release it.
Agent 1.68.2 will have a fix for this once it's ready.
As a mitigation, you can downgrade your ECS Agent in place. First, stop the ecs service:
sudo systemctl stop ecs
Then you can pull the previous agent image from either Docker Hub or Public ECR:
docker pull amazon/amazon-ecs-agent:v1.68.0
or
docker pull public.ecr.aws/ecs/amazon-ecs-agent:v1.68.0
Tag that image as latest:
docker tag amazon/amazon-ecs-agent:v1.68.0 amazon/amazon-ecs-agent:latest
Now start ecs again and check the running agent version (via the agent introspection API):
sudo systemctl start ecs
curl -s 127.0.0.1:51678/v1/metadata | python -mjson.tool
This should now show:
"Version": "Amazon ECS Agent - v1.68.0
While the above release is still pending, you can apply this downgrade to multiple instances in your cluster at once (or just one) with the SSM send-command API. This assumes you have the SSM Agent installed and running on your hosts; it is installed by default in the ECS Optimized AMI.
You'll want to find all the instance IDs in your ECS cluster. These will be your targets:
aws ssm send-command \
--document-name "AWS-RunShellScript" \
--targets '[{"Key":"InstanceIds","Values":["instance-id-1", "instance-id-2"]}]' \
--parameters '{"commands":["#!/bin/bash","systemctl stop ecs","docker pull amazon/amazon-ecs-agent:v1.68.0","docker tag amazon/amazon-ecs-agent:v1.68.0 amazon/amazon-ecs-agent:latest","systemctl start ecs","curl -s 127.0.0.1:51678/v1/metadata | python -mjson.tool"]}' \
--output text
Check the output of the above invocation to be sure it succeeded:
aws ssm list-command-invocations \
--command-id "<command id from the above send-command>" \
--details \
--output text
1.68.2 Agent and AMIs have been published with the CNI rollback. I confirmed by pulling the latest parameters from us-east-1:
aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended --region us-east-1
...
"Value": "{\"ecs_agent_version\":\"1.68.2\",\"ecs_runtime_version\":\"Docker version 20.10.17\"
Thanks for your patience and the issue report. Closing this out as 1.68.2 is now released.
So just as a heads-up, this isn't fixed. We upgraded to 1.68.2 via the ECS console and still have an issue with containers launching and grabbing IPs when using the awsvpc network mode. We have confirmed with one of the ECS support engineers that this is still an issue.
Error from our log:
level=warn time=2023-02-03T19:13:12Z msg="Error starting/provisioning container[%s (Runtime ID: %s)];" runtimeID="cd706915fe88145c0fcc8f9f512514930970beb6dd1aa8c58ebda062151f16cb" error="container resource provisioning: failed to setup network namespace: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB comman…
Our CNI plugin's IPAM allocate/remove functions depend on Docker's libkv key-value store (there's a short sketch of what that looks like after the steps below).
To start you'll want to transition any affected container instances to DRAINING so new tasks aren't scheduled on the instance. You'll then want to stop all tasks/containers on the instance.
At this point your best option might be to just replace your container instance with an instance of the latest ECS Optimized AMI (if that's an option).
otherwise:
Stop both ecs-init and docker:
sudo systemctl stop docker
sudo systemctl stop ecs
Now clear the existing IPAM key-value store. Note that this will clear out all ECS-managed IP addresses, which may lead to unexpected consequences if you have any tasks or containers on the instance, so be sure that all existing task containers are stopped.
sudo rm /var/lib/ecs/data/eni-ipam.db
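For anyone curious why deleting that single file is a full reset: the ecs-ipam plugin keeps its leases in that BoltDB file through Docker's libkv, so the file is the only source of truth for ECS-managed IP allocations. Here is a minimal sketch of that dependency; the bucket name and key prefix are illustrative assumptions, not the plugin's actual schema.
// Sketch of the libkv/BoltDB dependency mentioned above. Opening the same
// file the ecs-ipam plugin uses shows why removing it clears every lease at
// once; the bucket name and key prefix below are assumptions for illustration.
package main

import (
	"fmt"
	"log"

	"github.com/docker/libkv"
	"github.com/docker/libkv/store"
	"github.com/docker/libkv/store/boltdb"
)

func main() {
	boltdb.Register() // make the boltdb backend available to libkv.NewStore

	kv, err := libkv.NewStore(
		store.BOLTDB,
		[]string{"/var/lib/ecs/data/eni-ipam.db"}, // the file removed in the step above
		&store.Config{Bucket: "ipam"},             // bucket name is an assumption
	)
	if err != nil {
		log.Fatalf("opening IPAM store: %v", err)
	}
	defer kv.Close()

	// List whatever leases exist under an assumed key prefix. On a drained
	// instance with docker and ecs stopped, this should come back empty.
	pairs, err := kv.List("10.0.0.0/24")
	if err != nil && err != store.ErrKeyNotFound {
		log.Fatalf("listing leases: %v", err)
	}
	for _, p := range pairs {
		fmt.Printf("lease %s -> %s\n", p.Key, string(p.Value))
	}
}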
Please follow the instructions provided above and report back if you run into any further issues. Closing this out.
Summary
Tasks fail to start with version 1.68.1 due to failure to release IP addresses from IPAM database.
Description
Last week we ran into an issue with ECS where tasks would fail to start. There was no information in the ECS console other than task failed to start, but upon checking the ECS logs on the EC2 host under /var/log/ecs/ecs-agent.log I saw the following (see the supporting log snippets below). This was despite IP addresses being available in the subnet. A reboot of the server temporarily fixed the problem.
Looking more into the logs, I also saw the following (again, see the log snippets below). This appears to be the cause of the IP address error above: namely, the ECS agent is failing to release IPs from its IPAM database and is not deleting stopped containers' network namespaces.
I tried reading through the code to determine why this became a problem last week, but nothing stood out to me except that the ContainerID seems to be null or 0, leading to the CNI_CONTAINERID missing error and the odd /host/proc/0/ns/net path (because why would it be checking PID 0?). It seems like this newer version has a problem with not getting the container ID and then failing to clean up the network resources of the pause container? Or something like that; this is where my knowledge of the ECS agent becomes too hazy.
I also noticed that the ECS agent was updated from version 1.68.0 to 1.68.1 and the changelog specifically noted that the CNI plugins were updated, which is the problematic area. Downgrading the ECS agent we are running to v1.68.0 made the above errors in the log disappear, so I concluded that the issue must be with the CNI plugin update in v1.68.1, which was published to Docker Hub last week.
Expected Behavior
IP addresses should be released and tasks should start
Observed Behavior
IP addresses are not released and tasks will eventually fail to start.
Environment Details
Docker version: 23.0.0
OS: Ubuntu (kernel 5.15.0-1028-aws)
Supporting Log Snippets