Open zxkane opened 2 weeks ago
Hi @zxkane, thanks for opening up this issue.
You could see abnormal connectivity between ECS and the agent on EC2,
level=info time=2024-10-15T05:19:50Z msg="Sending state change to ECS" eventType="TaskStateChange" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskStatus="MANIFEST_PULLED" taskReason="" taskPullStartedAt="0001-01-01T00:00:00Z" taskPullStoppedAt="0001-01-01T00:00:00Z" taskKnownSentStatus="NONE" taskExecutionStoppedAt="0001-01-01T00:00:00Z" containerChange-0="containerName=ecs-service-connect-nconQf containerStatus=RUNNING containerKnownSentStatus=NONE containerRuntimeID=fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b containerIsEssential=true"
level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"
What you see here is actually expected behavior: the agent is disconnecting from the ECS telemetry (TCS) connection (see ref where this log statement is coming from). This is also not the ECS endpoint over which we send state changes (that one is the ACS endpoint); the TCS endpoint is where we send metrics. The agent periodically disconnects from the telemetry endpoint and then reconnects to it, and you should see a corresponding log statement shortly afterwards.
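To make it easier to tell these messages apart when scanning an agent log, here is a minimal sketch that parses the `key=value` lines shown above and labels each one by what its `msg` refers to. The helper names (`parse_line`, `classify`) and the label strings are made up for illustration and are not part of the agent; the parser only assumes the line shape visible in the snippets in this thread.

```python
import re

# One key=value pair; the value is either a double-quoted string or a bare token.
PAIR = re.compile(r'([\w-]+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line):
    """Return the key=value fields of one agent log line as a dict."""
    fields = {}
    for key, value in PAIR.findall(line):
        # Drop the surrounding quotes of quoted values.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return fields


def classify(fields):
    """Label a parsed line (hypothetical labels, chosen for this sketch)."""
    msg = fields.get("msg", "")
    if msg.startswith("TCS "):
        return "telemetry"      # message about the metrics (TCS) connection
    if msg == "Sending state change to ECS":
        return "state-change"   # task/container state being reported
    return "other"


if __name__ == "__main__":
    sample = 'level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"'
    print(classify(parse_line(sample)))
```

Because the `containerChange-0` value is itself a space-separated list of `key=value` pairs, calling `parse_line` on that value again yields the nested fields such as `containerStatus`.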
Could you help clarify a bit more on what you mean by the essential container being stuck in pending? It looks like the container did transition to a running state.
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" changeEventStatus="RESOURCES_PROVISIONED" knownStatus="RUNNING" desiredStatus="RESOURCES_PROVISIONED"
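If it helps when answering, a rough sketch like the one below could summarize the last reported status per container from the agent log, which would show which container (if any) never reaches its desired status. It only relies on the quoted fields visible in the "Handling container change event" line above; the function names are hypothetical and the log text is assumed to be passed in as a string.

```python
import re

# Quoted fields only, e.g. container="..." knownStatus="RUNNING".
FIELD = re.compile(r'(\w+)="([^"]*)"')


def container_events(log_text):
    """Yield the quoted fields of every container change event line."""
    for line in log_text.splitlines():
        if 'msg="Handling container change event"' not in line:
            continue
        yield dict(FIELD.findall(line))


def latest_status(log_text):
    """Map container name -> statuses from its most recent change event."""
    latest = {}
    for fields in container_events(log_text):
        latest[fields.get("container")] = {
            "event": fields.get("changeEventStatus"),
            "known": fields.get("knownStatus"),
            "desired": fields.get("desiredStatus"),
        }
    return latest
```

Later lines for the same container overwrite earlier ones, so the result reflects the order of the log rather than any assumed ordering of status values.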
If possible, could you share a bit more on how the task definition is configured?
Summary
Everything works well without Service Connect. However, the essential container always stays pending after enabling Service Connect for ECS on EC2.
After switching the ECS service to Fargate, it works again with Service Connect enabled.
I found that the ECS agent lost its connection to the ECS service while 'sending status change to ECS'.
Description
I'm using the code snippet below to create an ECS service on EC2 with Service Connect via CDK,
If Service Connect is disabled, the container starts and runs well. However, it always stays pending after enabling Service Connect.
After inspecting the logs of the ecs-agent container on EC2, I found the output below. You can see abnormal connectivity between ECS and the agent on EC2,
After updating the above ECS service to use Fargate, it works well.
Expected Behavior
The essential container should start.
Observed Behavior
The ECS agent does not start the essential container after the Service Connect container has started.
Environment Details
ECS agent: 1.87.0
EC2 AMI: amzn2-ami-ecs-hvm-2.0.20241010-x86_64-ebs
Supporting Log Snippets
See the description above.