aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

runtime error: index out of range #1237

Closed: cgouteix closed this issue 6 years ago

cgouteix commented 6 years ago

Summary

After upgrading to 1.17.0, the agent container is unable to start on some instances, seemingly at random.

Description

I'm using ECS with an EC2 Auto Scaling group. All my containers have Docker health checks enabled.

I'm using ami-63cbae0c (amzn-ami-2017.09.h-amazon-ecs-optimized) in eu-central-1. I was on ami-eacf5d85 (amzn-ami-2017.09.f-amazon-ecs-optimized) with no such trouble.

After upgrading to 1.17.0 I get this behavior at random (on some EC2 instances, but not all):

Environment Details

$ docker info
Containers: 35
 Running: 29
 Paused: 0
 Stopped: 6
Images: 6
Server Version: 17.09.1-ce
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3kB
 Base Device Size: 10.74GB
 Backing Filesystem: ext4
 Data file: 
 Metadata file: 
 Data Space Used: 2.904GB
 Data Space Total: 23.33GB
 Data Space Available: 20.42GB
 Metadata Space Used: 2.707MB
 Metadata Space Total: 25.17MB
 Metadata Space Available: 22.46MB
 Thin Pool Minimum Free Space: 2.333GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.77-31.58.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.09
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67GiB
Name: ip-10-93-4-161
ID: YFQZ:3NTN:N2RV:LFFW:UTLS:SWUY:5UES:324C:ZB5I:67LA:TI76:NMAA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
$ curl http://localhost:51678/v1/metadata
{"Cluster":"xxx","ContainerInstanceArn":"arn:aws:ecs:eu-central-1:xxx","Version":"Amazon ECS Agent - v1.17.0 (761937f7)"}
$ df -h
Filesystem        Size   Used  Avail Use% Mounted on
/dev/xvda1         7,8G    769M  7,0G  10% /
devtmpfs           7,9G    196K  7,9G   1% /dev
tmpfs              7,9G       0  7,9G   0% /dev/shm

Supporting Log Snippets

Very hard to catch, because the agent container is recreated every few seconds:

$ docker logs -f ecs-agent
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/aws/amazon-ecs-agent/agent/engine.metadataFromContainer(0xc4204f2000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_client.go:792 +0x7b2
github.com/aws/amazon-ecs-agent/agent/engine.(*dockerGoClient).DescribeContainer(0xc420281b80, 0xc4201cb8c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_client.go:590 +0x101
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).synchronizeContainerStatus(0xc4202a8000, 0xc42029bb90, 0xc4202a8140)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:349 +0x5c7
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).synchronizeState(0xc4202a8000)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:276 +0x295
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).Init(0xc4202a8000, 0x1002040, 0xc42001a028, 0x2, 0x410bf7)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:199 +0xf6
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).MustInit.func1(0x0, 0x0)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:227 +0x56
github.com/aws/amazon-ecs-agent/agent/utils.RetryWithBackoffCtx(0x1002040, 0xc42001a028, 0xffda00, 0xc420313500, 0xc4200ebc68, 0xc4200ebc01, 0xc420313500)
    /go/src/github.com/aws/amazon-ecs-agent/agent/utils/utils.go:130 +0x87
github.com/aws/amazon-ecs-agent/agent/utils.RetryWithBackoff(0xffda00, 0xc420313500, 0xc4200ebc68, 0xf, 0xc4203a6200)
    /go/src/github.com/aws/amazon-ecs-agent/agent/utils/utils.go:113 +0x57
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).MustInit(0xc4202a8000, 0x1002040, 0xc42001a028)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:223 +0x151
github.com/aws/amazon-ecs-agent/agent/app.(*ecsAgent).doStart(0xc420203e00, 0xc4201fc770, 0x1001180, 0xc420292980, 0x1006780, 0xc4201da0f0, 0x1003c40, 0xc4201f2fc0, 0x1003280, 0xc4201da1e0, ...)
    /go/src/github.com/aws/amazon-ecs-agent/agent/app/agent.go:285 +0x42c
github.com/aws/amazon-ecs-agent/agent/app.(*ecsAgent).start(0xc420203e00, 0xc42001a028)
    /go/src/github.com/aws/amazon-ecs-agent/agent/app/agent.go:199 +0x339
github.com/aws/amazon-ecs-agent/agent/app.Run(0xc42000e180, 0x0, 0x0, 0x0)
    /go/src/github.com/aws/amazon-ecs-agent/agent/app/run.go:64 +0x15b
main.main()
    /go/src/github.com/aws/amazon-ecs-agent/agent/agent.go:29 +0x63
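
Note: the following is a hypothetical sketch only, not the agent's actual code. An "index out of range" panic pinned to a single line such as docker_client.go:792 typically means a slice is indexed without a length guard. Given that every container in this report runs a Docker health check, one plausible shape of the bug is reading the most recent health-check log entry from a slice that may be empty (e.g. a container that was created but never health-checked). All names below are made up for illustration:

package main

import "fmt"

// healthLogEntry is a stand-in for a Docker health-check log entry.
type healthLogEntry struct {
	ExitCode int
	Output   string
}

// lastHealthOutput returns the output of the most recent health check.
// The unguarded form log[len(log)-1] panics with "index out of range"
// when the slice is empty; the length check below avoids that.
func lastHealthOutput(log []healthLogEntry) string {
	if len(log) == 0 {
		return ""
	}
	return log[len(log)-1].Output
}

func main() {
	var empty []healthLogEntry
	fmt.Println(lastHealthOutput(empty)) // prints an empty line instead of panicking
}
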
cgouteix commented 6 years ago

it is happening again in my pre-production environment

$ docker logs -f ecs-agent
2018-02-16T17:05:16Z [INFO] Loading configuration
2018-02-16T17:05:16Z [INFO] Loading state! module="statemanager"
2018-02-16T17:05:16Z [INFO] Event stream ContainerChange start listening...
2018-02-16T17:05:16Z [INFO] Restored cluster 'cmdpre'
2018-02-16T17:05:16Z [INFO] Restored from checkpoint file. I am running as 'arn:aws:ecs:eu-central-1:559122992834:container-instance/73a4c4aa-a1de-4030-b709-6e2020fc137f' in cluster 'cmdpre'
2018-02-16T17:05:16Z [INFO] Registered container instance with cluster!
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/aws/amazon-ecs-agent/agent/engine.metadataFromContainer(0xc4205de600, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_client.go:792 +0x7b2
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).synchronizeContainerStatus(0xc4202b2000, 0xc4202856e0, 0xc4202b2640)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:337 +0x336
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).synchronizeState(0xc4202b2000)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:276 +0x295
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).Init(0xc4202b2000, 0x1002040, 0xc42001a028, 0x2, 0x410bf7)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:199 +0xf6
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).MustInit.func1(0x0, 0x0)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:227 +0x56
github.com/aws/amazon-ecs-agent/agent/utils.RetryWithBackoffCtx(0x1002040, 0xc42001a028, 0xffda00, 0xc4204ab410, 0xc420347c68, 0xc420347c01, 0xc4204ab410)
    /go/src/github.com/aws/amazon-ecs-agent/agent/utils/utils.go:130 +0x87
github.com/aws/amazon-ecs-agent/agent/utils.RetryWithBackoff(0xffda00, 0xc4204ab410, 0xc420347c68, 0xf, 0xc42027dc40)
    /go/src/github.com/aws/amazon-ecs-agent/agent/utils/utils.go:113 +0x57
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).MustInit(0xc4202b2000, 0x1002040, 0xc42001a028)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:223 +0x151
github.com/aws/amazon-ecs-agent/agent/app.(*ecsAgent).doStart(0xc420276e00, 0xc42019a930, 0x1001180, 0xc42025f720, 0x1006780, 0xc420066c80, 0x1003c40, 0xc4202a81b0, 0x1003280, 0xc420066d70, ...)
    /go/src/github.com/aws/amazon-ecs-agent/agent/app/agent.go:285 +0x42c
github.com/aws/amazon-ecs-agent/agent/app.(*ecsAgent).start(0xc420276e00, 0xc42001a028)
    /go/src/github.com/aws/amazon-ecs-agent/agent/app/agent.go:199 +0x339
github.com/aws/amazon-ecs-agent/agent/app.Run(0xc42000e180, 0x0, 0x0, 0x0)
    /go/src/github.com/aws/amazon-ecs-agent/agent/app/run.go:64 +0x15b
main.main()
    /go/src/github.com/aws/amazon-ecs-agent/agent/agent.go:29 +0x63
$ docker info
Containers: 31
 Running: 20
 Paused: 0
 Stopped: 11
Images: 5
Server Version: 17.09.1-ce
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3kB
 Base Device Size: 10.74GB
 Backing Filesystem: ext4
 Data file: 
 Metadata file: 
 Data Space Used: 2.4GB
 Data Space Total: 23.33GB
 Data Space Available: 20.93GB
 Metadata Space Used: 2.392MB
 Metadata Space Total: 25.17MB
 Metadata Space Available: 22.77MB
 Thin Pool Minimum Free Space: 2.333GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.77-31.58.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.09
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.798GiB
Name: ip-10-93-5-190
ID: YX3L:UVEW:F2GM:PW73:ZPH7:C4BC:FB3U:3P22:CUUM:UJU2:YPM2:VXTX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
$ curl http://localhost:51678/v1/metadata
curl: (7) Failed to connect to localhost port 51678: Connection refused
adnxn commented 6 years ago

@cgouteix good catch, thanks for bringing this to our attention.

adnxn commented 6 years ago

also -

I'm not able to reproduce this error on my end and some more logging data would be very helpful for debugging this issue.

Do you mind sharing your container image metadata as well? Specifically I'm curious about the healthcheck command and how the output from that is interacting with the agent. Would you be able to send me logs from the problematic instance? If this is easily reproducible for you, I'd also suggest running the agent with ECS_LOGLEVEL=debug.
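
For reference, one common way to do that on the ECS-optimized AMI (a general configuration step, not specific to this issue) is to set the variable in /etc/ecs/ecs.config and restart the agent; on Amazon Linux 1 the agent is managed as an upstart service:

$ echo "ECS_LOGLEVEL=debug" | sudo tee -a /etc/ecs/ecs.config
$ sudo stop ecs && sudo start ecs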

You can capture logs using our log collection tool and email them to me directly at adnkha at amazon dot com. If you end up sending the logs to my email, please update this issue so I don't miss them.

Thanks 😄

cgouteix commented 6 years ago

I'll try to reproduce it and give you more info. But for now I had to roll back to the previous AMI.

I think maybe it has something to do with Scheduled Tasks. I use them a lot.

cgouteix commented 6 years ago

So it seems that I was right: it happens with scheduled tasks. I can't say whether ecs-agent crashes after the task starts or after it ends, but it did crash. I was able to get ecs-agent running again after deleting the stopped container that had been created for the scheduled task. I'll send you my ECS logs by email, but I don't think they will help much.
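
For anyone hitting the same restart loop, a rough sketch of that workaround using generic Docker commands (the container ID below is a placeholder): find the stopped container left behind by the scheduled task, remove it, and let the agent come back up on its next restart:

$ docker ps -a --filter status=exited    # locate the stopped scheduled-task container
$ docker rm <container-id>               # the agent should start cleanly on its next restart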

dynaum commented 6 years ago

We are facing the same problem here, but in our case the error occurs when the agent tries to stop a task with two containers.

garrickstapleszed commented 6 years ago

+1

I'm seeing a lot of these after upgrading to 1.17.0. I'm anxiously awaiting 1.17.1.

adnxn commented 6 years ago

The latest release 1.17.1 includes a fix for this issue.
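
For anyone landing on this issue later: on the ECS-optimized AMI the agent can usually be updated in place rather than waiting for a new AMI. A typical path on Amazon Linux 1 (assuming ecs-init is installed, as on the AMIs mentioned above) is:

$ sudo yum update -y ecs-init
$ sudo stop ecs && sudo start ecs
$ curl -s http://localhost:51678/v1/metadata   # confirm the agent now reports v1.17.1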