aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

AWS console way out of sync with instance state #313

Closed workybee closed 8 years ago

workybee commented 8 years ago

The AWS console "Task" tab shows ~48 tasks, but instances have only 3. My naive understanding is that the ecs-agent is what the AWS console uses to know what is happening on the instances, hence the query here.

A "docker ps -a" on all the instances shows nothing related to a task mentioned on the console.

I am doing an hourly purge of metadata (i.e. "docker ps -a -q | xargs docker rm" on each instance) to help keep the disk footprint low and prevent docker from locking up completely. I suspect this may be causing the issue, but how do I do that without the console (and hence AWS's notion of memory reservation etc.) getting all messed up?
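A gentler variant that only removes containers that have already exited (leaving running task containers and the ecs-agent alone) would be something like:

# remove only exited containers; -r skips docker rm when there is nothing to remove
docker ps -aq --filter status=exited | xargs -r docker rm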

And/Or how to re-sync things?

workybee commented 8 years ago

@aaithal - perhaps you have some comment, relates a bit to the space issue you helped me with in issue #309

aaithal commented 8 years ago

You could use the stop-task API from the aws cli or the SDK to stop tasks.
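Something along these lines (cluster name and task ID are placeholders here):

# stop a single task by ID
aws ecs stop-task --cluster <cluster-name> --task <task-id>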

The AWS console "Task" tab shows ~48 tasks

Could you please share the status (Last status and Desired status) of the tasks that are not running on the instance?

workybee commented 8 years ago

Both status columns in the console showed up as "Running". I upgraded the ecs-agent to version 1.8.0 and things look better so far. Thank you for the help.

workybee commented 8 years ago

@aaithal Sorry, I forgot to respond to the first part of your comment above: Pretty sure I tried something like "ecs stop-task --task ffffffff-uuuu-nnnn-kkkk-yyyyggggoooo" with no success. Choosing "Stop All" from the "Tasks" tab on the "Cluster" page did seem to clear them out though. (task-id changed for fun, and in case it leaks anything of import)

workybee commented 8 years ago

I spoke too soon. @aaithal or anyone else have ideas? It seems that when I finally upgraded all 4 instances to ecs-agent 1.8.0, everything broke. This morning I find the console (and aws cli) thinks some 40 tasks are associated with the cluster: 27 running, 13 pending. Looking on the actual cluster instances, nothing is running at all except the ecs-agent according to docker and ps, and the CPU usage has been flat all night, confirming it is not doing much, if anything. ;-)

A sample of describe-tasks output:

aws ecs describe-tasks --cluster macc --tasks 010ebe32-99a6-42f3-ac85-61440d6af71f
{
    "failures": [],
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:860733554218:task/010ebe32-99a6-42f3-ac85-61440d6af71f",
            "overrides": {
                "containerOverrides": [
                    {
                        "name": "django_mentad"
                    }
                ]
            },
            "lastStatus": "RUNNING",
            "containerInstanceArn": "arn:aws:ecs:us-east-1:860733554218:container-instance/54659fea-2db6-43a3-9044-540f1712b13f",
            "createdAt": 1455762905.788,
            "clusterArn": "arn:aws:ecs:us-east-1:860733554218:cluster/macc",
            "startedAt": 1455762907.83,
            "desiredStatus": "RUNNING",
            "taskDefinitionArn": "arn:aws:ecs:us-east-1:860733554218:task-definition/prod_new_collect_ads_-u_262317968:4",
            "containers": [
                {
                    "containerArn": "arn:aws:ecs:us-east-1:860733554218:container/f33092df-3534-4ca2-b2d3-13a53ddbcffd",
                    "taskArn": "arn:aws:ecs:us-east-1:860733554218:task/010ebe32-99a6-42f3-ac85-61440d6af71f",
                    "lastStatus": "RUNNING",
                    "name": "django_mentad",
                    "networkBindings": []
                }
            ]
        }
    ]
}

and for a PENDING task:

aws ecs describe-tasks --cluster macc --tasks 21629223-5b44-4d2d-bcb3-34fe2a955974
{
    "failures": [],
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:860733554218:task/21629223-5b44-4d2d-bcb3-34fe2a955974",
            "overrides": {
                "containerOverrides": [
                    {
                        "name": "django_mentad"
                    }
                ]
            },
            "lastStatus": "PENDING",
            "containerInstanceArn": "arn:aws:ecs:us-east-1:860733554218:container-instance/7a6bbe7b-ec41-4f34-ab5c-6e6ad7c2f714",
            "createdAt": 1455761670.714,
            "clusterArn": "arn:aws:ecs:us-east-1:860733554218:cluster/macc",
            "desiredStatus": "RUNNING",
            "taskDefinitionArn": "arn:aws:ecs:us-east-1:860733554218:task-definition/prod_new_collect_ads_-u_115973711880080:6",
            "containers": [
                {
                    "containerArn": "arn:aws:ecs:us-east-1:860733554218:container/1cfb409d-d38b-4c87-b80b-4f8203d9f393",
                    "taskArn": "arn:aws:ecs:us-east-1:860733554218:task/21629223-5b44-4d2d-bcb3-34fe2a955974",
                    "lastStatus": "PENDING",
                    "name": "django_mentad"
                }
            ]
        }
    ]
}

workybee commented 8 years ago

OK, so the ecs-agent, though still running, had lost connectivity with whatever in AWS keeps track of things (the ECS Cluster page showed "Agent Connected": "false"). Running "sudo stop ecs; sleep 10; sudo start ecs" on each instance seemed to get things flowing again. Any logs I should look at? (I was not running the agent in DEBUG mode... ECS_LOGFILE is not set.) On each instance I found lots of "Error" lines like:

[...]
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-05:2016-02-18T05:16:27Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-11:2016-02-18T11:25:30Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-11:2016-02-18T11:25:30Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-16:2016-02-18T16:56:33Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-16:2016-02-18T16:56:33Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-18:2016-02-18T18:42:53Z [CRITICAL] Error saving state before final shutdown module="TerminationHandler" err="Multiple error:
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-18:2016-02-18T18:51:25Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:47f2fcd76a16af1974a904fea1c0717813c17c1489a7193557f0477d174c0026}"

Not sure how relevant these are.

ChrisRut commented 8 years ago

I just experienced similar behavior after upgrading my instances to v1.8.0. I set a number of my clusters' desiredCount to 0, and the console shows there are no more running tasks, but after logging into the EC2 instances I can clearly see the tasks still running. The only relevant logs are the ones @workybee posted (I also was not running in DEBUG mode). It seems there is a bug in the 1.8.0 version of the agent that is causing a disconnect between what the AWS API reports and what is actually running on the instances. I'll try to reproduce this issue in a testing environment with DEBUG on and see if I can't get back to you with additional logs, but this is very troubling...

samuelkarp commented 8 years ago

@workybee Do you have more complete logs? The lines you've posted are not relevant; nothing that mentions tcs relates to task delivery.

@ChrisRut If you're able to reproduce with debug logs, that would be very helpful.

tschutte commented 8 years ago

I have managed to duplicate this on my cluster. What I thought was the issue I referenced in https://github.com/aws/amazon-ecs-agent/issues/281 actually looks to be more related to this one. When a service deploy is initiated, it appears the agent does not stop the running container, while the UI shows the new task in a pending state. If I do nothing more than manually stop and start the ecs agent, the deploy continues as normal. Note that I just updated to the latest AMI (2015.09.f) yesterday from 2015.09.e, which did not have the issue.

The timeline is as follows:
21:17:22 - Updated the service (via a CloudFormation stack update)
21:27:58 - Got tired of waiting and manually restarted the ecs agent
The deploy then finished.

I've attached a debug log to the issue and have scrubbed most identifying info. If someone on the ECS team wants the raw log so that they can cross-reference it against my running ECS cluster, let me know. ecs-agent.log.2016-02-18-21.txt

samuelkarp commented 8 years ago

@tschutte This looks like the most interesting part to me:

2016-02-18T21:27:58Z [CRITICAL] Error saving state before final shutdown module="TerminationHandler" err="Multiple error:
    0: Timed out waiting for TaskEngine to settle"

If this happens again, can you run docker kill -s USR1 ecs-agent before restarting it? It'd also be good to get the Docker daemon logs and anything suspicious in the output of dmesg. It's unclear if what you're seeing is the same as this issue or not.
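A rough collection sequence before restarting might look like this (the /var/log/docker path assumes the ECS-optimized Amazon Linux AMI):

docker kill -s USR1 ecs-agent             # ask the agent to dump its goroutine stacks into its log
sudo tail -n 200 /var/log/docker          # Docker daemon log
dmesg | tail -n 100                       # recent kernel messages
sudo stop ecs; sleep 10; sudo start ecs   # then restart the agent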

workybee commented 8 years ago

So after about 1 hour things locked up again, with the AWS console thinking many tasks are still running, but CPU metrics, docker, and raw ps all saying nothing of interest is running. I am now trying to get DEBUG logs and have added ECS_LOGFILE=/ecs-agent.log to the end of /etc/ecs/ecs.config. When I restarted the ecs-agent ("sudo stop ecs ; sleep 5 ; sudo start ecs" on each instance) I don't seem to get anything into the log file, but all of a sudden things start flowing again. Happy to try other approaches; recommendations? Should I upgrade from 2015.09.e to 2015.09.f?

samuelkarp commented 8 years ago

Can you unset ECS_LOGFILE and just change ECS_LOGLEVEL=debug? Logs should be mounted by default into /var/log/ecs so you should see them there without messing with ECS_LOGFILE.
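Concretely, /etc/ecs/ecs.config would end up looking something like this (the cluster line is just whatever is already there; only the log level matters for this), followed by an agent restart:

# /etc/ecs/ecs.config
ECS_CLUSTER=macc
ECS_LOGLEVEL=debug
# (no ECS_LOGFILE line; agent logs land in /var/log/ecs by default)

sudo stop ecs; sleep 5; sudo start ecs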

2015.09.f fixes a bug with the Agent that might help (and also includes the fix for the glibc vulnerability), so I would recommend updating.

workybee commented 8 years ago

After something less than 53 minutes of uptime, things have frozen up in an inconsistent state again. The console thinks there are 8 PENDING tasks and 48 RUNNING, while docker ps and ps directly on the instances show nothing except the ecs-agent running. I will upgrade to 2015.09.f and try ECS_LOGLEVEL=debug, and post what that produces in an hour or so.

workybee commented 8 years ago

So I am getting logs now and haven't seen things lock up yet ;-) It took a while to get the new instances properly configured, so there is still time... I will be happy if things keep going for another hour (or at least start to feel better).

workybee commented 8 years ago

This new combo of 2015.09.f and ecs-agent version 1.8.0 running with debug enabled seems to be staying in sync between the AWS console and what is actually happening on the cluster instances. I can hope it keeps going through the night - but I'll not be looking hard till tomorrow. ;-) Thanks for the suggestions today. @samuelkarp

tschutte commented 8 years ago

@samuelkarp Thanks. I'll see if I can view one last service update failure and obtain the output of docker kill. Note there was nothing of interest in dmesg or the docker logs other than the standard "GET" messages.

My team is getting frustrated with the frequent failing service updates, so I will also be rolling back to a version of the AMI which is more stable for me. I'll attempt to replicate the issue again in a personal test environment. If you think this issue is not related, I'll move out of this issue and create my own.

tschutte commented 8 years ago

@samuelkarp Here is an agent log with the output of the kill command. Note the service update fired at around 14:17:51 according to the ECS UI, but it reported an AGENT error:

service MyService-1OXQ7AOIF1ZZT was unable to place a task because no container instance met all of its requirements. The closest matching container-instance 4ac6193c-8165-409f-9ab3-b769637b2749 encountered error "AGENT". For more information, see the Troubleshooting section.

I finally ran the docker kill at 14:37. Let me know if you want me to continue in this issue or move to a different one. ecs-agent.log.2016-02-19-14.txt

samuelkarp commented 8 years ago

@tschutte Thank you, that output is super helpful. Can you let me know how many containers you had running on this host? (You should be able to see this in the data file located at /var/lib/ecs/data/ecs_agent_data.json.) Here's what I think is happening:

TL;DR: I think Docker might be responding very slowly and causing the Agent to hang. I will keep investigating to either confirm or deny.

tschutte commented 8 years ago

@samuelkarp There are a total of 3 containers on each host, 1 for the ecs agent, and 2 as part of my task definition with a shared volume between them. I think I have also reproduced this issue using a couple of small sample containers that just loop in a while loop. I'm working now to button up the task definition and will upload a test container to docker hub so you can hopefully observe this behavior for yourself. Stay tuned.

I did go digging through some of the older logs in the hour or two preceding the error and noticed something that I think correlates with the stuck lock. In each case I found a log entry "Writing docker event to the associated task" that was missing the corresponding "Wrote docker event to the associated task" (from docker_task_engine.go line 382).

samuelkarp commented 8 years ago

@tschutte That's 2 in the task definition, but how many times has that task been run on the host? Each time a task is run, it creates new containers. I'm interested in the total number of known containers, which you'll find in that data file. Something like this should show it:

sudo yum -y install jq
sudo cat /var/lib/ecs/data/ecs_agent_data.json | jq '.Data.TaskEngine.Tasks[].Containers[].Name' | wc -l

What you're noting about the log lines is indeed relevant; that's one of the residual effects (channel blocked because there's no active reader) of the lock being held a long time.

tschutte commented 8 years ago

@samuelkarp sorry, I misunderstood the question. The last remaining qa host I have reports a value of 12.

tschutte commented 8 years ago

OK, I have a basic task definition which seems to reliably recreate the issue. I did find I have to wait past the task cleanup wait duration (I set it low here for testing). Basically: launch a task, wait 12-15 minutes, then update the task definition and watch the service enter the weird state where the UI shows no tasks running but the task is still present on the host via "docker ps".

Sample userdata (launched on a barebones 2015.09.f AMI): https://gist.github.com/tschutte/3182549229dc0ca7563c
Sample task definition (using 2 containers with a shared volume; I have no idea whether this is or is not contributing to the problem): https://gist.github.com/tschutte/3cc9db05da951dea5af4

samuelkarp commented 8 years ago

@tschutte If it's only 12, that may invalidate my hypothesis (I saw goroutines blocked for around 700 minutes, implying that there would be something on the order of 1400 containers each taking 30 seconds to timeout). I'll try your reproduction task today.

tschutte commented 8 years ago

Nope, that environment only deploys 3-4 times a day max, times the 2-3 containers. These are very low volume hosts overall, which is why I'm pleased my reproduction task worked on a fairly vanilla EC2 instance launched today. Hopefully you are successful in reproducing it.

euank commented 8 years ago

@samuelkarp In response to your theory three comments ago: the synchronizeState method you link is only called during the engine's Init, so it shouldn't matter.

I have a different theory after glancing through those logs.

First of all, I sorted all the goroutines by how long they'd been hanging to try and figure out when this might have started; grep -Eo "[0-9]+ minutes" | sort -nr | head (skipping the goroutines started within a minute of launch, over 1000 minutes ago) gives us 849 minutes. The goroutines from then:

goroutine 43 [chan send, 849 minutes]:
github.com/aws/amazon-ecs-agent/agent/engine.func·012()
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_container_engine.go:636 +0x44b
created by github.com/aws/amazon-ecs-agent/agent/engine.(*dockerGoClient).ContainerEvents
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_container_engine.go:638 +0x4e4

goroutine 47 [semacquire, 849 minutes]:
sync.(*RWMutex).Lock(0xc2080622b8)
    /usr/src/go/src/sync/rwmutex.go:87 +0xa4
github.com/aws/amazon-ecs-agent/agent/engine.(*managedTask).cleanupTask(0xc208297140, 0x9d29229e000)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/task_manager.go:441 +0x467
github.com/aws/amazon-ecs-agent/agent/engine.(*managedTask).overseeTask(0xc208297140)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/task_manager.go:164 +0xd13
created by github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).startTask
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:317 +0x199

goroutine 48 [chan send, 849 minutes]:
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).handleDockerEvents(0xc208062240, 0x7f54409a5c08, 0xc208270f80)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:383 +0x819
created by github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).Init
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:131 +0x17a

The bits that matter are docker_task_engine.go:383 (if you're on that line of code, you're holding the processTasks.RLock) and task_manager.go:441 (waiting to acquire the processTasks write lock, which can't be granted while that RLock is held). :boom: deadlock.

So, how'd this happen? Well, my guess is that if you get a docker event for a task just as the engine's task manager has decided to delete that task, you get a deadlock.

  1. The handleDockerEvents code holds the lock for both the task lookup and the message sending. If it only held it for the task lookup, this deadlock couldn't possibly happen. Regardless, that's not a problem when the taskManager is properly reading events ASAP as they come in.
  2. The taskManager, during cleanup, stops listening to events for the duration of sweepTask, which is a serial slow operation. That gives a nice window during which a new event can cause a deadlock.

Frankly, reading this code, I'm surprised it doesn't reliably deadlock here with a large number of containers. It looks like in my usual test environment (mmm, overlay), removes end up being so quick that this doesn't happen much in practice. There might be further intricacies as well.

However, after being surprised it doesn't deadlock easily, I threw a 10-container task at it, waited the cleanup duration (I set mine to 60s to speed it along), and immediately deadlocked on removal.

That repro works reliably for me with the ECS optimized AMI (3/3 tries), and works slightly less reliably on an AMI with a sane FS configuration. If it doesn't work for you, just make docker rm slower (without slowing down docker inspect too much as that's needed to emit an event and trigger this deadlock).

You can identify that what I describe is the root cause of some breakage by looking at the debug output and verifying that a "Writing docker event to the associated task" debug log line appears but the corresponding "Wrote docker event" does not appear shortly after, with "Removing container" activity happening for the task around then.

I'll give a go at writing a test that properly models this problem, and then write a fix.

ChrisRut commented 8 years ago

@samuelkarp ,

I finally got around to reproducing this in a test environment. It sounds like @euank is already working on a fix for this issue, but in case you need additional data I have attached my ecs-agent.log with debugging enabled.

313_ecs-agent.log.txt

In the meantime (while we wait for @euank's fix), would you recommend downgrading the agent/AMI, or is there perhaps another workaround we can use to avoid this issue?

workybee commented 8 years ago

@euank's theory is supported by our experience of even more frequent lockups when we tried removing old tasks ourselves, before ecs-agent version 1.8.0 added support for ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m.

A semi-workaround we are trying: a cron job that stops/starts ecs every 30 minutes (with potentially bad side effects we haven't thought of or seen yet, so use at your own risk ;-)
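A rough sketch of such a cron entry (hypothetical file name and schedule, assuming the upstart-managed ecs job on the ECS-optimized Amazon Linux AMI):

# /etc/cron.d/restart-ecs-agent  (hypothetical example; use at your own risk)
*/30 * * * * root /sbin/stop ecs; sleep 10; /sbin/start ecs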

CameronGo commented 8 years ago

Question that is hopefully not out of line: In a case such as this where there is a major bug that disrupts core functionality, would it make sense to re-publish the previous version (v1.7.1 in this case) with the "latest" tag, so that launch configs and autoscaling groups that are not using version pinning remain in a functional state?

jeekajoo commented 8 years ago

I've been experiencing this issue since I upgraded from 1.7.1 to 1.8.0. A stop/start fixes it, but I guess that is only temporary and it's better to downgrade.

seiffert commented 8 years ago

We were also experiencing this issue after upgrading to v1.8.0. Almost all cluster machines reported the ECS agent as disconnected, which caused containers that terminated for some reason not to be restarted on another machine. I consider this a bug that caused disruption of production environments. We worked around it by downgrading to v1.7.1 again. However, I'd really like to see a v1.8.1 release with the fix merged in (#320).

apsey commented 8 years ago

We are also having the same issue with ECS agent disconnection. We are downgrading our agent's version to 1.7.1 until this issue is fixed.

juanrhenals commented 8 years ago

The fix for this deadlock issue has been released as part of Agent v1.8.1. We have also released a new AMI with the v1.8.1 agent bundled.
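One quick way to confirm which agent version an instance is actually running is the agent's local introspection endpoint:

# query the agent introspection API on the container instance itself
curl -s http://localhost:51678/v1/metadata
# the JSON response includes a "Version" field, which should now report v1.8.1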

If you continue to run into issues or have new information, please reopen this issue or open a new one. For now, I don't think we have any additional action items for this issue on our end.

euank commented 8 years ago

After thinking about it a little, I do have one more question related to this issue: basically, what caused it?

The code that was patched wasn't introduced in 1.8.0 (and the changes for 1.8.0 were actually pretty small, based on git diff v1.7.1..v1.8.0 -- . ':!/agent/Godeps/_workspace/' ':!*_test.go' ':!*/service.go' ':!*/api.go' ':!*_mocks.go').

Why did downgrading to v1.7.1 help at all? The things I see as possibilities here:

If it's fixed, it's fixed, and it's not vital to understand all the details of "whyyyyy", but it'd still be cool to know!

I have no plan to investigate further, but if you do I'd like to hear the results.

aaithal commented 8 years ago

We have a theory that this was a side effect of this commit, which was intended to fix losing container IDs in case of inspect errors. With that fix, we no longer miss container IDs, but the resulting state transition during container clean-up leads to the deadlock. You can test it by reverting that commit in the v1.8.0 agent and using the task definition that you posted, with the cleanup time set to 1 minute.
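For anyone who wants to try that experiment, a rough sketch of the steps (the commit SHA is a placeholder for the commit referenced above, and the build step assumes the repo's Makefile; check the project README for the exact targets):

git clone https://github.com/aws/amazon-ecs-agent.git && cd amazon-ecs-agent
git checkout v1.8.0
git revert --no-edit <commit-sha>   # placeholder for the inspect-error fix mentioned above
make                                # build a local agent image (per the project README)
# then run the rebuilt agent with ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=1m and the repro task definition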

n1t1nv3rma commented 8 years ago

The fix for this deadlock issue has been released as part of Agent v1.8.1. AWS has also released a new AMI with the v1.8.1 agent bundled. http://docs.aws.amazon.com/AmazonECS/latest/developerguide/container_agent_versions.html#ecs-optimized-ami-agent-versions

alexmac commented 8 years ago

@n1t1nv3rma It seems like the Amazon yum repo hasn't been updated with this yet? I'm following the update procedure here: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-update.html#procedure_update_ecs-init

I get:

sudo yum install -y ecs-init
Loaded plugins: priorities, update-motd, upgrade-helper
27 packages excluded due to repository priority protections
Package ecs-init-1.8.0-1.amzn1.x86_64 already installed and latest version
Nothing to do

If I update manually using the ECS console I get 1.8.1

ChrisRut commented 8 years ago

@alexmac, you should be able to update your entire fleet by using the following bash one-liner with aws-cli and jq:

for cluster in $(aws ecs list-clusters | jq -r .clusterArns[] | cut -d/ -f2); do echo "--- CLUSTER: ${cluster} ---"; for instance in $(aws ecs list-container-instances --cluster $cluster | jq -r .containerInstanceArns[] | cut -d/ -f2); do echo "--- INSTANCE: ${instance} ---" ; aws ecs update-container-agent --cluster ${cluster} --container-instance ${instance} ; sleep 1 ; done; done

I don't believe the ecs-init package needs to be updated in order to avoid the deadlocks.

nikhilo commented 8 years ago

@ChrisRut I'm seeing the following error when trying the one-liner above with agents at v1.8.0:

A client error (NoUpdateAvailableException) occurred when calling the UpdateContainerAgent operation: There is no update available for your container agent.

Note: we are running a custom CentOS 7 AMI.

ChrisRut commented 8 years ago

@nikhilo, I think you meant that for me. Sorry, that one-liner assumes you are running the "ECS Optimized AMI".