aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

Container instance and agent not cleaned up on unclean shutdown #818

Closed: acmcelwee closed this issue 6 years ago

acmcelwee commented 7 years ago

This might be more of an ECS scheduler and overall system issue, but this team seems like it might be the best place to start a conversation. We've got a ticket open w/ AWS support, but I think it's worth chatting about here in the open.

A couple days ago, we had an ECS host decide to unexpectedly reboot. In a world of disposable, autoscaled instances and container orchestration, it should've been a non-event, but that was far from reality. When the machine finished the reboot, the Docker daemon was unable to restart because of an issue that was recently fixed in libcontainerd. Since the agent runs as a container itself, the agent was never able to start up.

Here's where the fun started -- the ECS scheduler never marked the agent as offline, and the service tasks that had been running on the host went into a weird limbo state. Rather than the cluster scheduler cleaning them up after a reasonable threshold, they continued to stick around in a STOPPED state, and any updates to the service failed because it couldn't stabilize. Because I wanted to debug the issue and get to the root cause, I detached the instance from our ASG to have a replacement scale up and kept the problematic instance around for investigation. All the while, our attempted updates to the ECS service continued to fail, leaving our CloudFormation stacks stuck trying to roll back to the previous state (which they also couldn't do). Finally, I cleaned up the Docker daemon state to get it to start up, manually started the ECS agent, and from there the tasks in limbo disappeared, the service stabilized, and the CloudFormation rollbacks succeeded.

The context is ECS-optimized AMIs and ECS services, all created w/ CloudFormation.

Issues that I observed w/ this:

  1. containerd bug (already fixed, but probably won't see a docker version w/ the fix in the ECS-optimized AMIs for a while)
  2. Container instances kept "alive", even if the agent hasn't been connected for a long time
  3. Tasks in purgatory for these phantom container instances that are effectively offline
  4. Purgatory tasks prevent service stabilization

Thoughts:

It's pretty easy to force a host into this state: just `echo b >> /proc/sysrq-trigger` and watch it play out.
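
For anyone who wants to confirm the resulting limbo state from the API side, here's a minimal boto3 sketch (region, cluster name, and instance ARN below are placeholders, not our actual values): after the forced reboot, the container instance still reports ACTIVE, but `agentConnected` is false and its tasks are never reaped.

```python
# Sketch only: confirm the "phantom" container instance state described above.
# Assumes boto3 credentials; region, cluster, and ARN are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

cluster = "my-cluster"  # placeholder
instance_arn = "arn:aws:ecs:us-east-1:123456789012:container-instance/example"  # placeholder

ci = ecs.describe_container_instances(
    cluster=cluster, containerInstances=[instance_arn]
)["containerInstances"][0]

# After the unclean reboot the instance typically still shows ACTIVE,
# while the agent is disconnected and its tasks are still counted as running.
print("status:        ", ci["status"])            # e.g. ACTIVE
print("agentConnected:", ci["agentConnected"])    # False in the broken state
print("runningTasks:  ", ci["runningTasksCount"])
```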

jhaynes commented 7 years ago

Hi @acmcelwee, thank you for the clear bug report. I've passed this along to our back-end team and will update the issue here accordingly. Please see the more detailed answer below.

jhaynes commented 7 years ago

I tried to reproduce your issue and saw similar behavior, with a slight difference. What I did:

Is this consistent with what you observed?

This is expected, although not obvious, behavior. To explain in more detail: the agent is responsible for submitting container state changes, so if it isn't running or cannot connect to our backend, the state cannot be updated.
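
As an illustration of that behavior (not an official tool; the cluster name and ARN below are placeholders), here is a small boto3 sketch showing the consequence: tasks on an instance whose agent is disconnected keep whatever `lastStatus` the agent last reported.

```python
# Sketch only: show stale task state on a container instance whose agent
# is disconnected. Cluster name and ARN are placeholders.
import boto3

ecs = boto3.client("ecs")

cluster = "my-cluster"  # placeholder
instance_arn = "arn:aws:ecs:us-east-1:123456789012:container-instance/example"  # placeholder

task_arns = ecs.list_tasks(
    cluster=cluster, containerInstance=instance_arn
)["taskArns"]

if task_arns:
    tasks = ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]
    for task in tasks:
        # lastStatus stays at the last value the agent reported (e.g. RUNNING),
        # because only the agent submits container/task state changes.
        print(task["taskArn"], task["lastStatus"], task["desiredStatus"])
```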

The ways you can work around this are:

acmcelwee commented 7 years ago

@jhaynes thanks for digging into this. Yeah, that's exactly what I saw. I went w/ the "fix the Docker state so the ECS agent can start again" option because there wasn't a natural place to inject any of the other actions w/out a human operator intervening. I'd prefer to kill the instance and start fresh, but for now, just ensuring a clean restart made the most sense.

That said, the fact that tasks in the service continue to report a RUNNING state even when the scheduler hasn't heard from the agent in a long time really feels like something that needs improvement. I understand not giving up on tasks prematurely, but past a certain point of no communication from the agent, it seems like the tasks should be reaped and restarted (obviously based on the desired service capacity and such).

The first two options could be automated by monitoring the cluster for agents that haven't been in a connected state for an extended period of time and triggering a termination. We use lifecycle events and draining already, so for us, terminating equates to drain+terminate.
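
Something like this boto3 sketch is roughly what I have in mind (the cluster name is a placeholder, and the bookkeeping for "how long has the agent been disconnected" is omitted):

```python
# Sketch of the monitoring approach described above: find ACTIVE container
# instances whose agent is disconnected, set them to DRAINING, and then
# terminate the underlying EC2 instance through the ASG (in our setup, a
# lifecycle hook handles the drain-and-wait on termination).
# Cluster name is a placeholder; threshold tracking is intentionally omitted.
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

CLUSTER = "my-cluster"  # placeholder


def reap_disconnected_instances():
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return

    instances = ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]

    for ci in instances:
        if ci["agentConnected"] or ci["status"] != "ACTIVE":
            continue

        # A real monitor would only act after the agent has been disconnected
        # past some threshold (tracked in CloudWatch or a small state store).
        ecs.update_container_instances_state(
            cluster=CLUSTER,
            containerInstances=[ci["containerInstanceArn"]],
            status="DRAINING",
        )
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=ci["ec2InstanceId"],
            ShouldDecrementDesiredCapacity=False,
        )
```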

If this is currently working as designed, I guess consider this a feature request, rather than a bug report. IMHO, you're asking all of the users of your distributed system to repeatedly solve the same fundamental problem on their own, when it's really a cluster orchestrator problem that can be solved at the orchestrator level.

adnxn commented 6 years ago

@acmcelwee Thanks again for your detailed description. I'm closing this issue in favor of #1115.