Closed acmcelwee closed 6 years ago
Hi @acmcelwee Thank you for the clear bug report. I've passed this along to our back-end team and will update the ticket here accordingly. Please see the more detailed answer below.
I tried to reproduce your issue and see similar behavior, with a slight difference. What I did:
echo b >> /proc/sysrq-trigger
on the container instance where my tasks are running. Is this consistent with what you observed?
This is expected, although not obvious, behavior. To explain in more detail: the agent is responsible for submitting container state changes, so if it isn't running or cannot connect to our backend, the state cannot be updated.
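The behavior described above can be modeled with a toy sketch (the class and method names here are illustrative, not the actual agent or scheduler code): the scheduler's view of task state is only ever updated by reports the agent submits, so when the agent can't run, the last reported state sticks around indefinitely.

```python
from dataclasses import dataclass, field

@dataclass
class Scheduler:
    """Toy model: the scheduler only knows what agents report;
    it has no direct view of the host."""
    task_state: dict = field(default_factory=dict)

    def receive_report(self, task_id: str, state: str) -> None:
        self.task_state[task_id] = state

@dataclass
class Agent:
    connected: bool = True

    def report(self, scheduler: Scheduler, task_id: str, state: str) -> None:
        # If the agent is down or can't reach the backend,
        # the state change is simply never submitted.
        if self.connected:
            scheduler.receive_report(task_id, state)

sched = Scheduler()
agent = Agent()
agent.report(sched, "task-1", "RUNNING")

# Host reboots, containers die, agent can't start back up.
agent.connected = False
agent.report(sched, "task-1", "STOPPED")  # this report is lost

print(sched.task_state["task-1"])  # → RUNNING (stale)
```

The stale RUNNING state is exactly what makes the service appear healthy to the scheduler even though nothing is actually running on the host.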
The ways you can work around this are:
@jhaynes thanks for digging into this. Yeah, that's exactly what I saw. I went w/ the "Fix the Docker state so the ECS agent can start again" option, because there wasn't a natural place to inject any of the other actions w/out a human operator intervening. I'd prefer to kill the instance and start fresh, but for now, just ensuring a clean restart made the most sense.
That said, the fact that tasks in the service continue to report RUNNING state, even when the scheduler hasn't heard from the agent in a long time, really feels like it needs improvement. I understand not giving up on tasks prematurely, but past a certain point of no communication from the agent, it seems like the tasks should be reaped and restarted (obviously based on the desired service capacity and such).
The first two options could be done by monitoring the cluster for agents not in a connected state for an extended period of time and triggering a termination. We use lifecycle events and draining already, so for us, terminating equates to drain+terminate.
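The monitoring approach described above could be sketched as a small filter over container-instance records (a minimal sketch under assumptions: the field names and the 15-minute threshold are illustrative choices, not the real ECS API response shape; in practice you'd feed this from `ListContainerInstances`/`DescribeContainerInstances` and trigger your drain+terminate lifecycle on the result):

```python
import time

# Arbitrary illustrative threshold; tune to how long you're
# willing to tolerate a silent agent.
DISCONNECT_THRESHOLD_SECONDS = 15 * 60

def instances_to_drain(instances, now=None):
    """Return ids of instances whose agent has been disconnected
    longer than the threshold.

    Each record is a dict shaped like (hypothetical fields):
      {"id": str, "agent_connected": bool,
       "disconnected_since": epoch_seconds or None}
    """
    now = time.time() if now is None else now
    stale = []
    for inst in instances:
        if inst["agent_connected"]:
            continue
        if now - inst["disconnected_since"] >= DISCONNECT_THRESHOLD_SECONDS:
            stale.append(inst["id"])
    return stale

# Example fleet: one healthy, one briefly disconnected, one long gone.
now = 1_000_000
fleet = [
    {"id": "i-aaa", "agent_connected": True,  "disconnected_since": None},
    {"id": "i-bbb", "agent_connected": False, "disconnected_since": now - 60},
    {"id": "i-ccc", "agent_connected": False, "disconnected_since": now - 3600},
]
print(instances_to_drain(fleet, now=now))  # → ['i-ccc']
```

Only the instance that has been disconnected past the threshold is flagged, so a transient agent blip doesn't trigger an unnecessary termination.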
If this is currently working as designed, I guess consider this a feature request, rather than a bug report. IMHO, you're asking all of the users of your distributed system to repeatedly solve the same fundamental problem on their own, when it's really a cluster orchestrator problem that can be solved at the orchestrator level.
@acmcelwee Thanks again for your detailed description. I'm closing this issue in favor of #1115.
This might be more of an ECS scheduler and overall system issue, but this team seems like it might be the best place to start a conversation. We've got a ticket open w/ AWS support, but I think it's worth chatting about here in the open.
A couple days ago, we had an ECS host decide to unexpectedly reboot. In a world of disposable, autoscaled instances and container orchestration, it should've been a non-event, but that was far from reality. When the machine finished the reboot, the docker daemon was unable to restart, because of an issue that was recently fixed in libcontainerd. Since the agent runs as a container itself, the agent was never able to start up.

Here's where the fun started -- the ECS scheduler never marked the agent as offline, and the service tasks that had been running on the host went into a weird limbo state. Rather than the cluster scheduler cleaning them up after a reasonable threshold, they continued to stick around in a STOPPED state, and any updates to the service failed because it couldn't stabilize.

Because I wanted to debug the issue and get to the root cause, I detached the instance from our ASG to have a replacement scale up, and kept the problematic instance around for investigation. All the while, our attempted updates to the ECS service continued to fail, leaving our cloudformation stacks stuck trying to roll back to the previous state (which they also couldn't do). Finally, I cleaned up the docker daemon state to get it to start up, manually started the ecs agent, and from there, the tasks in limbo disappeared, the service stabilized, and the cloudformation rollbacks succeeded.
The context is ECS-optimized AMIs and ECS services, all created w/ cloudformation.
Issues that I observed w/ this:
Thoughts:
It's pretty easy to force a host into this state, just
echo b >> /proc/sysrq-trigger
and watch it play out.