Closed mrapczynski closed 8 years ago
Closing this issue. After a lot of investigation and testing, we are victims of our own environment. Our VMware platform, though large with a lot of available computing power, is experiencing unusual disk latency problems at peak load (primarily during hot backups in the early mornings). These latency problems manifest as very quick bursts, sometimes < 1 sec and other times lasting several seconds, during which a VM, and subsequently the K/V store, does not respond to heartbeats.
We initially started using Consul for the K/V store, but from reading the configuration guide, there are no controls for heartbeat frequency or leader election behavior. Thus when a VM slows down, Consul can be misled into believing a node has failed, which sets off the Swarm manager to begin rescheduling when in reality it should not.
Since discovering this, our K/V backend has been switched to etcd, with noticeably better results. We have the heartbeat interval manually set to 3000ms, and the leader election timeout set to 30000ms as required, and now our cluster can sustain the latency issues without causing all sorts of added problems by unnecessarily rescheduling containers that did not actually fail.
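For reference, those two timeouts map to etcd's `--heartbeat-interval` and `--election-timeout` startup flags (both in milliseconds; etcd requires the election timeout to be several times the heartbeat interval, which is why 30000ms pairs with 3000ms). A minimal sketch of a member started with these values — the node name and URLs are placeholders, not from our cluster:

```shell
# Start an etcd member with relaxed timeouts to tolerate short
# disk-latency bursts. --heartbeat-interval and --election-timeout
# are real etcd flags; hostnames below are illustrative only.
etcd --name node1 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://node1.example.com:2379 \
  --heartbeat-interval 3000 \
  --election-timeout 30000
```

The trade-off is slower failure detection: with these settings, a genuinely dead node can take tens of seconds to be noticed, which is acceptable here because false positives were the actual problem.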
@mrapczynski Thanks for providing the root cause. We may need to keep an eye on Consul
. cc @abronan.
Sorry, I could not think of a more technical issue title as I'm not quite sure yet what I'm dealing with.
Platform: CentOS 7, Docker 1.10.3 CS (we are now a paying customer), Swarm 1.2, Consul 0.6.4. Size of Swarm is 5 engines, 5 managers (1 on each VM), and 5 Consul instances (1 on each VM).
I'm coming into the office each morning and finding that a few of the containers for which I have explicitly enabled auto rescheduling (via an environment variable) have indeed been rescheduled due to what Swarm thought was a node failure. What makes this complicated is that the node never actually failed, but I suspect there is something wonky going on with either (a) the API calls to Consul, or (b) Consul itself.
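For context, the environment variable mentioned above is how Swarm 1.2 (classic Swarm) opts a container into rescheduling — a sketch, with a placeholder image name:

```shell
# Opt a container into Swarm's on-node-failure rescheduling policy
# at creation time. "my-service:latest" is a placeholder image.
docker run -d -e "reschedule:on-node-failure" my-service:latest
```

Any container carrying this policy will be recreated elsewhere as soon as the manager believes its node has failed, which is exactly why a false node-failure signal from the K/V store leads to the duplicate containers described below.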
To compound the problem, because Swarm thinks the original container went down, it schedules a second, unnecessary container. With two containers running and doing double duty, I'm getting errors from other services that are upset about being hit with too much traffic in a given span of time.
I'm starting to wonder if backing my cluster with Consul is a bad idea. When I look through the Consul logs, it seems like Consul is the one with the issues and Swarm is just a victim. You could be led to believe that I have the worst platform in the world, with nodes coming and going every few minutes. Feel free to speak with honesty on this issue. We are not committed to Consul if it is not reliable.
Logs from Swarm Manager
Logs from Consul