Swarm classic randomly shows stale containers. The data eventually gets updated, but that can take hours (depending on failureCount). Below is the scenario and the repro steps needed to get into the bad state.
The problem is that the refreshLoop wait is proportional to failureCount, so even after a successful connection the loop might take hours to resume. This fix forces the loop to resume once an engine becomes Healthy.
Note: I made some changes to CheckConnectionErr to avoid a race condition and unnecessary blocking.
Summary
1- Simulate a connection failure to a worker node using iptables.
2- Generate enough failures (for example, up to 415), which causes the refresh loop to wait for around 3.5 hours. The 415 failures cause the engine to be marked as unhealthy.
3- Start a container.
4- Open the firewall; this causes the event monitor to resume and eventually run CheckConnectionErr, due to a false-positive exec_start event that runs refreshContainer.
5- Now that the refreshLoop is hung waiting for ~3.5 hours, simulate another connection failure to the same engine, then delete the container created in step 3 and reopen the firewall (the key here is not to let the engine go unhealthy, otherwise an engine_reconnect event is emitted, which results in a full container refresh).
6- Steps 1 through 5 lead to stale data: the container from step 3 is still shown, although it has been deleted.
Detail
For simplicity, these steps can be reproduced on a cluster of one manager and one or more workers.
1- On the worker node, simulate an engine disconnect by running the following iptables rules (it is important to use --reject-with tcp-reset):
sudo iptables -I FORWARD -p tcp --dport 12376 -j REJECT --reject-with tcp-reset
sudo iptables -I FORWARD -p tcp --dport 2376 -j REJECT --reject-with tcp-reset
2- On the manager node, configure your docker client to query swarm classic directly.
3- Assuming you have container-name-refresh-filter set to test123, run docker ps with a filter that forces a container refresh, which eventually fails and increments the failure count (in this example, this generates 900 failures):
for i in {1..900}; do docker ps -f name="test123"; done
4- On the worker node, create a container:
docker run -d alpine ping 127.0.0.1
5- Reopen the firewall:
sudo iptables -D FORWARD 1
sudo iptables -D FORWARD 1
Then run sudo iptables -L --line-numbers to confirm that the rules from step 1 are gone.
6- Run docker ps against swarm classic and confirm that the container from step 4 is listed.
7- Go back to the worker node and repeat step 1.
8- Delete the container created in step 4.
9- Repeat step 5 to open the firewall.
10- Run docker ps against swarm classic and notice that the deleted container is still listed. Wait a few hours until the refresh loop is released, and notice that the container list is then up to date.