Swarm classic randomly shows stale containers. The data eventually gets updated, but that can take hours (depending on failureCount). Below is the scenario and the repro steps needed to get into the bad state.
The problem is that the refreshLoop wait is proportional to failureCount, so even after a successful connection the loop might take hours to resume. This fix forces the loop to resume once an engine becomes Healthy.
Note: I made some changes to CheckConnectionErr to avoid a race condition and unnecessary blocking.
Summary
1- Simulate a connection failure to a worker node using iptables.
2- Generate enough failures (for example, up to 415), which causes the refresh loop to wait for around 3.5 hours. The 415 failures cause the engine to be marked as unhealthy.
3- Start a container.
4- Open the firewall; this causes the event monitor to resume and eventually run CheckConnectionErr, due to a false-positive exec_start event that runs refreshContainer.
5- Now that the refreshLoop is hung waiting for ~3.5 hours, simulate another connection failure to the same engine, then delete the container created in step 3 and reopen the firewall (the key here is not to let the engine go unhealthy, otherwise an engine_reconnect event is emitted, which results in a full container refresh).
6- Steps 1 through 5 lead to stale data: the container from step 3 is still shown, although it has been deleted.
Detail
For simplicity, these steps can be reproduced on a cluster of one manager and one or more workers.
1- On the worker node, simulate an engine disconnect by running the following iptables rules (it is important to use --reject-with tcp-reset):
sudo iptables -I FORWARD -p tcp --dport 12376 -j REJECT --reject-with tcp-reset
sudo iptables -I FORWARD -p tcp --dport 2376 -j REJECT --reject-with tcp-reset
2- On the manager node, configure your docker client to query swarm classic directly.
3- Assuming you have container-name-refresh-filter set to test123, run docker ps with a filter that forces a container refresh, which eventually fails and increments the failure count (in this example, this generates 900 failures):
for i in {1..900}; do docker ps -f name="test123"; done
4- On the worker node, create a container:
docker run -d alpine ping 127.0.0.1
5- Reopen the firewall:
sudo iptables -D FORWARD 1
sudo iptables -D FORWARD 1
Then run sudo iptables -L --line-numbers to confirm that the rules from step 1 are gone.
6- Run docker ps against swarm classic and confirm that the container from step 4 is listed.
7- Go back to the worker node and repeat step 1.
8- Delete the container created in step 4.
9- Repeat step 5 to open the firewall.
10- Run docker ps against swarm classic and notice that the deleted container is still listed. Wait a few hours until the refresh loop is released, and notice that the container list is then up to date.