Closed: xinfengliu closed this 4 years ago
@xinfengliu
After further investigation, I don't think we should revert to using Container.Status; this data is not current. As for building the status from Container.Info.State, this data is supposed to be current, as it gets updated through container events or through changes made to the container via the Swarm classic APIs.
I agree that sometimes, for multiple reasons, the events might not be delivered, which will lead to stale data, but I think we should focus our work on how to recover correctly and re-consume the missed events.
I see a few areas that we can look into:
1. If an engine is marked as unhealthy (i.e. we failed x times to connect to it), we should do a full engine refresh when we reconnect (this should already be the case, but we need to confirm).
2. When the event monitor errors out (https://github.com/dani-docker/swarm/blob/3320567e1d6cbe84b248f5c236306187279338b3/cluster/event_monitor.go#L27), we should change the events API call to cover a previous time window.
The above will help with the overall accuracy of the cache, not just the "status" field. But if we want to make the status more "accurate" without addressing the missed events, we can compare the timestamps of Container.Status and Container.Info.State and use the latest.
cc: @wsong
Had a chat with @wsong and he made a good point: Container.Status is updated whenever Container.Info.State is updated (https://github.com/dani-docker/swarm/blob/6ba2bfe667746828f42c9b02a37df5f70548eed8/cluster/engine.go#L906). And if an Info update was missed, Container.Status will still be updated as part of the refresh loop. In other words, Container.Status will be as good as or more accurate than Container.Info.State.
This PR addresses the concern I had about consuming events that we have missed: https://github.com/docker/swarm/pull/2854
@xinfengliu could you rebase your PR? Some fixes went into master to make CI work again
@dani-docker PTAL
@thaJeztah Rebased, but some CI checks still failed.
There's still some flakiness in the tests, no worries; I'll try to restart CI.
@thaJeztah, I don't know why the CI failed. Do you know how to get CI to pass for this PR? Thanks.
@wsong I don't know why the CI failed. Could you advise how to get CI to pass for this PR? A few customers are waiting for the fix. Thanks.
Looks like something with networking:
# Error starting daemon: Error initializing network controller: Error creating default "bridge" network: unable to remove jump to DOCKER-ISOLATION rule in FORWARD chain: (iptables failed: iptables --wait -D FORWARD -j DOCKER-ISOLATION: iptables: Resource temporarily unavailable.
Unfortunately the CI in this repo is quite old and stale and has lots of flakiness.
The only test that failed is docker rmi, and the failure appears unrelated. I'd say this is probably fine to merge once it's been approved.
If the container status-change event is not delivered to the swarm manager, or not handled successfully by the swarm manager for some reason (e.g. a transient network issue or a node issue), the stale status info from docker ps will persist for a long time. The reason is here: https://github.com/docker/swarm/blob/master/api/handlers.go#L580
refreshLoop() calls RefreshContainers(false); the param false means container.Info does NOT get updated in ucp-swarm-manager's cache. On the other hand, refreshLoop() does make container.Container get the latest info from the docker engine, so tmp can directly use the Status field of container.Container.
In addition, container event handling calls e.refreshContainer(msg.ID, true), which updates both container.Container and container.Info with the latest info from the docker engine. So using the Status from container.Container is always right.
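The two refresh paths described above can be sketched as follows. The cachedContainer type and its field names are illustrative, not the real swarm structs; the point is only the full=false vs full=true semantics:

```go
package main

import "fmt"

// cachedContainer mimics the two views swarm keeps per container:
// Container (always refreshed) and Info (only refreshed on a "full"
// refresh or when an event is handled). Field names are illustrative.
type cachedContainer struct {
	ContainerStatus string // stands in for Container.Status
	InfoState       string // stands in for Container.Info.State
}

// refresh models the behavior described above: full=false (the
// periodic refreshLoop) updates only the Container view, while
// full=true (event handling via refreshContainer(msg.ID, true))
// updates both views from the engine.
func (c *cachedContainer) refresh(engineStatus string, full bool) {
	c.ContainerStatus = engineStatus
	if full {
		c.InfoState = engineStatus
	}
}

func main() {
	c := &cachedContainer{ContainerStatus: "running", InfoState: "running"}

	// Periodic refresh loop: Info goes stale, Container.Status is current.
	c.refresh("exited", false)
	fmt.Println(c.ContainerStatus, c.InfoState) // exited running

	// Event handling: full refresh brings Info up to date as well.
	c.refresh("exited", true)
	fmt.Println(c.ContainerStatus, c.InfoState) // exited exited
}
```

This is why, if an event is missed, the Container view is the one the refresh loop keeps correct, and so its Status field is the safe one to report.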