docker-archive / classicswarm

Swarm Classic: a container clustering system. Not to be confused with Docker Swarm, which is at https://github.com/docker/swarmkit
Apache License 2.0

fix "docker ps" shows stale container status #2939

Closed · xinfengliu closed this 4 years ago

xinfengliu commented 5 years ago

If a container status-change event is not delivered to the swarm manager, or is not handled successfully by the swarm manager for some reason (e.g. a transient network or node issue), the stale status shown by docker ps persists for a long time.

The reason is this line: https://github.com/docker/swarm/blob/master/api/handlers.go#L580

        tmp.Status = cluster.FullStateString(container.Info.State)

refreshLoop() calls RefreshContainers(false); the false parameter means container.Info does NOT get updated in ucp-swarm-manager's cache.

On the other hand, refreshLoop() does update container.Container with the latest info from the docker engine, so tmp can use the Status field of container.Container directly.

In addition, when handling container events, swarm calls e.refreshContainer(msg.ID, true), which updates both container.Container and container.Info with the latest info from the docker engine. So using the Status field from container.Container is always correct.
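
To make the staleness concrete, here is a minimal, runnable sketch. The struct, field names, and fullStateString stand-in below are simplified assumptions, not swarm's actual types:

    package main

    import "fmt"

    // Simplified stand-ins for swarm's cached container data.
    type info struct{ State string } // only refreshed by RefreshContainers(true) / event handling
    type swarmContainer struct {
        Status string // refreshed on every refreshLoop() pass
        Info   info
    }

    // Stand-in for cluster.FullStateString: render the cached State the way
    // "docker ps" would show it.
    func fullStateString(state string) string {
        if state == "running" {
            return "Up ..."
        }
        return state
    }

    func main() {
        // The container exited and the exit event was lost, but refreshLoop()
        // already picked up the new Status from the engine's container list.
        c := swarmContainer{
            Status: "Exited (0) 2 minutes ago",
            Info:   info{State: "running"}, // stale: the event never arrived
        }

        fmt.Println("what handlers.go builds today :", fullStateString(c.Info.State))
        fmt.Println("what this PR proposes to show:", c.Status)
    }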

dani-docker commented 5 years ago

@xinfengliu

After further investigation, I don't think we should revert to using Container.Status; this data is not current. As for building the Status from Container.Info.State, that data is supposed to be current, since it gets updated through container events or through changes made to the container via the swarm classic APIs. I agree that sometimes, for various reasons, the events might not be delivered, which leads to stale data, but I think we should focus our work on how to recover correctly and re-consume the missed events.

I see a few areas we can look into:

1. If an engine is marked as unhealthy (i.e. we failed x times to connect to it), we should do a full engine refresh when we reconnect (this should already be the case, but we need to confirm).
2. When the event monitor errors out (https://github.com/dani-docker/swarm/blob/3320567e1d6cbe84b248f5c236306187279338b3/cluster/event_monitor.go#L27), we should change the API events call to cover a previous time (see the sketch at the end of this comment).

The above will help with the overall freshness of the cache, not just the "status" field.

But if we want to make the status more "accurate" without addressing the missed events, we can compare when Container.Status and Container.Info.State were last updated and use the more recent one.
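
As a rough illustration of the second point above (asking the engine for events since the last handled timestamp when the monitor reconnects), a sketch using the Docker Engine Go client might look like the following. This is an assumption about the approach, not code from this repo; the client package here (github.com/docker/docker/client) is not the client vendored in swarm classic, option names can vary across client versions, and lastHandled is a hypothetical variable:

    package main

    import (
        "context"
        "fmt"
        "strconv"
        "time"

        "github.com/docker/docker/api/types"
        "github.com/docker/docker/client"
    )

    func main() {
        cli, err := client.NewClientWithOpts(client.FromEnv)
        if err != nil {
            panic(err)
        }

        // Suppose this was the timestamp of the last event we handled before
        // the event monitor errored out.
        lastHandled := time.Now().Add(-5 * time.Minute)

        // Passing "Since" lets the manager replay events it missed while
        // disconnected, instead of only seeing events from now on.
        msgs, errs := cli.Events(context.Background(), types.EventsOptions{
            Since: strconv.FormatInt(lastHandled.Unix(), 10),
        })

        for {
            select {
            case m := <-msgs:
                fmt.Println(m.Type, m.Action, m.Actor.ID)
            case err := <-errs:
                fmt.Println("event stream error:", err)
                return
            }
        }
    }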

dani-docker commented 5 years ago

cc: @wsong

dani-docker commented 5 years ago

I had a chat with @wsong and he made a good point: Container.Status is updated whenever Container.Info.State is updated (https://github.com/dani-docker/swarm/blob/6ba2bfe667746828f42c9b02a37df5f70548eed8/cluster/engine.go#L906), and if an Info update was missed, Container.Status will still be updated as part of the refresh loop. In other words, Container.Status will be as good as, or more accurate than, Container.Info.State.
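
A hedged sketch of why that ordering holds (simplified stand-ins, not the actual code behind the engine.go link above): every path that writes Info.State also writes Status, and the refresh loop writes Status on its own, so Status can never be older than Info.State.

    package main

    import "fmt"

    type containerInfo struct{ State string }

    type container struct {
        Status string
        Info   containerInfo
    }

    // Full refresh (event handling / refreshContainer(id, true)): the engine is
    // inspected and both fields are written together.
    func fullRefresh(c *container, inspectedState, listedStatus string) {
        c.Info.State = inspectedState
        c.Status = listedStatus
    }

    // Partial refresh (refreshLoop -> RefreshContainers(false)): only data from
    // the engine's container list is applied, which carries Status but not Info.
    func partialRefresh(c *container, listedStatus string) {
        c.Status = listedStatus
    }

    func main() {
        c := &container{}
        fullRefresh(c, "running", "Up 3 seconds")        // start event handled normally
        partialRefresh(c, "Exited (137) 10 seconds ago") // exit event missed; next loop pass catches up
        fmt.Printf("Status=%q Info.State=%q\n", c.Status, c.Info.State)
        // Status already reflects the exit while Info.State still says "running",
        // so Status never lags behind Info.State.
    }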

dani-docker commented 5 years ago

This PR addresses the concern I had with consuming events that we have missed. https://github.com/docker/swarm/pull/2854

thaJeztah commented 5 years ago

@xinfengliu could you rebase your PR? Some fixes went into master to make CI work again

@dani-docker PTAL

xinfengliu commented 5 years ago

@thaJeztah Rebased, but some CI jobs still failed.

thaJeztah commented 5 years ago

There's still some flakiness in the tests, no worries; I'll try to restart CI.

xinfengliu commented 5 years ago

@thaJeztah I don't know why the CI failed; do you know how to get CI to pass for this PR? Thanks.

xinfengliu commented 5 years ago

@wsong I don't know why the CI failed; could you advise how to get CI to pass for this PR? A few customers are waiting for the fix. Thanks.

thaJeztah commented 5 years ago

Looks like something with networking:

    # Error starting daemon: Error initializing network controller: Error creating default "bridge" network: unable to remove jump to DOCKER-ISOLATION rule in FORWARD chain:  (iptables failed: iptables --wait -D FORWARD -j DOCKER-ISOLATION: iptables: Resource temporarily unavailable.

wsong commented 5 years ago

Unfortunately the CI in this repo is quite old and stale and has lots of flakiness.

The only test that failed is docker rmi, and the failure appears unrelated. I'd say this is probably fine to merge once it's been approved.