balena-os / balena-supervisor

Balena Supervisor: balena's agent on devices.
https://balena.io

Supervisor fails to delete a network and local mode push hangs #2370

Open majorz opened 2 months ago

majorz commented 2 months ago

On a CI/CD system where local mode is used a few times a week, balena push hangs because of the following problem with the supervisor/balenaEngine:

Device state apply error Error: Failed to apply state transition steps. (HTTP code 403) unexpected - error while removing network: network <NAME> id <ID> has active endpoints  Steps:["removeNetwork","removeNetwork"]

This is an instance of https://github.com/moby/moby/issues/42119

It is a problem in Docker's libnetwork, where the internal state gets out of sync, possibly due to a race condition or an unclean exit. This leads to Docker refusing to delete the network in question.
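As a rough illustration, the failure can at least be recognized from the error text quoted above. The sketch below is not supervisor code: `parseActiveEndpointsError`, `StuckNetwork`, and the sample network name and ID are hypothetical, and the regular expression assumes the engine keeps the exact wording shown in the log line.

```typescript
// Hypothetical helper, not part of balena-supervisor. It only inspects the
// error text and assumes the wording quoted in this issue stays stable.
interface StuckNetwork {
  name: string;
  id: string;
}

// Matches "... error while removing network: network <NAME> id <ID> has active endpoints"
const ACTIVE_ENDPOINTS_RE =
  /error while removing network: network (\S+) id (\S+) has active endpoints/;

// Returns the network name and ID if the message is the "active endpoints"
// failure, or null for any other error message.
function parseActiveEndpointsError(message: string): StuckNetwork | null {
  const match = ACTIVE_ENDPOINTS_RE.exec(message);
  if (match === null) {
    return null;
  }
  return { name: match[1], id: match[2] };
}

// Example with made-up values for the network name and ID.
const sample =
  "(HTTP code 403) unexpected - error while removing network: " +
  "network my_net id 0123abcd has active endpoints";
console.log(parseActiveEndpointsError(sample));
```

Detecting the condition does not fix it. It would only let a caller report the stuck network clearly instead of retrying the same removeNetwork step.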

The only workaround that worked was restarting the Docker daemon. I tried several less intrusive operations, but none of them helped: docker network prune --force, docker system prune --force, and creating a minimal container, attaching the network to it, and detaching it again to see whether the reference count would be cleared.

I searched extensively for other possible solutions or workarounds, but found none. The real fix needs to be in libnetwork, but the moby issue is stale.

pipex commented 2 months ago

It is hard to see what the supervisor could do here, since there don't seem to be any mechanisms to solve this via the Docker API. A reboot or an engine restart are options, but either one also means blindly interrupting the device's operation, which is unlikely to be something we want to do.