balena-os / balena-engine

Moby-based Container Engine for Embedded, IoT, and Edge uses
https://www.balena.io
Apache License 2.0
696 stars 67 forks source link

Update libnetwork to fix port binding issue #428

Closed lmbarros closed 1 year ago

lmbarros commented 1 year ago

This updates balena-libnetwork to a version that should fix some port binding issues that may happen after balenaEngine or device crashes. Specifically, this balena-libnetwork version cherry-picks this unmerged upstream patch (with minor changes to make it compatible with recent Moby versions).

I cannot comment on the precise details, but this patch essentially changes the order of initialization of some network-related components in order to avoid getting into a inconsistent state.

Fixes #272 (at least shall fix some of its occurrences)

Testing

Tested for regressions: Engine unit tests and integration tests passing. Tried it in a meta-balena branch; all tests passed. Also did some manual testing on a Pi 3.

Testing for effectiveness is another story. We don't have a reliable way to reproduce the issue, so I created a version of the Engine meant to crash at a point that triggers the issue. Now, I cannot tell for sure this is reproducing exactly the same case we are seeing in practice, but to me the symptoms look close enough to give a good confidence this is a step in the right direction.

I'll describe in details what I did to reproduce the issue and test the patch because this might be a good future reference should other similar issues appear (or this one re-appear).

First, based on this analysis we see that the issue happens when the Engine crashes at a more or less specific point. I tried to locate such point; not sure I found it exactly, but I found something -- and then added some code that allows us to force a crash right there:

diff --git a/daemon/container_operations.go b/daemon/container_operations.go
index af69e0474b..029d8176a6 100644
--- a/daemon/container_operations.go
+++ b/daemon/container_operations.go
@@ -825,6 +825,14 @@ func (daemon *Daemon) connectToNetwork(container *container.Container, idOrName
                return err
        }

+       ///////////////////
+       _, err = os.Stat("/mnt/data/crash-the-engine.please")
+       if err == nil {
+               os.Remove("/mnt/data/crash-the-engine.please")
+               panic("Geronimoooooo!")
+       }
+       //////////////////
+
        if !container.Managed {
                // add container name/alias to DNS
                if err := daemon.ActivateContainerServiceBinding(container.Name); err != nil {

For the test itself, I prepared two Engine versions: one containing the patch we are testing (balena-engine-patched), another containing the "crash code" above (balena-engine-crashable). I copied both to the data partition of a Pi 3, so that I can symlink /usr/bin/balena-engine to either of them as needed. And then:

  1. First let's reproduce the issue. Initial state: running balena-engine-crashable (but not forcing a crash yet!), user service (container) running, all nice and fine.
  2. Run ps aux | grep proxy, check the PIDs. In my case, 2216 and 2226.
  3. touch /mnt/data/crash-the-engine.please
  4. Restart the service.
  5. Engine crashes, service doesn't restart.
  6. But we get our stale balena-engine-proxy processes holding the ports. Check with lsof -nP -iTCP -sTCP:LISTEN and ps aux | grep proxy. Notice these are new processes (PIDs 2984 and 2993 in my case) created while bringing up the service again, before the forced crash.
  7. reboot
  8. We get balena-engine-proxy processes even before we try to start the service (IIUC, they are created as the Engine initializes the network subsystem; it's basically trying to restore the pre-reboot state.)
  9. As the device tries to start the service, we get the error we were looking for: "Failed to allocate and map port 80-80: Bind for 0.0.0.0:80 failed: port is already allocated". Service remains in the "Installed" state.
  10. Now let's test the patch. Redo steps 1-6.
  11. Replace the Engine with the patched version: mount -o remount,rw /, cd /usr/bin/, ln -nfs /mnt/data/balena-engine-patched balena-engine.
  12. reboot
  13. The service starts normally, no port binding issues at all!

So, looks like the patch helped, Q.E.D. :slightly_smiling_face:

Side note: If we reboot again between steps 9 and 10 , the service starts successfully. In this case, we apparently don't create balena-engine-proxy processes before attempting to start the service. I don't know why this happens -- why does this second reboot (apparently) makes the internal state consistent again?

klutchell commented 1 year ago

Very clever test case reproduction, looks good to me!