Update libnetwork to fix port binding issue

This updates balena-libnetwork to a version that should fix some port binding issues that may happen after balenaEngine or device crashes. Specifically, this balena-libnetwork version cherry-picks this unmerged upstream patch (with minor changes to make it compatible with recent Moby versions).

I cannot comment on the precise details, but this patch essentially changes the order in which some network-related components are initialized, in order to avoid getting into an inconsistent state.

Fixes #272 (or at least should fix some of its occurrences)

Testing

Tested for regressions: Engine unit tests and integration tests passing. Tried it in a meta-balena branch; all tests passed. Also did some manual testing on a Pi 3.

Testing for effectiveness is another story. We don't have a reliable way to reproduce the issue, so I created a version of the Engine meant to crash at a point that triggers the issue. Now, I cannot tell for sure that this reproduces exactly the same case we are seeing in practice, but to me the symptoms look close enough to give good confidence that this is a step in the right direction.

I'll describe in detail what I did to reproduce the issue and test the patch, because this might be a good future reference should other similar issues appear (or this one reappear).

First, based on this analysis we see that the issue happens when the Engine crashes at a more or less specific point. I tried to locate that point; I'm not sure I found it exactly, but I found something -- and then added some code that allows us to force a crash right there:

diff --git a/daemon/container_operations.go b/daemon/container_operations.go
index af69e0474b..029d8176a6 100644
--- a/daemon/container_operations.go
+++ b/daemon/container_operations.go
@@ -825,6 +825,14 @@ func (daemon *Daemon) connectToNetwork(container *container.Container, idOrName
                return err
        }

+       ///////////////////
+       _, err = os.Stat("/mnt/data/crash-the-engine.please")
+       if err == nil {
+               os.Remove("/mnt/data/crash-the-engine.please")
+               panic("Geronimoooooo!")
+       }
+       //////////////////
+
        if !container.Managed {
                // add container name/alias to DNS
                if err := daemon.ActivateContainerServiceBinding(container.Name); err != nil {

For the test itself, I prepared two Engine versions: one containing the patch we are testing (balena-engine-patched), another containing the "crash code" above (balena-engine-crashable). I copied both to the data partition of a Pi 3, so that I can symlink /usr/bin/balena-engine to either of them as needed. And then:

1. First let's reproduce the issue. Initial state: running balena-engine-crashable (but not forcing a crash yet!), user service (container) running, all nice and fine.
2. Run ps aux | grep proxy, check the PIDs. In my case, 2216 and 2226.
3. touch /mnt/data/crash-the-engine.please
4. Restart the service.
5. Engine crashes, service doesn't restart.
6. But we get our stale balena-engine-proxy processes holding the ports. Check with lsof -nP -iTCP -sTCP:LISTEN and ps aux | grep proxy. Notice these are new processes (PIDs 2984 and 2993 in my case) created while bringing up the service again, before the forced crash.
7. reboot
8. We get balena-engine-proxy processes even before we try to start the service (IIUC, they are created as the Engine initializes the network subsystem; it's basically trying to restore the pre-reboot state.)
9. As the device tries to start the service, we get the error we were looking for: "Failed to allocate and map port 80-80: Bind for 0.0.0.0:80 failed: port is already allocated". Service remains in the "Installed" state.
10. Now let's test the patch. Redo steps 1-6.
11. Replace the Engine with the patched version: mount -o remount,rw /, cd /usr/bin/, ln -nfs /mnt/data/balena-engine-patched balena-engine.
12. reboot
13. The service starts normally, no port binding issues at all!

So, looks like the patch helped, Q.E.D. :slightly_smiling_face:

Side note: If we reboot again between steps 9 and 10, the service starts successfully. In this case, we apparently don't create balena-engine-proxy processes before attempting to start the service. I don't know why this happens -- why does this second reboot (apparently) make the internal state consistent again?

balena-os / balena-engine

Update libnetwork to fix port binding issue #428