This updates balena-libnetwork to a version that should fix some port binding issues that may happen after balenaEngine or device crashes. Specifically, this balena-libnetwork version cherry-picks this unmerged upstream patch (with minor changes to make it compatible with recent Moby versions).
I cannot comment on the precise details, but this patch essentially changes the order in which some network-related components are initialized, in order to avoid getting into an inconsistent state.
Fixes #272 (or at least should fix some of its occurrences)
Testing
Tested for regressions: Engine unit tests and integration tests passing. Tried it in a meta-balena branch; all tests passed. Also did some manual testing on a Pi 3.
Testing for effectiveness is another story. We don't have a reliable way to reproduce the issue, so I created a version of the Engine meant to crash at a point that triggers the issue. I cannot tell for sure that this reproduces exactly the case we are seeing in practice, but to me the symptoms look close enough to give good confidence that this is a step in the right direction.
I'll describe in detail what I did to reproduce the issue and test the patch, because this might be a good future reference should similar issues appear (or this one reappear).
First, based on this analysis we see that the issue happens when the Engine crashes at a more or less specific point. I tried to locate that point; I'm not sure I found it exactly, but I found something -- and then added some code that allows us to force a crash right there:
For the test itself, I prepared two Engine versions: one containing the patch we are testing (balena-engine-patched), another containing the "crash code" above (balena-engine-crashable). I copied both to the data partition of a Pi 3, so that I can symlink /usr/bin/balena-engine to either of them as needed. And then:
1. First let's reproduce the issue. Initial state: running balena-engine-crashable (but not forcing a crash yet!), user service (container) running, all nice and fine.
2. Run ps aux | grep proxy, check the PIDs. In my case, 2216 and 2226.
3. touch /mnt/data/crash-the-engine.please
4. Restart the service.
5. Engine crashes, service doesn't restart.
6. But we get our stale balena-engine-proxy processes holding the ports. Check with lsof -nP -iTCP -sTCP:LISTEN and ps aux | grep proxy. Notice these are new processes (PIDs 2984 and 2993 in my case) created while bringing up the service again, before the forced crash.
7. reboot
8. We get balena-engine-proxy processes even before we try to start the service (IIUC, they are created as the Engine initializes the network subsystem; it's basically trying to restore the pre-reboot state).
9. As the device tries to start the service, we get the error we were looking for: "Failed to allocate and map port 80-80: Bind for 0.0.0.0:80 failed: port is already allocated". Service remains in the "Installed" state.
10. Now let's test the patch. Redo steps 1-6.
11. Replace the Engine with the patched version: mount -o remount,rw /, cd /usr/bin/, ln -nfs /mnt/data/balena-engine-patched balena-engine.
12. reboot
13. The service starts normally, no port binding issues at all!
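The symlink swap used in the steps above can be wrapped in a small helper, which makes it harder to mistype a path when switching between builds repeatedly. This is only a sketch: the function names use_engine and current_engine and the ENGINE_LINK / ENGINE_DIR variables are made up for illustration, and the defaults simply mirror the paths mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): point the Engine symlink at
# one of the prepared builds. Defaults mirror the paths used in the steps above.
ENGINE_LINK="${ENGINE_LINK:-/usr/bin/balena-engine}"
ENGINE_DIR="${ENGINE_DIR:-/mnt/data}"

# use_engine VARIANT, where VARIANT is e.g. "patched" or "crashable".
use_engine() {
    variant="$1"
    target="$ENGINE_DIR/balena-engine-$variant"
    if [ ! -e "$target" ]; then
        echo "no such build: $target" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow the existing link if it points to a directory
    ln -nfs "$target" "$ENGINE_LINK"
}

# Print the build the symlink currently points to.
current_engine() {
    readlink "$ENGINE_LINK"
}
```

On the device, the root filesystem still has to be remounted read-write first (mount -o remount,rw /) before calling use_engine patched or use_engine crashable.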
So, looks like the patch helped, Q.E.D. :slightly_smiling_face:
Side note: If we reboot again between steps 9 and 10, the service starts successfully. In this case, we apparently don't create balena-engine-proxy processes before attempting to start the service. I don't know why this happens -- why does this second reboot (apparently) make the internal state consistent again?
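When comparing the one-reboot and two-reboot cases, it may help to record right after boot which processes hold the port and whether the proxy PIDs changed. The sketch below is one way to do that. It assumes the usual lsof column layout (PID in column 2, address in column 9), and the function names port_holders and new_pids are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for checking for stale port holders.

# port_holders PORT
# Reads `lsof -nP -iTCP -sTCP:LISTEN` output on stdin and prints the PIDs
# listening on PORT. Assumes the default lsof columns:
# COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
port_holders() {
    port="$1"
    awk -v port="$port" 'NR > 1 && $9 ~ (":" port "$") { print $2 }' | sort -un
}

# new_pids "BEFORE" "AFTER"
# Prints every PID from the space-separated AFTER list that is not in BEFORE.
new_pids() {
    before="$1"
    after="$2"
    for pid in $after; do
        case " $before " in
            *" $pid "*) ;;          # already present before, skip it
            *) echo "$pid" ;;
        esac
    done
}
```

A possible use on the device: lsof -nP -iTCP -sTCP:LISTEN | port_holders 80 right after boot, and new_pids "2216 2226" "2984 2993" to confirm that the proxies seen after the crash are new processes.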