SensorsIot / IOTstack

Docker stack for getting started on IOT on the Raspberry PI

Server loses connectivity after ~24 hours #770

Closed - calcut closed this 3 weeks ago

calcut commented 4 weeks ago

I've run into an issue running this on a Hetzner cloud server. Still narrowing it down, but I'm suspecting something related to DHCP that has been configured by IoTStack (or possibly pibuilder).

Basically about 24 hours after a reboot, the server loses all network connectivity. I can't ssh in, although I can get in through Hetzner's web console and see that ping 8.8.8.8 fails.

Slightly more detail is provided by sudo journalctl -xeu networking.service:

Jun 04 13:34:31 debian-4gb-dwt systemd[1]: networking.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit networking.service has entered the 'failed' state with result 'exit-code'.
Jun 04 13:34:31 debian-4gb-dwt systemd[1]: Failed to start networking.service - Raise network interfaces.
░░ Subject: A start job for unit networking.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit networking.service has finished with a failure.
░░ 
░░ The job identifier is 503 and the job result is failed.

The server is an ARM architecture running Debian 12. Using more or less the typical stuff - NodeRED, Influx, Grafana 1.x, Portainer, and Nginx Proxy Manager. But I've seen the issue with most of those containers taken down.

The main thing I've tried as a fix is commenting out the line allowinterfaces eth*,wlan* in /etc/dhcpcd.conf, but that doesn't seem to help.

Any tips?

Other potentially relevant info

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet XX.XX.XX.XX/32 brd 95.216.199.64 scope global dynamic eth0
       valid_lft 69568sec preferred_lft 69568sec
3: br-e91a4a57e3b6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
    inet XX.XX.XX.XX/16 brd 172.18.255.255 scope global br-e91a4a57e3b6
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
    inet XX.XX.XX.XX/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
6: veth01334b1@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-e91a4a57e3b6 state UP group default 
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff link-netnsid 1
8: vetha2e19f6@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-e91a4a57e3b6 state UP group default 
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff link-netnsid 5
10: veth3ba447f@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-e91a4a57e3b6 state UP group default 
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff link-netnsid 4
12: veth21ed5a8@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-e91a4a57e3b6 state UP group default 
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff link-netnsid 0
14: veth846e638@if13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-e91a4a57e3b6 state UP group default 
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff link-netnsid 3
16: vethd6c7c68@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-e91a4a57e3b6 state UP group default 
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff link-netnsid 2
Paraphraser commented 4 weeks ago

Intriguing. In my case, the closest ARM platform I can get is:

4GB Raspberry Pi 4 Model B Rev 1.1 running Debian GNU/Linux 12 (bookworm) as full 64-bit OS

This was built using Raspberry Pi OS plus PiBuilder.

My Raspberry Pi OS starting point is always the "with desktop" image, rather than either the "with desktop and recommended software" or "lite" image variants.

Originally, there was no reason for this being my starting point other than "it seemed like a good idea at the time". I've just stuck with it ever since. If it ain't broke...

The reason why the starting point may be important is because of something I've noticed when building Proxmox guests. I start from the Debian "netinst" ISO but the final behaviour depends on options I choose during the install of the system (ie long before PiBuilder does anything). In particular, if I turn off the "Debian desktop environment" in the OS installer then Network Manager does not get installed and neither do some other useful things like the avahi daemon.

On the Pi, my rule-of-thumb has been "if it's Bookworm or later then Network Manager is active, otherwise it's 'older-style' networking".

I'd struggle to define exactly what 'older-style' really means.

Although I haven't tried it, it's possible that starting from the Raspberry Pi OS "Lite" image (which, according to the doco, has no Desktop environment) will behave like the netinst.iso on Proxmox with the desktop environment turned off: no network manager.

You may be wondering if there's a point to all the above. There is.

Way back in the early days of IOTstack on Buster, we encountered two problems:

  1. Pis would hang during boot-up; and
  2. WiFi interfaces kept disappearing.

The hang during boot-up was solved by @gbsmith - see #253 - with the initial solution being to explicitly allow DHCP on eth0 and wlan0 (implicitly denying DHCP to all other interfaces). This was later modified to the wildcard form you will see in the recommended patch, which caters for people with USB-to-Ethernet dongles. This patch is implemented by PiBuilder.
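
For reference, the two incarnations of that patch in /etc/dhcpcd.conf look roughly like this (the wildcard form is what PiBuilder applies today; the explicit form was the original fix):

# original, explicit form
allowinterfaces eth0,wlan0

# later wildcard form - also covers USB-to-Ethernet dongles
allowinterfaces eth*,wlan*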

I did note, in passing, that Proxmox guests use ensNN naming for virtual Ethernet ports. In theory, that means the current allowinterfaces rule would exclude those. However, those interfaces do obtain dynamic IP addresses via DHCP and I've been assuming that Proxmox must be managing the DHCP requests.

I've just done a bit more experimenting. The baseline /etc/dhcpcd.conf seems to be non-existent on Bookworm systems. But that's in the presence of Network Manager. PiBuilder adds the patch but disabling the patch results in no material difference in ip a. Under Buster (ie no Network Manager), disabling the patch results in the Docker virtual interfaces gaining link-local IP addresses. That was the original problem - that allocation process could create a race condition and the Pi would hang during boot-up.

All up, I'd say the patch is still needed for non-Network Manager systems but is ignored in Network Manager systems. I'll fix PiBuilder so it doesn't bother patching Network Manager systems.

There's background to the second problem here.

Initially, this problem seemed to be confined to WiFi interfaces but it later became apparent that Ethernet interfaces would also occasionally go walkabout so I generalised the "fix".

The fix is brute force and pretty darn unsubtle. PiBuilder only tries to activate it if Network Manager is not running and, even then, it only applies to eth0 and wlan0. It also assumes /etc/rc.local runs at boot time and my journey through Proxmox has made me realise that isn't always true.

The reason it is only needed if Network Manager is not running is because Network Manager already tries to keep interfaces active. And does a much better job of it!

Rolling all that together, I'd say:

  1. Check if Network Manager is running.
  2. If Network Manager isn't running then:

    • see if PiBuilder has installed /usr/bin/isc-dhcp-fix.sh
    • see if /etc/rc.local purports to launch isc-dhcp-fix.sh and that the eth0 interface is listed on the command line.
    • verify that rc.local is actually being launched. Something like:

      $ ps -ax | grep isc-dhcp-fix.sh | grep -v grep
    • make sure the /etc/dhcpcd.conf patch is still in place because it's likely to be doing useful things (those checks are bundled into the sketch below).
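
Bundling those checks into one hit, something like this should cover it (paths assume a PiBuilder-style install):

$ systemctl is-active NetworkManager.service     # if "active", stop here - Network Manager has it covered
$ ls -l /usr/bin/isc-dhcp-fix.sh                 # did PiBuilder install the script?
$ grep isc-dhcp-fix /etc/rc.local                # does rc.local launch it, with eth0 on the command line?
$ ps -ax | grep isc-dhcp-fix.sh | grep -v grep   # is it actually running?
$ cat /etc/dhcpcd.conf                           # is the allowinterfaces patch still in place?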

Now to some cold hard facts:

  1. I have never encountered a freezing interface on a Proxmox guest.
  2. I have never encountered a freezing interface on a Raspberry Pi running Bookworm.
  3. In both of the above, NetworkManager is active so the isc-dhcp-fix.sh mechanism is never activated by PiBuilder.
  4. I still encounter freezing interfaces on Raspberry Pis running Bullseye but the isc-dhcp-fix.sh takes care of the problem, both on Ethernet and WiFi.

I wanted to compare what I was seeing on my Bookworm Raspberry Pi with the steps/output in your issue.

First, is Network Manager running?

$ sudo systemctl is-active NetworkManager.service
active

$ nmcli -t -f RUNNING general
running

Yes. Network Manager is running.

What is the state of the interfaces?

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether dc:a6:32:41:60:ef brd ff:ff:ff:ff:ff:ff
    inet 192.168.132.100/24 brd 192.168.132.255 scope global dynamic noprefixroute eth0
       valid_lft 603610sec preferred_lft 603610sec
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether dc:a6:32:41:60:f0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.132.101/24 brd 192.168.132.255 scope global dynamic noprefixroute wlan0
       valid_lft 607839sec preferred_lft 607839sec
4: br-6b22c68d7e90: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:13:af:bb:0d brd ff:ff:ff:ff:ff:ff
    inet 172.30.0.1/22 brd 172.30.3.255 scope global br-6b22c68d7e90
       valid_lft forever preferred_lft forever
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:ae:0a:14:32 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
9: veth82a6071@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-6b22c68d7e90 state UP group default 
    link/ether 26:0b:5c:61:1c:4a brd ff:ff:ff:ff:ff:ff link-netnsid 1
11: veth4843c76@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-6b22c68d7e90 state UP group default 
    link/ether 12:05:10:2f:43:49 brd ff:ff:ff:ff:ff:ff link-netnsid 2
13: vethafd40c4@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-6b22c68d7e90 state UP group default 
    link/ether 9a:7e:b1:1d:36:0a brd ff:ff:ff:ff:ff:ff link-netnsid 3
15: veth392ca19@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-6b22c68d7e90 state UP group default 
    link/ether 9a:1c:92:c5:a8:e7 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Is isc-dhcp-fix.sh running?

$ ps -ax | grep isc-dhcp-fix.sh | grep -v grep
$

No! This is the expected behaviour in a system built by PiBuilder in the presence of Network Manager.

Is DHCP (theoretically) restricted to the expected Raspberry Pi physical interfaces?

$ cat /etc/dhcpcd.conf

# patch needed for IOTstack - stops RPi freezing during boot.
# see https://github.com/SensorsIot/IOTstack/issues/219
# see https://github.com/SensorsIot/IOTstack/issues/253
allowinterfaces eth*,wlan*

Yes. Again, the expected behaviour with PiBuilder. And, if you go back to the ip a output, you'll see none of the veth interfaces has a visible IP address. The docker0 and br-* addresses are allocated by Docker. But, if you're running Network Manager, this is irrelevant (as explained earlier).

What's the story with my "networking service"?

$ sudo systemctl status networking.service
● networking.service - Raise network interfaces
     Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
     Active: active (exited) since Thu 2024-05-23 11:25:05 AEST; 1 week 6 days ago
       Docs: man:interfaces(5)
   Main PID: 541 (code=exited, status=0/SUCCESS)
        CPU: 251ms

Now, I have to admit that I had never before considered the "networking service" and, beyond the obvious inference from its name that it has something to do with networking, I really have no idea what it does or how it fits in. In other words, although I can see it has exited, I have no idea whether that's significant. I can assure you this Pi is working despite that exit status so, on balance, I'd be inclined to ignore it.

Do I see anything about the networking service in my log?

$ sudo journalctl -xeu networking.service
-- No entries --

No. How long has this Pi been up?

$ uptime
 11:48:10 up 13 days, 22 min,  1 user,  load average: 0.32, 0.19, 0.11

Well, that sort of matches-up. Today (June 5) minus 13 days is May 23 which is what the status output is showing, so it looks like this starts and exits at boot time.

After a process of elimination, I find that I actually have to go back 55 days to find anything about this service in my journal:

$ sudo journalctl -xu networking.service --since "55 days ago"
Apr 11 11:59:02 tri-dev systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ An ExecStart= process belonging to unit networking.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 11 11:59:03 tri-dev systemd[1]: networking.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit networking.service has entered the 'failed' state with result 'exit-code'.
Apr 11 11:59:03 tri-dev systemd[1]: Failed to start networking.service - Raise network interfaces.
░░ Subject: A start job for unit networking.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit networking.service has finished with a failure.
░░ 
░░ The job identifier is 99 and the job result is failed.
-- Boot 50378e6b432a44d2b3c6d2ce40eda655 --
Apr 11 12:00:04 tri-dev systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ An ExecStart= process belonging to unit networking.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 11 12:00:04 tri-dev systemd[1]: networking.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit networking.service has entered the 'failed' state with result 'exit-code'.
Apr 11 12:00:04 tri-dev systemd[1]: Failed to start networking.service - Raise network interfaces.
░░ Subject: A start job for unit networking.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit networking.service has finished with a failure.
░░ 
░░ The job identifier is 104 and the job result is failed.
-- Boot 3d183be1547848cc88c637958a051321 --
Apr 11 15:31:29 tri-dev systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ An ExecStart= process belonging to unit networking.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 11 15:31:30 tri-dev systemd[1]: networking.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit networking.service has entered the 'failed' state with result 'exit-code'.
Apr 11 15:31:30 tri-dev systemd[1]: Failed to start networking.service - Raise network interfaces.
░░ Subject: A start job for unit networking.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit networking.service has finished with a failure.
░░ 
░░ The job identifier is 96 and the job result is failed.
-- Boot 22fe94cd1d7249709a7efcd8679fbef1 --
Apr 11 15:32:48 tri-dev systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ An ExecStart= process belonging to unit networking.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 11 15:32:48 tri-dev systemd[1]: networking.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit networking.service has entered the 'failed' state with result 'exit-code'.
Apr 11 15:32:48 tri-dev systemd[1]: Failed to start networking.service - Raise network interfaces.
░░ Subject: A start job for unit networking.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit networking.service has finished with a failure.
░░ 
░░ The job identifier is 95 and the job result is failed.

That's a lot of crud to wade through so let me reduce it to what I see as its essence:

Apr 11 11:59:02 Subject: Unit process exited
Apr 11 11:59:03 Subject: Unit failed
Apr 11 11:59:03 Subject: A start job for unit networking.service has failed
-- Boot 50378e6b432a44d2b3c6d2ce40eda655 --
Apr 11 12:00:04 Subject: Unit process exited
Apr 11 12:00:04 Subject: Unit failed
Apr 11 12:00:04 Subject: A start job for unit networking.service has failed
-- Boot 3d183be1547848cc88c637958a051321 --
Apr 11 15:31:29 Subject: Unit process exited
Apr 11 15:31:30 Subject: Unit failed
Apr 11 15:31:30 Subject: A start job for unit networking.service has failed
-- Boot 22fe94cd1d7249709a7efcd8679fbef1 --
Apr 11 15:32:48 Subject: Unit process exited
Apr 11 15:32:48 Subject: Unit failed
Apr 11 15:32:48 Subject: A start job for unit networking.service has failed

All those reboots tell me I was chasing a problem and, if memory serves, the Pi had developed an intermittent fault where either I couldn't connect to it over SSH, or I'd be connected and the connection would reset.

My first suspicion was the Pi itself but I soon realised that I could connect to it just fine via its WiFi interface, so I started to focus on the Ethernet side of things: the port, the cable, the switch port.

It turned out to be a bad Ethernet patch cable. The cable wasn't new (it had been working for years). It hadn't been crunched or twisted. It tested OK for continuity. It had just somehow gone flaky. As soon as I replaced it, the problem went away.

Maybe you have a similar problem? To be honest, I don't see how that explains your "after 24 hours" but it's still something to consider.

And, on that topic, in the last six months I've had to throw out an old gigabit switch which, after a decade of faithful service, suddenly decided to reduce its transfer rate to about 5 Mbps. There was no evidence of any malfunction in any of the LEDs but when I picked it up to start swapping cables between ports, it was hotter than the blazes of Hell. I dropped it with a yelp and turned it off.

Hope this helps.

calcut commented 4 weeks ago

Wow... lots to think about.

For now:

I don't seem to be using Network Manager:

$ sudo systemctl is-active NetworkManager.service
inactive

$ nmcli -t -f RUNNING general
-bash: nmcli: command not found

/usr/bin/isc-dhcp-fix.sh doesn't exist, neither does /etc/rc.local

As for physical hardware problems - who knows! It's a cloud server so I have no physical access. TBH I tried running my own hardware, but my internet upload speed is miserable so I needed to switch to cloud hosting for remote access.

Starting to think I should start fresh and add in the containers I need without PiBuilder, which isn't really targeting what I'm trying to do.

Paraphraser commented 4 weeks ago

You said Debian 12 but with Network Manager out of the picture, I'm surmising this system would likely have been built as I described before (no desktop).

Or words to that effect...

Now, if I'm correct about all my other suppositions about what does or doesn't work when Network Manager is or isn't running, I think this is what I would try next.

First, the body of the isc-dhcp-fix.sh script:

#!/bin/bash

logger "isc-dhcp-fix launched"

# loop forever (the argument list is never consumed, so the test stays true)
while [ $# -gt 0 ] ; do
   for CARD in $@ ; do
      # does this interface currently have an IPv4 address?
      ifconfig "$CARD" | grep -Po '(?<=inet )[\d.]+' &> /dev/null
      if [ $? != 0 ]; then
         # no address - bring the interface up again and give DHCP time to respond
         logger "isc-dhcp-fix resetting $CARD"
         ifconfig "$CARD" up
         sleep 5
      fi
      sleep 1
   done
   sleep 1
done

Stick that at the path below and make the ownership and permissions match those shown:

-r-xr-xr-x 1 root root 325 Nov 19  2023 /usr/bin/isc-dhcp-fix.sh
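
If you're creating the file by hand, something like this should produce that ownership and mode:

$ sudo chown root:root /usr/bin/isc-dhcp-fix.sh
$ sudo chmod 555 /usr/bin/isc-dhcp-fix.sh

Note, too, that the script relies on ifconfig so, if that isn't present on your Debian 12 server, sudo apt install net-tools will provide it.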

Launch it like this (approximates rc.local):

$ sudo -s
# /usr/bin/isc-dhcp-fix.sh eth0 &
# disown -h %1
# exit

Then you can log out and it will continue running. Periodically, you can see what it's been up to with some variation on:

$ journalctl --since "1 day ago" --grep "isc-dhcp-fix" -q --no-pager -o short

If it has "fired" in the sense of re-enabling the eth0 interface (you'd see "isc-dhcp-fix resetting eth0" entries in the journal), then you will have evidence supporting a conclusion that the loss of connectivity you're seeing is explained by the network interface going down.

If you lose connectivity but find the script hasn't fired then the interface probably isn't going down so you'll have to look for a different explanation.
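
One caveat: launched that way, the script won't survive a reboot. That's not how PiBuilder does it (it relies on /etc/rc.local) but, if you wanted something persistent on a system without rc.local, a minimal systemd unit along these lines should do the job. Save it as /etc/systemd/system/isc-dhcp-fix.service:

[Unit]
Description=keep eth0 alive via isc-dhcp-fix.sh
After=network.target

[Service]
ExecStart=/usr/bin/isc-dhcp-fix.sh eth0
Restart=always

[Install]
WantedBy=multi-user.target

and enable it with sudo systemctl enable --now isc-dhcp-fix.service.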

calcut commented 4 weeks ago

Yes, it's the Debian 12 image provided by Hetzner, so I'm 99% sure there's no desktop. I haven't really done anything to it other than add IOTstack.

I've set up the script you provided - thanks for that. I'm currently also running with all of the docker containers taken down to see how that affects it.

If it goes 24 hours without issue I'll start the containers again.

Separately, I've started another server with IOTstack (running MING containers), but not built using PiBuilder. Will see how that compares.

calcut commented 3 weeks ago

The server is still alive (connected) after more than 24 hours running with all Docker containers down. I have put them back up to double-check that it breaks again. I've seen it lose connectivity with just Node-RED, Mosquitto and InfluxDB up, so narrowing it down to one of those may be the next step.

As for the other server (the non-pibuilder one) no problems so far, although I haven't configured much or sent any data through it.

Paraphraser commented 3 weeks ago

You've certainly got a puzzler here.

Other than the DHCP/link-local problem which the "allowinterfaces" patch fixed, I've never come across a networking problem which could be explained by whether or not a container (or a collection of containers) was running.

The "allowinterfaces" issue was confined to reboots. The trigger conditions were: containers running, those containers under either "unless-stopped" or "always" restart clauses, and then a reboot. In other words, downing the stack, rebooting, and upping the stack afterwards would avoid the problem. And, even when the trigger conditions were present, it was an intermittent fault which also seemed correlated with the number of containers, all of which suggested a race condition.

It always seemed to make a kind of sense to me that N containers, all looking for DHCP, not finding it (because no DHCP server was present on the internal network) and all falling back to link-local assignments (where, by definition, each IP stack has to ensure it hasn't randomly selected the same address as another) would lead to the kind of "flurry" of network activity which is darn near impossible to test for and, accordingly, might well expose a previously unknown race condition.

But I digress. To return to the second para above, I'm curious to know what containers you are running, whether any are running in host mode, and whether any have been given access to features such as network capabilities, privileged flags, cgroup rules, or anything else that might let the container get close to the "hardware".

And, naturally, I will have a continuing interest in whether you discover that PiBuilder has had some role in all of this. I can only test so far. I know PiBuilder works on Pis (specifically 4s and Zero2W; haven't purchased a 5 as yet), Debian native on Intel, Debian Proxmox guests on Intel, and both Bullseye and Bookworm (it did work on Buster but it's a while since I tested that). It also seemed to work on an Ubuntu Proxmox guest but that was more of an "on a whim in an idle moment" thing than any serious attempt to confirm it really did work properly.

Paraphraser commented 3 weeks ago

This discussion motivated me to figure out how to get Network Manager running on Debian systems where it was not installed by default.

I've only tried it on a Proxmox-VE guest. It's on my to-do list to see whether (a) it's needed and (b) the process works on Raspberry Pi OS "Lite" systems.

I have no idea whether it will be useful in your hosted environment but, on the off-chance it is, see this gist.
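
I won't repeat the gist here but, in rough outline (and with the usual caution about fiddling with networking on a box you can only reach over that same network), the moving parts on a plain Debian system are something like:

# install Network Manager (it isn't there on a no-desktop Debian install)
$ sudo apt update && sudo apt install network-manager

# Network Manager leaves alone any interface that is still declared in
# /etc/network/interfaces, so comment out the eth0 stanza (leave lo as-is)
$ sudo nano /etc/network/interfaces

# then bring Network Manager up and confirm it has taken over
$ sudo systemctl enable --now NetworkManager
$ nmcli device status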

calcut commented 3 weeks ago

Hmm, so it still hasn't broken after a couple of days of the containers running! The mystery deepens.

The only differences I can think of are:

  1. Running the script that you suggested.
  2. I've had port 1880 (Node-RED) open on the firewall. However, I'm not (intentionally) using it as I'm going via Nginx Proxy Manager.

I've closed the port again to see what happens, but it might be a complete red herring.

Paraphraser commented 3 weeks ago

If the script has been firing and keeping the interface up then there will be evidence in the log.

And if that's true then I reckon getting NetworkManager going will do the same thing, just far more cleanly.

Paraphraser commented 3 weeks ago

I also now know the answer to the question about Raspberry Pi OS and starting from the Lite image. Unlike Debian, where disabling the Desktop at installer time results in neither the Avahi daemon nor NetworkManager being installed, Raspberry Pi OS installs both.

calcut commented 3 weeks ago

I don't see anything in the log, but everything is still running. Can't explain why! Unless it was something on Hetzner's end that has since been fixed.

root@debian-4gb-dwt:/home/cc# journalctl --since "8 days ago" --grep "isc-dhcp-fix" -q --no-pager -o short
Jun 05 13:50:58 debian-4gb-dwt sudo[4272]:       cc : TTY=pts/0 ; PWD=/home/cc ; USER=root ; COMMAND=/usr/bin/vi /usr/bin/isc-dhcp-fix.sh
Jun 05 13:53:29 debian-4gb-dwt root[4287]: isc-dhcp-fix launched

calcut commented 3 weeks ago

I'll close this for now as I can no longer reproduce it. Thanks for your input!