Paraphraser / PiBuilder

Ideas for building a Raspberry Pi from "bare metal" to ready-to-run IOTstack
MIT License

Raspberry Pi Randomly Loses Network Connection #5

Closed BlueWings172 closed 2 years ago

BlueWings172 commented 2 years ago

Hello

My Raspberry Pi 4b randomly becomes inaccessible in LAN.

I posted this issue months ago on the IOTstack page and it was suggested that I use PiBuilder.

So I finally found the time to back up my data and give PiBuilder a try. The first few days were uneventful, but then the Pi started disappearing from the network and becoming inaccessible several times a day, which renders my setup completely useless. This happens regardless of whether the Pi is connected via WiFi or Ethernet. This is a real bummer because I've spent so much time trying to learn and implement new things. I'm using a brand new SD card.

I would appreciate any help.

Thanks

Paraphraser commented 2 years ago

Let's work backwards. I'm going to assume a Bullseye 64-bit installation using PiBuilder. Please let me know if your setup is different.

The first thing to check is:

$ tail -5 /etc/dhcpcd.conf
# patch needed for IOTstack - stops RPi freezing during boot.
# see https://github.com/SensorsIot/IOTstack/issues/219
# see https://github.com/SensorsIot/IOTstack/issues/253
allowinterfaces eth*,wlan*

What that does is tell the DHCP client running on the Pi to consider only Ethernet and WiFi interfaces as candidates for dynamic address allocation. Without it, all the veth interfaces that Docker sets up when you bring up your stack also try to participate in address allocation, which is something Docker is already handling itself. That can have knock-on effects at boot time (system stalls) or during network transients.
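
If you're curious how many virtual interfaces Docker has added, the standard iproute2 tools that ship with Raspberry Pi OS will list them; with a stack running you'll typically see one veth entry per container alongside eth0 and wlan0:

$ ip -o link show | awk -F': ' '{print $2}'   # list every interface by name
$ ip -o link show | grep -c veth              # count just the veth interfaces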

The second thing to check is:

$ grep "isc-dhcp-fix.sh" /etc/rc.local
/usr/bin/isc-dhcp-fix.sh eth0 wlan0 &

Notice the "eth0" and "wlan0" arguments. They reflect the fact that my Pi has both of those interfaces active, so both of them are being "kept alive" by the isc-dhcp-fix.sh script. It's important to realise that this line is set up by PiBuilder at install time. It isn't dynamic, so if, for example, Ethernet wasn't "there" at PiBuilder time but WiFi was, then that line will only have "wlan0" in it.

In other words, you should make sure that the interfaces provided as arguments on that line reflect what you actually need in practice.
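
A quick sanity check is to compare the physical interfaces the kernel actually sees against the arguments baked into that line:

$ ls /sys/class/net | grep -E '^(eth|wlan)'

If an interface shows up there but isn't on the rc.local line, edit the file (sudo nano /etc/rc.local) and add it.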

The third thing to check is whether isc-dhcp-fix.sh is firing. The easiest way to do that is:

$ grep -a "isc-dhcp-fix" /var/log/syslog

Remember that syslog rotates. For Buster and earlier that was every 24 hours. For Bullseye and later, it's every week (but there's a PiBuilder tutorial on how to put it back to 24 hours). The point is that silence from that grep can mean either that the script isn't installed or isn't being launched, or that it is running but has never needed to reset an interface.
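
If you want the search to cover the rotated copies as well as the live log, zgrep will read both the compressed and uncompressed rotations:

$ zgrep -a "isc-dhcp-fix" /var/log/syslog*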

To make sure it is there:

pi@sec-dev:~$ cat /usr/bin/isc-dhcp-fix.sh
#!/bin/bash

logger "isc-dhcp-fix launched"

while [ $# -gt 0 ] ; do
   for CARD in $@ ; do
      ifconfig "$CARD" | grep -Po '(?<=inet )[\d.]+' &> /dev/null
      if [ $? != 0 ]; then
         logger "isc-dhcp-fix resetting $CARD"
         ifconfig "$CARD" up
         sleep 5
      fi
      sleep 1
   done
   sleep 1
done

To make sure it can be run:

$ sudo /usr/bin/isc-dhcp-fix.sh eth0 wlan0

Unless you get an error (eg no execute permission) you'll probably get silence. Wait about 10 seconds and then hit control+C, then run the grep again. You should at least get the "launched" message.

After any reboot, I generally get this triple:

Jun 11 14:25:58 sec-dev root: isc-dhcp-fix launched
Jun 11 14:25:58 sec-dev root: isc-dhcp-fix resetting eth0
Jun 11 14:26:04 sec-dev root: isc-dhcp-fix resetting wlan0

Now I'll make a different assumption: that the interface arguments in /etc/rc.local are correct, that isc-dhcp-fix.sh is firing properly, and that a grep of the log produces a ton of "resetting" messages.

If that's the case then I'd start to divide and conquer. If I had Ethernet but WiFi was being problematic, I'd probably disable WiFi for a while and see if Ethernet was more robust.

If Ethernet was flaky then I'd change my Ethernet cable to see if it influenced the behaviour. Then I'd try a different switch port or perhaps a totally different switch.

If I came to suspect the Raspberry Pi Ethernet port itself, I'd invert the problem by enabling WiFi and disabling the Ethernet port. If the problem persisted I'd then start to worry about what else might be going on inside the Pi.


For the record, I've got four Pi 4s (one Buster, three Bullseye), all with 4GB RAM, all running from SSDs, all with both Ethernet and WiFi active, all built with PiBuilder. I do see occasional "resetting wlan0" messages (maybe one a week across all four machines) but, except at boot time or when there's an obvious explanation (like pulling out an Ethernet cable), it's very, very rare for me to see a "resetting eth0" message.

To make the claim in the previous paragraph more "evidence" than "anecdote", I just grabbed all the logs from all four machines. That's the last 7 days plus the current day on each machine: 32 log files in total. I found 24 hits, all from my test Pi, all were the "triples" characteristic of a reboot (example above). The test machine is something I reboot all the time so that makes sense. The other three I rarely reboot so the only reasonable inference is isc-dhcp-fix.sh isn't firing because Ethernet and WiFi aren't bouncing.

While n=4 doesn't really prove all that much, it does at least demonstrate that the combination of Pi 4 + PiBuilder doesn't always result in network interface problems.

The last point I'll make is that, if you're seeing "resetting" messages when you run grep, those are just evidence of the script sensing that an interface has gone down. They are written to the log with logger so that you can open the log and search for them, then look to see what else is writing messages into the log that might guide you to the underlying problem. You also get a timestamp so you can cross-correlate with other files in the log directory, or logs kept by other devices on your network.
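
One way to do that cross-correlation is to ask grep for a few lines of context around each "resetting" event and eyeball what else (kernel, dhcpcd, Docker) was being logged at the same moment:

$ grep -a -B5 -A5 "isc-dhcp-fix resetting" /var/log/syslog | less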

Hope something in all of the above helps you track down and nail this problem.

BlueWings172 commented 2 years ago

@Paraphraser

Your assumption is correct; I'm running a Bullseye 64-bit installation using PiBuilder.

1- the output of tail -5 /etc/dhcpcd.conf is identical to yours.

pi@raspberrypi:~ $ tail -5 /etc/dhcpcd.conf
# patch needed for IOTstack - stops RPi freezing during boot.
# see https://github.com/SensorsIot/IOTstack/issues/219
# see https://github.com/SensorsIot/IOTstack/issues/253
allowinterfaces eth*,wlan*

2- the output of grep "isc-dhcp-fix.sh" /etc/rc.local is identical to yours.

pi@raspberrypi:~ $ grep "isc-dhcp-fix.sh" /etc/rc.local
/usr/bin/isc-dhcp-fix.sh eth0 wlan0 &

3- The line grep -a "isc-dhcp-fix" /var/log/syslog just flooded my screen with events. I actually had to redirect the output to a text file to be able to view it. On some days there is an event exactly every 8 seconds (10,742 per day). On other days, when the Pi spent much of the time frozen, there are far fewer events. Exactly 99.87% of these events are for eth0 and the rest for wlan0 (maybe because at the beginning of the week I was using WiFi only?). That said, when the Pi is inaccessible, neither interface responds to ping, SSH, or browser access. Here are 2 samples:

Jun  5 00:00:06 raspberrypi root: isc-dhcp-fix resetting eth0
Jun  5 00:00:14 raspberrypi root: isc-dhcp-fix resetting eth0
Jun  5 00:00:22 raspberrypi root: isc-dhcp-fix resetting eth0
Jun  9 10:38:25 raspberrypi root: isc-dhcp-fix launched
Jun  9 10:38:25 raspberrypi root: isc-dhcp-fix resetting eth0
Jun  9 10:38:31 raspberrypi root: isc-dhcp-fix resetting wlan0
Jun  9 11:37:49 raspberrypi root: isc-dhcp-fix launched
Jun  9 11:37:49 raspberrypi root: isc-dhcp-fix resetting eth0

Note: out of the 40k lines, only 0.07% said 'launched' and the rest were 'resetting'.

4- Obviously isc-dhcp-fix.sh is installed and works; cat /usr/bin/isc-dhcp-fix.sh outputs the same content you included.

5- The line sudo /usr/bin/isc-dhcp-fix.sh eth0 wlan0 did not output anything. Running the grep command again did indeed show the "launched" event.

6- After reboot and running the grep command, I did get the below lines as expected:

Jun 11 17:39:14 raspberrypi root: isc-dhcp-fix launched
Jun 11 17:39:15 raspberrypi root: isc-dhcp-fix resetting eth0
Jun 11 17:39:21 raspberrypi root: isc-dhcp-fix resetting wlan0

Your assumptions are correct regarding rc.local, isc-dhcp-fix.sh, and grep. When I started experiencing issues with WiFi, I connected the Ethernet cable. That did not help at all; both interfaces get disconnected at the same time. The Ethernet cable is definitely working and of good quality, as I'm using it for another machine.

I would buy another Pi if they were available. You have a better chance of winning the lottery than of finding a reasonably priced Pi now.

As I mentioned, I had previously had issues with IOTstack and WiFi. I was hoping the combination of Ethernet, Bullseye and PiBuilder would eliminate the problem, but this has turned out to be a huge waste of time on unreliable toys.

I really appreciate Windows now.

Thanks for all the effort and time you spent helping.

Paraphraser commented 2 years ago

I've been chasing a problem with a Raspberry Pi 3B+ for some time. I had a collection of notes which I've just turned into this gist. It might turn out to be relevant to your situation.

BlueWings172 commented 2 years ago

Thanks for all the valuable info. My power supply is rated at 3A. I have tried several phone chargers, USB adapters and power strips with USB outlets: nine different options in total to power the Pi, all of which should provide 2.4A or more, but the Pi experienced these network freezes (both Ethernet and WiFi) with all nine of them. I have also been monitoring the Pi's power consumption using one of these.

The highest draw I've seen is 1.5A but I noticed that with some phone adapters, the Pi doesn't get more than 0.9A. That said, my tests were not long enough to be conclusive.

I have ordered a new 3A power supply which should be higher quality than the one I have. It should arrive tomorrow so I will test and report back.

Thanks again brother.

Paraphraser commented 2 years ago

I'm a bit worried that you might have missed the point of the gist and, accordingly, be disappointed by your new power supply.

At the risk of telling you things you already know - aka "teaching my grandmother to suck eggs" - and apologising in advance if that's what I'm doing...

Think of it like this:

  1. Mains socket.
  2. Wall wart.
  3. USB power cable.
  4. USB socket on Pi circuit board.
  5. Circuit-board trace to on-board regulator-in.
  6. On-board regulator-out to 1.8V, 3.3V and 5.0V power rails.
  7. ARM SoC/CPU.

I'm not sure that 5…7 are correct but you get the general idea.

I've taken two measurements. The first is by disconnecting the power cable where it plugs into 4 and connecting it to a controlled load. This is the "can the wall wart deliver to specification?" test. All supplies I have pass this test.

The second measurement is by inserting an inline monitor between 3 and 4, in the same way you have. This is the "what is the Pi actually drawing from the wall wart as it operates?" test. The only real difference between our approaches is mine also logs its observations over Bluetooth so I can capture the data and graph it.

The on-board regulator puts out 1.8V, 3.3V and 5V. The 3.3V and 5V rails appear at header pins and I've watched those using an oscilloscope. Nothing particularly revealing. Both voltages are quite stable even when the Pi is under heavy compute load.

I haven't tried attaching sensors or other forms of load across the header pins.

That leaves the 1.8V power rail, and I assume (though I don't know for certain) that the ARM "compute guts" run on 1.8V. I'm basing this guess on the numbers in voltage reports from vcgencmd always being less than 1.8.

Every time I've been lucky enough to catch a "currently under-voltage" condition from vcgencmd_power_report, the "core" measurement on the Pi 3B+ has been around 1.2V. In normal operation, it seems to be 1.3V on the Pi 3B+.

The Pi 4Bs all seem to have a much lower "core" around 0.85V.

It's that drop to 1.2V on the 3B+ which seems to be the trigger and it suggests (at least to me) that either the on-board regulator is flaky and incapable of sustaining the necessary Watts at 1.8V when the system is under load, or the measurement circuitry is flaky and giving false readings which are triggering currently under-voltage.

All other things being equal, if the measurement circuitry was flaky (so these were false positives) I would not expect that to have a consequential impact on things like network interface ports. The fact that I do see wonky network interface behaviour makes me think it's more likely that the on-board regulator is the culprit. None of this is proof, of course. It may be that some part of the IP stack also watches the currently under-voltage or currently throttled conditions and drops the interfaces to conserve power.

The material point, however, is that it really wouldn't matter if I handed the Pi 3B+ a power supply capable of sustaining 100 Amps at 5.1V. The input side of the on-board regulator still wouldn't draw more than 1 amp while the 1.8V output side still would be incapable of delivering what the Pi 3B+ ARM SoC needed to do its work.

In short, if your Pi 4B has a similar "wonky" on-board regulator then a new PSU might not achieve much.


My question to you is, have you run vcgencmd_power_report to see whether the Pi is moaning about power problems?

If it is then my guess is you'll eventually have to reach the same conclusion about the Pi 4B that I've reached about my 3B+ : useless for anything practical, need a new one, shame about the chip shortage.

If it isn't then we could be barking up the wrong tree entirely. You said you rebuilt with PiBuilder. I've got a collection of 4Bs, all built with PiBuilder, which are rock solid interface-wise, and there's also a general absence of other reports of problems similar to yours. I have no idea how many people use PiBuilder. The only indication of any wider interest is 11 forks and 26 stars, and that might not be enough to generalise from reliably.

If we assume it's not power and not PiBuilder then the next cab off the rank would be the network interfaces themselves. I'd be considering a USB dongle or hub that includes an Ethernet interface. Disabling the built-in interfaces in favour of a dongle/hub interface would let you rule the built-in hardware in or out.


A couple of other thoughts. I spent a good portion of my career wearing a "comms guru" hat. My knowledge is a bit out-of-date but I learned enough to trust absolutely nothing when it comes to comms. I don't trust Ethernet cables and I don't trust switches or switch ports. If at all possible, vary everything you can to see if the problem follows the cable/port/switch.

I'd also be looking for any unexplained oddities like "I have a gigabit switch but the Pi is only running at 100BaseT - why?" Running sudo ethtool eth0 will show what the link has negotiated.
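
For example (assuming ethtool is installed; if not, sudo apt install ethtool first):

$ sudo ethtool eth0 | grep -E "Speed|Duplex|Link detected"

On a gigabit switch you'd expect Speed: 1000Mb/s, Duplex: Full and Link detected: yes; anything else is worth chasing.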

Another thing to do is to make sure you aren't a victim of a duplicate MAC address. These are rare but far from unknown and they play merry hell with everything.

I view the modern fad of random WiFi MACs with deep misgivings. I turn that off as soon as I see it. Seems to me the designers of this nonsense should at least have considered that "home" and "work" networks are places where you don't need this to be enabled. The only time it makes sense is when you're out of range of your normal networks. Still, that's just me. I'm no longer paid the Big Bucks to foist this nonsense on everyone else.

Something like this should get the job done:

  1. Note the MAC address of the Pi's Ethernet interface.
  2. Shutdown the Pi or disconnect it from Ethernet.
  3. On another device, run:

    $ sudo tcpdump ether host «MAC»

    where «MAC» is the MAC address of the Pi's Ethernet interface. Or use WireShark instead of tcpdump.

If something else has the same MAC, it will eventually broadcast and you'll see it.
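
For step 1 above, a quick way to grab the Pi's own Ethernet MAC before you shut it down is:

$ ip link show eth0 | awk '/ether/ {print $2}'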

BlueWings172 commented 2 years ago

Hi Again,

Now that I've had some time to conduct some tests, here is the gist:

1- The new power supply is similar to the one I already have (model DSM-0530) and it is absolute garbage. It seems to be a common model, so consider this a warning for anyone thinking about buying it. You're better off running your Pi on hamster power.

2- After I figured out how to export power usage data to Excel from my UM25C (it doesn't work on Windows or Android but worked on my iPad), I ran tests consisting of a reboot followed by 15 minutes of stress testing, then continued collecting data for several hours. The 3 scripts you provided were instrumental in showing immediately when the Pi starts struggling. I also collected CPU temperature, to test the theory that a more capable power supply would allow the CPU to reach higher temperatures during the stress test.

3- I chose 3 of the USB power sources I have which I felt were more capable than the others. I ran the test several times over several days and the results were consistent.

I used the below lines to install and run the stress test:

sudo apt-get install stress
while true; do vcgencmd measure_clock arm; vcgencmd measure_temp; sleep 10; done &
stress -c 4 -t 900s

4- Results:


PSU | DSM-0530 | Samsung USB C Model EP-TA800 | Power Strip
-- | -- | -- | --
Notes | Unofficial Raspberry Pi Charger | Phone Charger | Power Strip with USB ports
Amp Specs | 3A | 3A | 3A
Charging Mode as Shown in UM25C | "Unknown" | "Unknown" | Apple 2.1
Undervoltage Warning in Desktop | Yes | Yes | No
Reporting Undervoltage Now | Yes | Yes | No during test, but occasionally and briefly during normal operation
Highest Amp (A) | 0.94 | 1.28 | 1.24
Highest Temp Under Stress (°C) | 48.7 | 51.1 | 62.3
Max Voltage (V) | 4.95 | 4.93 | 5.164
Max Watts (W) | 4.518 | 6.27 | 6.43

a. The PSU that is sold as a "3A Raspberry Pi power supply" is the worst of all the candidates. It couldn't even deliver above 1A or 5V under stress. As a matter of fact, the CPU was only 6 to 10 degrees warmer under stress than at idle.

b. Max Amps may not necessarily be the best indicator of the quality and performance of a PSU, as there can be spikes of high current while overall performance is still inferior. This is apparent in the case of the Samsung phone charger, which delivered slightly higher Amps but still caused the Pi to freeze several times per day!

c. Max temperature under stress could be a better way to compare PSU performance on the same Pi.

d. My power strip's USB port has been by far the most stable and I haven't had to reboot for 5 days now. That said, the power strip has 2 ports and I discovered that they do not provide the same power levels! The first, which I initially used, showed a Charging Mode of "Apple 1.5" so I presume it is limited to 1.5A, and it did cause the network interfaces to become unresponsive. The second, which I'm using now, shows "Apple 2.1A" and has been fine for the past 5 days despite the occasional throttling and under-voltage warnings from watch_under_voltage.sh and vcgencmd_power_report.sh.

e. From the output of the 3 scripts and the measurements, it is evident that the Pi is not getting enough juice. Are all nine of the power sources rated 2.4A and above that I used previously incorrectly labelled? On the other hand, some clearly scored better than others, so it could be a combination of factors. You mentioned that the Pi can have faulty circuitry or regulators. Months ago, when I tried to hook a monitor up to my Pi, I noticed the micro-HDMI ports were loose and I had to hold the cable at an angle the whole time. I would buy another Pi but they sell for over $120 now!

Next Steps:

I have an NVMe SSD and an enclosure that have been sitting around for a while and I need to start using them, especially now that I'm using Home Assistant. I understand that SSDs require a powered hub, so I'll have to get one of those. I wonder what you or others are using? Would a powered hub be able to power both the Pi and its peripherals? I'll most definitely get a separate power supply for the Pi. I'm considering either the official Raspberry Pi power supply or this one which claims 4A. If those aren't enough for the Pi, then my last resort would be a diesel generator.

I'm getting a new switch and cables for other reasons, so if they are a cause we will find out, though other devices on them are running fine. Regarding duplicate MACs, I keep track of all devices that connect to Ethernet and WiFi along with their MAC addresses, and all devices are assigned static IPs based on their MAC.

Thanks for all the help and the scripts. I think they should be part of PiBuilder.

Paraphraser commented 2 years ago

I am using a 500GB Samsung T5 on each of my Pi 4s, connected directly to a USB-3 port on the Pi. No problem whatsoever powering the SSDs from the Pi. One of the Pi 4s also has a Zigbee2MQTT adapter plugged straight into a USB-2 port, so the Pi is clearly capable of powering both external devices.

That said, back when I was still trying to get the 3B+ to work running OctoPrint, I did wonder whether there was power drain from the Pi to the 3D printer. Logic said it should not be the case but I did go to the trouble of getting a powered hub to eliminate the possibility. It made zero difference. I kept the powered hub when I gave up on the 3B+ and put a 4 in its place. To be clear, it's the Pi 4, one USB-3 port going to the Samsung SSD, the other USB-3 port going to the USB-3 hub, and the (USB-2) printer is plugged into the hub. It's over-kill but the whole thing sits on a tripod so the Pi's camera can watch the printer and it was easier to just go on using the hub than trying to re-jig it all.

To summarise, a USB-3 hub will probably do no harm, will probably work, but might not solve the voltage problems.


I could be entirely wrong about this but my impression is that Amps and Watts have very little to do with anything, and that it's all about voltage. A PSU that can maintain 5.1 (or thereabouts) under load is going to do better than something that drops below 5 as soon as a fly lands on the chassis.

But, more importantly, I think it's about what happens once those 5 volts from the PSU reach the circuit board and are regulated down to 3.3, 1.8 and 1.1. On the 4s, I think it's either that 1.8 or 1.1 which is being measured by vcgencmd measure_volts core.
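
If you want to watch that regulated core voltage while the Pi is working hard, a minimal sketch using only stock vcgencmd subcommands is to log it next to the throttled flags every few seconds (run it in one terminal while the stress test runs in another):

while true; do
   echo "$(date '+%T') $(vcgencmd measure_volts core) $(vcgencmd get_throttled)"
   sleep 5
done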

In any event, these are the results of running the power report on my collection of 4Bs and a Zero W2:

$ for H in $ALLPIS ; do ssh $H 'host_summation.sh ; vcgencmd_power_report.sh' ; done 
Raspberry Pi 4 Model B Rev 1.1 running Raspbian GNU/Linux 10 (buster) as 32-bit OS with 64-bit kernel
vcgencmd get_throttled (0x0)
vcgencmd measure_volts:
     core volt=0.8438V
  sdram_c volt=1.1000V
  sdram_i volt=1.1000V
  sdram_p volt=1.1000V
Temperature: temp=47.2'C
Raspberry Pi 4 Model B Rev 1.1 running Debian GNU/Linux 11 (bullseye) as full 64-bit OS
vcgencmd get_throttled (0x0)
vcgencmd measure_volts:
     core volt=0.8768V
  sdram_c volt=1.1000V
  sdram_i volt=1.1000V
  sdram_p volt=1.1000V
Temperature: temp=46.7'C
Raspberry Pi Zero 2 W Rev 1.0 running Debian GNU/Linux 11 (bullseye) as full 64-bit OS
vcgencmd get_throttled (0x0)
vcgencmd measure_volts:
     core volt=1.2438V
  sdram_c volt=1.2000V
  sdram_i volt=1.2000V
  sdram_p volt=1.2250V
Temperature: temp=34.9'C
Raspberry Pi 4 Model B Rev 1.4 running Debian GNU/Linux 11 (bullseye) as full 64-bit OS
vcgencmd get_throttled (0x0)
vcgencmd measure_volts:
     core volt=0.9160V
  sdram_c volt=1.1000V
  sdram_i volt=1.1000V
  sdram_p volt=1.1000V
Temperature: temp=39.9'C
Raspberry Pi 4 Model B Rev 1.2 running Debian GNU/Linux 11 (bullseye) as full 64-bit OS
vcgencmd get_throttled (0x0)
vcgencmd measure_volts:
     core volt=0.8500V
  sdram_c volt=1.1000V
  sdram_i volt=1.1000V
  sdram_p volt=1.1000V
Temperature: temp=48.2'C

These all have "official" Raspberry-Pi-branded power supplies. If you compare/contrast those results with the gist metrics from the 3B+, you'll see that the 3B+ "core" threshold is somewhere below 1.3 while the 4s are happy with 0.8438. How much lower it would need to go before a 4 started complaining is unknown because I've never seen it. The Zero W2 (which boots from SD, not SSD) obviously needs a bit more internal voltage than a 4 but a bit less than a 3B+. I've only got the one Zero W2 so I can't make any judgement about whether this is typical.

I think I mentioned in the gist that I have a cron job running on each machine which will fire off an MQTT message if the throttled condition is sensed. That, in turn, will trigger an email. I'll definitely know if it ever happens.
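
For what it's worth, a minimal sketch of that kind of check is below; it is not the actual script from the gist, and the broker host and topic are placeholders (mosquitto_pub comes from the mosquitto-clients package):

#!/usr/bin/env bash
# Decode vcgencmd get_throttled. Bit 0 = under-voltage right now, bit 2 = throttled
# right now; bits 16 and 18 are the same conditions having occurred since boot.
T=$(vcgencmd get_throttled | cut -d= -f2)
if (( T != 0 )); then
   mosquitto_pub -h your-broker.local -t "home/$(hostname)/throttled" -m "$T"
fi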

Anyway, I keep coming back to the theory that it's all about the voltage on the output side of the onboard regulator and that, in turn, has more dependence on the input voltage than the input Watts - ie it can't make up any shortfall in voltage by drawing more current.

Still, I'm a programmer, not an electronics engineer. I have neither the knowledge nor the diagnostic tools to test any of these theories directly. This is all speculation. It could be a truck load of bovine dung.

BlueWings172 commented 2 years ago

I went back to my Excel data collected from the UM25C and found that all the power sources that caused the Pi to freeze frequently never provided 5.1V or even 5.0V; at best 4.9x. The power strip USB port that's keeping the Pi more stable now does in fact provide a constant 5.1V. Does 0.2V make that much of a difference? I don't know.

I understand that when you mentioned voltage you meant specifically the voltage the regulator outputs, not the voltage it receives. But there's not much I can do about that. I hope the Raspberry Pi Foundation announces a Pi 5 with better stability, but that seems unlikely in the short term.

I loved the MQTT script. I have it running in cron every minute. How do I get it to report only when under-voltage is occurring right now, not just historically?

I modified it by adding the line below to send me a Telegram message, but I had to remove it since I was getting a message every minute. My log file is empty though. Is this normal? I don't think it rotates.

curl -s -o /dev/null -d chat_id=CHAT_ID_HERE -d text="Pi needs juice!" -X POST https://api.telegram.org/botxxx:myapihere/sendMessage

Since the Pi has been stable for the past few days, I'm gonna try to use my SSD before Home Assistant murders my new SD card.