SensorsIot / IOTstack

Docker stack for getting started on IOT on the Raspberry PI
GNU General Public License v3.0
1.46k stars 307 forks source link

No Wifi after Backup #405

Closed BlueWings172 closed 2 years ago

BlueWings172 commented 3 years ago

Hello

I'm new to Linux and I have installed the IOTstack on my RPi 4b and it's been working for a couple of months without issues.

Lately I have noticed that sometimes I cannot SSH to my RPi and narrowed down the issue to loss of wifi connectivity following the execution of the IOTstack backup script. I have cron job that runs the script everyday at 7:30 and it seems that when the containers are started again, wifi stops working. I checked /var/log/syslog and found these lines:

    Sep 10 07:33:46 raspberrypi avahi-daemon[362]: Interface wlan0.IPv4 no longer relevant for mDNS.

    Sep 10 07:33:46 raspberrypi avahi-daemon[362]: Leaving mDNS multicast group on interface wlan0.IPv4 with address 192.168.100.50.

    Sep 10 07:33:46 raspberrypi systemd[1]: dhcpcd.service: **Failed with result 'signal**'.

    Sep 10 07:33:46 raspberrypi avahi-daemon[362]: Withdrawing address record for fe80::f847:9cf7:a52c:710d on wlan0.

Sep 10 07:33:46 raspberrypi avahi-daemon[362]: Withdrawing address record for 192.168.100.50 on wlan0.

Note that I have set a static IP for wlan0 in dhcp.conf (192.168.100.50) and I have previously turned off power management on wlan0 and verified that with iwconfig

This problem does not occur everyday but when it does, it happens right after the backup script is ran which in turn restarts the containers.

I don't know what to do and I would appreciate any help.

thanks

Paraphraser commented 3 years ago

Do you have the recommended system patches in place? From the problem you report, my guess is that you might not. If this guess is correct, the underlying problem seems to be that DHCP can't really deal with a whole bunch of containers starting up. The first patch tells DHCP to only concern itself with physical interfaces like Ethernet and WiFi and steer clear of the virtual interfaces that Docker sets up.

You might find a bit more background to the problem in Issue 219. Note that the recommended patch has moved on since the version proposed there - it now deals with the situation where you might create additional physical interfaces such as with USB-to-Ethernet dongles.

Another resource you might find useful is this gist. I run the isc-dhcp-fix.sh script on all my Pis and, while it mostly only fires at reboot, it occasionally fires for no apparent reason, meaning that it has just recovered from some unexplained problem that would have taken the interface down if the script wasn't there to bring it up again.

Finally, all of the foregoing (and a lot more besides) is part of PiBuilder. One component of that "and more besides" includes clobbering IPv6 which caused me no end of grief until I started stomping on it wherever I saw it.

and, yes, I realise that's head-in-the-sand and that the day will come when my ISP says "IPv6 or nothing" and I'll have to live with it but, for now, kicking it to the kerb results in a much quieter life.

Hope this helps.

BlueWings172 commented 3 years ago

You are correct in assuming that I haven't ran the recommended patches. Thanks for pointing that out.

1- I ran the first patch which added allowinterfaces eth*,wlan*at the end of the dhcpcd.conf file. The second seems unrelated but ran it anyway.

2- I also went through the steps in gist though I didn't get "isc-dhcp-fix launched" when i ran grep "isc-dhcp-fix" /var/log/syslog. I got Binary file /var/log/syslog matches instead.

From this message, it does not seem that the script is working they way it should. I followed the steps 3 times and i'm always getting Binary file /var/log/syslog matches

Would you have an idea what might be the issue ?

I also looked at the other links. PiBuilder looks very interesting. Will give it a go once i get my SSD.

Thanks for your help

Paraphraser commented 3 years ago

See this Discord discussion about the "binary file matches" problem.

To summarise:

Paraphraser commented 3 years ago

969D5307-FB09-4323-9927-745E5D44F413

The first three are Raspberry Pi 4s (4GB with external USB3 SSDs), the last is a 1GB 3B+ with SD. My guess is that if you run the file command on your Pi, it'll come back with something other than "text". But if file says "text" while grep says "binary" then we're back to 🤷‍♂️

BlueWings172 commented 3 years ago

Hey man,

you're spot on! the log rotated after midnight and it's working now perfectly. I've added the -a flag and now I can see the log as soon as I SSH to the RPi (i've added the grep command to .profile). I do get so many events on eth0 and very few on wlan0. Basically there is an event every 63 seconds. Is this is because my Ethernet port is not connected? The events on wlan0 show that this command is doing what is supposed to.

Thanks for all your help!

Paraphraser commented 3 years ago

If you don't have Ethernet connected, comment-out that line:

1BB534C5-E48A-4A72-B847-CD41C89E4D63

Leave both sleep calls in place.

BlueWings172 commented 3 years ago

I had an issue with wifi today around 17:45 and when I restarted the Rpi and checked for events, I found many events for eth0 (which is expected because it's not connected) and none for wlan0. Very strange !

it gets better, when I tried to open my /var/log/syslog it was very slow and that was because file size was 207 megabytes and contained over 2.2 million lines! Most of the lines are like the below and get repeated dozens of times each second.

Sep 12 19:49:07 raspberrypi dhclient[21294]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 10
Sep 12 19:49:07 raspberrypi dhclient[9786]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 12
Sep 12 19:49:07 raspberrypi dhclient[18507]: No DHCPOFFERS received.
Sep 12 19:49:07 raspberrypi dhclient[18507]: No working leases in persistent database - sleeping.

I have commented the line Card eth0 and now its says Resetting wlan instead of Resetting eth0.

I looked into my syslog and it is bizarre. the log records up to the moment I restarted the RPi but when it resumes logging, it does so from the past (I restarted it around 18:04 but it began logging with timestamps from 17:45 so 19 min in the past)

. Also while the reboot occurred at 18:04, the lines I could find about a reboot happening have the timestamp of 17:45. There could be a restart at 17:45 that I'm not aware of but why there no mentioning of a reboot when I know it happened at 18:04.

All I am trying to find out is what happened right before wlan0 froze.

I have included my syslog for few minutes around when wlan0 was disconnected and removed all lines similar to the above.

syslog.txt

I understand if this becoming beyond the scope of this thread or forum so if it is the case, I would appreciate it if you could just point me to right direction and I'll take it from there.

Thanks

Paraphraser commented 3 years ago

I could be wrong about this because it might be based on a misreading of the evidence but I have noticed similar strange things with log timestamps and reboots. My working theory is that the Pi's hardware clock is unreliable until NTP gets going. The attachment is from two reboots of a test Pi4. The first grep looks for when NTP syncs and shows a couple of lines around each event. To me, these imply system time is lagging real time until NTP drags it forward. The second grep shows mentions of a "fake" hardware clock. My suspicion is that it goes something like this: periodically, current time is saved to a file. On reboot, this file is used to seed the fake clock. This would account for time appearing to jump back slightly.

913D0177-A962-4590-94F7-268B0A695430

At one point in my journey (of trying to get a handle on WiFi interface freezes) I set up a cron job to copy "yesterday's" system log after the log rolled at midnight - to another machine (a Mac) where I had a better chance of analysing. Bottom line is it turned out to be very difficult to tell "before" and "after" during a reboot cycle. That "EVENT MARK" in the screen shot is part of this. I have aliases for REBOOT and SHUTDOWN which write those log entries before doing the deed, plus markers in /etc/rc.local and .profile.

My conclusion: Raspberry Pi (or, perhaps, Raspbian) WiFi is simply less reliable than Ethernet. All my Pi4s are Ethernet connected and my DNS / SSH configs all point to the Ethernet interface with WiFi as a second option. There is a lot of "stuff" on the web suggesting anything connected to USB can interfere with WiFi. All my Pi4s have external USB3 SSDs sitting right underneath the Pis (which are in aluminium air-cooled cases) so, if that really is a likely cause, they are all candidates.

I do have a 3B+ in a Perspex case which is WiFi-only running OctoPrint running from SD. The USB2 cable running to the 3D printer is thick, long, and well shielded (not chosen for either thickness or shielding - just happened to be the right length). I don't have any trouble with the WiFi going walkabout on that Pi.

But as to whether there is any real substance to "USB interference" as a problem for WiFi - 🤷‍♂️

I can say that, early on, I kept a Zigbee adapter at arm's length from a Pi4 based on this "interference" theory. Seemed to me that it was both USB and radio, so caution was advised. More recently, I've just been keeping the Zigbee adapter directly connected and, although the Pi4 is Ethernet, WiFi is still there and isc-dhcp-fix.sh is still probing it, but it doesn't "fire" any more frequently for the Zigbee adapter being right in WiFi's grille, so to speak. We're back to 🤷‍♂️

Paraphraser commented 3 years ago

By the way, I find it had to designate anything as "off topic" when it comes to IOTstack. As Leonardo da Vinci once said, "realize that everything connects to everything else.” I don't think it matters if it's a hardware issue, a Raspbian issue, a comms issue, or anything else. If it affects some part of IOTstack, I reckon it's fair game. What affects one person has got a reasonable chance of affecting everyone at some point…


You inspired me to do some more experimentation with the system log. My starting position was a RPi4, shutdown, powered off, left for about 45 minutes, then powered-on again.

There's nothing magical in the ~45 minutes - just the time it took to do something else.

Then I grabbed the system log, discarded everything before the shutdown command, and stuck line numbers on the start of each line.

So, here we go:

  1. shutdown command issued (the "1" before "Sep" is the line number):

     1  Sep 13 10:47:51 new-dev pi: EVENT MARK - sudo shutdown -h now
  2. System goes down:

     2  Sep 13 10:47:51 new-dev systemd[1]: Unmounting RPC Pipe File System...
     3  Sep 13 10:47:51 new-dev systemd[1]: Stopping Session 1 of user pi.
     4  Sep 13 10:47:51 new-dev systemd[1]: Condition check resulted in Turns off Raspberry Pi display backlight on shutdown/reboot being skipped.
     5  Sep 13 10:47:51 new-dev systemd[1]: systemd-rfkill.socket: Succeeded.
  3. I think line 5 is the final message from the system going down, with line 6 the first message from the new system coming up about three quarters of an hour later:

     6  Sep 13 10:47:57 new-dev systemd-modules-load[118]: Inserted module 'i2c_dev'
     7  Sep 13 10:47:57 new-dev fake-hwclock[122]: Mon 13 Sep 2021 12:47:53 AM UTC
     8  Sep 13 10:47:57 new-dev systemd-fsck[140]: e2fsck 1.44.5 (15-Dec-2018)
     9  Sep 13 10:47:57 new-dev systemd[1]: Started udev Coldplug all Devices.
    10  Sep 13 10:47:57 new-dev systemd[1]: Starting Helper to synchronize boot up for ifupdown...
    11  Sep 13 10:47:57 new-dev systemd[1]: Started Helper to synchronize boot up for ifupdown.
    12  Sep 13 10:47:57 new-dev systemd-fsck[140]: rootfs: clean, 261521/30038400 files, 3807916/122030086 blocks
    13  Sep 13 10:47:57 new-dev systemd[1]: Started File System Check on Root Device.
  4. Now we skip a huge number of lines while the system comes up.

  5. After power-on, I was sending "pings" to the Pi. As soon as the Pi started responding, I connected via SSH. The following is actually a very good way of getting into a Pi as soon as possible during a reboot:

    $ ping -o -q new-dev ; ssh new-dev

    where "new-dev" is the name of the RPi. The ping just sits there trying every second until it gets a reply. As soon as that happens, the ping terminates and the ssh runs.

  6. The system time is still out of whack because of the time it was powered off. Line 800 is from my .profile running.

    797  Sep 13 10:48:12 new-dev systemd[2147]: Startup finished in 171ms.
    798  Sep 13 10:48:12 new-dev systemd[1]: Started User Manager for UID 1000.
    799  Sep 13 10:48:12 new-dev systemd[1]: Started Session 1 of user pi.
    800  Sep 13 10:48:13 new-dev pi: EVENT MARK - login - 203.0.132.201 53074 22
    801  Sep 13 10:48:26 new-dev systemd[1]: systemd-fsckd.service: Succeeded.
  7. Finally, NTP figures out the current time:

    802  Sep 13 11:35:27 new-dev systemd-timesyncd[317]: Synchronized to time server for the first time 203.57.115.181:123 (2.au.pool.ntp.org).
    803  Sep 13 11:35:29 new-dev colord[2126]: failed to get session [pid 363]: No data available
    804  Sep 13 11:35:33 new-dev systemd[2147]: run-docker-runtime\x2drunc-moby-5994fa4eab8a71656239761cf5d28ec179ce3aeca43679d090aacbb8dafd45e7-runc.fq0alS.mount: Succeeded.
  8. After this, system time agrees with wallclock and advances properly again.

The gap between lines 5 and 6 is pretty interesting. There would seem to be no point in /etc/fake-hwclock.data being written with a value of other than system time at the moment it was written so I think it is safe to assume that:

On power up, any number of things might happen before line 6 that aren't associated with a log entry but, because the line 6 timestamp is 10:47:57, it seems reasonable to suppose that /etc/fake-hwclock.data was read 4 seconds prior at 10:47:53. Keep in mind that, on the way up, system time was some 45 minutes behind wallclock time. The actual time difference between lines 801 and 802 is 47 minutes and 1 second so, if that was the case throughout boot-up, 10:47:53 log-time was 11:34:54 wallclock time.

From all this it seems reasonable to generalise some rules:

  1. The Raspberry Pi behaves as if no time has elapsed between the moment when /etc/fake-hwclock.data is written as the system goes down, and that file is read as the system comes up.
  2. Time will be out of joint until NTP does its thing.
  3. The log is already a bit short on detail when it comes to figuring out things like why WiFi might be wonky, and the "fake hardware clock" issue makes it harder than it should to figure out what might have happened when.
BlueWings172 commented 3 years ago

that's interesting and helpful. In my case, the time reported on line 6 is 19 minutes before line 5.

Based on your post I think what happened is my entire system froze and wifi became irresponsive and RPi no longer updates the time in /etc/fake-hwclock.data and then I restarted it 19 minutes later and it picked up the time from /etc/fake-hwclock.data and started logging from that point until system time was synced with NTP.

So my issue may not be just wifi. I will keep an eye on it and see what happens. I will also switch my ethernet after few days. There seem to be some RTC modules that can be added to a RPi and that are not too expensive. I might play with one of those and see if they can keep the time line in order.

thanks again for all the info and help. Using a RPi has not been as simple as it seemed on the YouTube videos and made me at times appreciate windows more but I have learned a ton trying to find solutions and look for answers and made me enjoy what i'm doing. Reminded me of my Win 3.11 and DOS days.

BlueWings172 commented 3 years ago

So, I'm back again with another dilemma.

WIFI froze today right after containers restart (I have a cron job that runs the IOTstack backup script at 7:30 am every day).

restarted it at 1:37 PM and found only a reset event at 1:17 PM while the system did not have wifi from 7:30 am till 1:37 PM so not sure why there are no resets attempts during all these 6 hours.

But that's not the bizarre thing though.

Looked at my syslog and found the below lines with wlan is mentioned starting after containers restart (look at lines in bold ):

Sep 14 07:35:51 raspberrypi dhclient[612]: receive_packet failed on wlan0: Network is down
Sep 14 07:35:51 raspberrypi avahi-daemon[365]: Interface wlan0.IPv6 no longer relevant for mDNS.
Sep 14 07:35:51 raspberrypi avahi-daemon[365]: Leaving mDNS multicast group on interface wlan0.IPv6 with address fe80::f767:9ad7:a52c:710d. Sep 14 07:35:51 raspberrypi avahi-daemon[365]: Interface wlan0.IPv4 no longer relevant for mDNS.
Sep 14 07:35:51 raspberrypi avahi-daemon[365]: Leaving mDNS multicast group on interface wlan0.IPv4 with address 192.168.100.10.
Sep 14 07:35:51 raspberrypi avahi-daemon[365]: Withdrawing address record for fe80::f767:9ad7:a52c:710d on wlan0.
Sep 14 07:35:51 raspberrypi avahi-daemon[365]: Withdrawing address record for 192.168.100.50 on wlan0.
Sep 14 07:35:51 raspberrypi avahi-daemon[365]: Withdrawing address record for 192.168.100.10 on wlan0.

Sep 14 13:17:06 raspberrypi dhcpcd-run-hooks[496]: wlan0: starting wpa_supplicant
Sep 14 13:17:07 raspberrypi dhcpcd[429]: wlan0: connected to Access Point Sep 14 13:17:07 raspberrypi dhcpcd[429]: wlan0: waiting for carrier Sep 14 13:17:07 raspberrypi root: isc-dhcp-fix resetting wlan0
Sep 14 13:17:08 raspberrypi dhclient[635]: DHCPREQUEST for 192.168.100.10 on wlan0 to 255.255.255.255 port 67
Sep 14 13:17:08 raspberrypi dhcpcd[429]: wlan0: carrier acquired
Sep 14 13:17:08 raspberrypi dhcpcd[429]: wlan0: connected to Access Point `HOMENET' Sep 14 13:17:08 raspberrypi kernel: [ 13.249117] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready Sep 14 13:17:08 raspberrypi dhcpcd[429]: wlan0: IAID 32:95:fb:8e
Sep 14 13:17:08 raspberrypi dhcpcd[429]: wlan0: adding address fe80::f767:9ad7:a52c:710d
Sep 14 13:17:08 raspberrypi avahi-daemon[373]: Joining mDNS multicast group on interface wlan0.IPv6 with address fe80::f767:9ad7:a52c:710d. Sep 14 13:17:08 raspberrypi avahi-daemon[373]: New relevant interface wlan0.IPv6 for mDNS.
Sep 14 13:17:08 raspberrypi avahi-daemon[373]: Registering new address record for fe80::f767:9ad7:a52c:710d on wlan0.*. Sep 14 13:17:08 raspberrypi dhcpcd[429]: wlan0: probing address 192.168.100.50/24
Sep 14 13:17:09 raspberrypi dhcpcd[429]: wlan0: soliciting an IPv6 router
Sep 14 13:17:13 raspberrypi dhcpcd[429]: wlan0: using static address 192.168.100.50/24
Sep 14 13:17:13 raspberrypi avahi-daemon[373]: Joining mDNS multicast group on interface wlan0.IPv4 with address 192.168.100.50.
Sep 14 13:17:13 raspberrypi dhcpcd[429]: wlan0: adding route to 192.168.100.0/24
Sep 14 13:17:13 raspberrypi avahi-daemon[373]: New relevant interface wlan0.IPv4 for mDNS.
Sep 14 13:17:13 raspberrypi dhcpcd[429]: wlan0: adding default route via 192.168.100.1
Sep 14 13:17:13 raspberrypi avahi-daemon[373]: Registering new address record for 192.168.100.50 on wlan0.IPv4. Sep 14 13:17:14 raspberrypi dhclient[635]: DHCPREQUEST for 192.168.100.10 on wlan0 to 255.255.255.255 port 67
Sep 14 13:17:14 raspberrypi avahi-daemon[373]: Registering new address record for 192.168.100.10 on wlan0.IPv4. Sep 14 13:17:22 raspberrypi dhcpcd[429]: wlan0: no IPv6 Routers available

I have set a static IP (192.168.100.50)in my dhcpcd.conf file but it also taking dynamic (192.168.100.10). Here is the weird part: I can ping both IPs and I can SSH both IPs and my Ethernet cable is not connected.

How can wlan0 has 2 active IPs ? and could this be the source of my troubles ?

BlueWings172 commented 3 years ago

The dynamic IP does not show when I ran iwconfigor ifconfigbut the static ip does.

hostname -I does list both IPs

This is the output of ip addr command:

3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000 link/ether dc:a6:32:95:fb:8e brd ff:ff:ff:ff:ff:ff inet 192.168.100.50/24 brd 192.168.100.255 scope global noprefixroute wlan0 valid_lft forever preferred_lft forever inet 192.168.100.10/24 brd 192.168.100.255 scope global secondary dynamic wlan0 valid_lft 6033318sec preferred_lft 6033318sec inet6 fe80::f767:9ad7:a52c:710d/64 scope link valid_lft forever preferred_lft forever

Paraphraser commented 3 years ago

WIFI froze today right after containers restart (I have a cron job that runs the IOTstack backup script at 7:30 am every day).

The inference is that your backup script takes down your containers and starts them again afterwards. We should probably drill into why you are doing that.

It's a long time since I decided to go my own way with backup and restore (as in Paraphraser/IOTstackBackup). I have no idea what the supplied scripts do these days but my backup script runs without needing to take the stack down.


I see a lot of IPv6 in there. I think you're better off without it:

  1. Use sudo and your favourite text editor to edit:

    /etc/sysctl.conf
  2. Find the line:

    #net.ipv6.conf.all.forwarding=1
  3. Add these lines after that line:

    # From https://itsfoss.com/disable-ipv6-ubuntu-linux/
    net.ipv6.conf.all.disable_ipv6=1
    net.ipv6.conf.default.disable_ipv6=1
    net.ipv6.conf.lo.disable_ipv6=1
  4. Reboot.

Many guides (like the one in that "itsfoss" URL) say "and then run sudo sysctl -p" after step 3. That does work but there's a wrinkle. If you are connected via SSH and your connection was established via IPv6 then you pull the rug out from under yourself. It's much safer to reboot and have IPv6 disabled as the system comes up. Then, when your client re-attempts SSH it will be forced to use IPv4.


Question. Are you running hass.io? If yes, that implies Network Manager and it could be part of the problem.

My guess is you'll answer "no" because what you report doesn't quite fit my understanding of what Network Manager gets up to on WiFi with random MAC addresses but it's worth being sure.


It looks to me like what is going on is that your static address is working and DHCP is also asking for a lease for the same interface.

In principle, there is nothing stopping a NIC (network interface card) from having two IP addresses in the same broadcast domain (aka "subnet"). It's really no different from having two NICs (Ethernet and WiFi) being attached to the same broadcast domain.

I'd call it "unusual" and I might even add the words "unhelpful" and "confusing" but, in data-comms terms, I don't think it's any big deal.

My advice? Get rid of the static IP address, tell DHCP to always associate the same IP address with the Pi's WiFi MAC (and, while you are there, set up an association for the Pi's Ethernet interface too), and then let DHCP do all the work.

I practice what I preach. On my home network I only have two devices with static IP addresses: my router (.1) and my local DNS servers (.65).

The router is my DHCP server. It hands out dynamic leases in the range .193 … .253, and uses addresses below .193 for fixed MAC-to-IP bindings. My standard lease expiration time is 864000 seconds (10 days).

Strictly speaking, the DNS server could also get its IP address from DHCP but I follow "old school" precepts which hold that DNS servers should be static.

To answer your original question, though, I kinda doubt this is the source of your troubles.

My instinct is something about the process of taking the stack down and bringing it up again for the backup. It sounds/feels the same as the original problem that led us (IOTstackers) to the first patch (the "allowinterfaces" thing). After a reboot, Docker would try to start all the containers that had been running prior to the reboot, and something about all those doing DHCP, unnecessarily, and in a mad rush, would stop DHCP from happening for the WiFi and/or Ethernet interfaces, and you then had no way of reaching the Pi. As far as I could tell from the logs, the Pi wasn't ever "frozen" or "hung". It just couldn't do comms and, seeing as that was its raison d'être as an IoT hub, it may as well have been frozen.

I'm now starting to wonder if the first patch only deals with IPv4 DHCP and not IPv6 DHCP. I really don't know one way or the other. I just reckon things go better if you belt IPv6 right out of the ballpark.

Paraphraser commented 2 years ago

No further response. Should probably be considered stale.