Open chmac opened 1 month ago
Okay, so by now the relay has stopped working. The last packet relayed was `May-01 14:06:32`. There's no error that I can see. Here's the last piece of the logs.
The first entry was `May-01 07:32`, so it seems to have run for ~7h or so. I guess I could have a cron job to restart it every 3h; it's a stateless process, so there's an easy workaround. Just wondering if I've misconfigured something that could result in this behaviour.
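For reference, the cron workaround could look like the entry below. The service name `multicast-relay` and the 3-hour interval are assumptions; adapt it to however the relay is actually launched on your machine.

```
# Hypothetical /etc/crontab entry: restart the relay every 3 hours.
0 */3 * * * root systemctl restart multicast-relay
```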
I am puzzled. My relay instance runs for months if not years; I never have to restart it. Maybe you can run it with `strace` and see what it's up to right before it gives up?
Yeah, it seems odd. Given the number of people running this script with success, I imagine there must be an issue with my setup. I've started the script again with `strace -o file.log`, so hopefully when it stops in ~7h or so I should learn something...
Last log entry from the script was `May-03 17:32:43`. But `tail -f` on the strace log file is still streaming output. I don't see any timestamps in the strace output, so I'm not sure how to figure out what it was doing when it stopped working. Any recommendations on that? Maybe there's a different way to run `strace` that produces more useful output?
This morning the script stopped after ~3m. So I guess something is happening that's causing the error condition. But I'm not sure how to investigate what that is!
Is there any kind of debug environment variable or something I can use to generate more detailed logs from the program? Or can I upload the 130MB strace file somewhere? :-)
I've just added `--k8sport`, so I will see if that responds `OK` when it's in the broken state.
Well, that was quick. The status report currently returns `OK` and I see that logged by the script, but nothing else. No packets are being forwarded, and Sonos doesn't work. This time the script only ran for ~5 minutes before getting into this state.
I guess it's not stuck entirely, although it has to be said that the k8s check does run in its own separate thread.
What it's doing after it gets stuck would be interesting to see, especially if it's just repeating the same thing over and over again. Also what it was up to for 30s before and after the last packet that it did forward could be interesting. I'm curious to see if we can find out just exactly what's happening here.
@alsmith Any suggestions how I can figure that out? Can I post some logs?
I've added `-t` to the `strace` command. It's generating around 30 lines per second, but once the script stops again, hopefully it will be possible to correlate some of the strace entries with the script's log output.
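Once `strace -t` output is available, correlating it with the script's own log timestamps can be scripted. A minimal sketch, assuming both logs carry a plain `HH:MM:SS` prefix per line (the file names and exact formats are assumptions; adjust the parsing to match the real logs):

```python
# Pull strace lines whose HH:MM:SS timestamp falls near a timestamp of
# interest, e.g. the script's last logged packet. Formats are assumed.
from datetime import datetime, timedelta

def parse_hms(line):
    """Return the HH:MM:SS prefix of a line as a datetime, or None if absent."""
    try:
        return datetime.strptime(line.split()[0], "%H:%M:%S")
    except (ValueError, IndexError):
        return None

def lines_near(strace_lines, stamp, window_seconds=30):
    """Return strace lines timestamped within +/- window_seconds of `stamp`."""
    target = datetime.strptime(stamp, "%H:%M:%S")
    window = timedelta(seconds=window_seconds)
    return [l for l in strace_lines
            if (t := parse_hms(l)) is not None and abs(t - target) <= window]
```

Usage would be along the lines of `lines_near(open("file.log").read().splitlines(), "17:32:43")` to see what the process was doing for 30s either side of the last relayed packet.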
Okay, got some logs which are potentially easier to correlate.
As the excerpt from the strace log is 135K, I pasted it here instead of in the issue.
I've tried adding some more log entries to the code, to see if I can figure out where it's getting stuck. It doesn't seem like there's any kind of debug-level logging option. I'll try adding a bunch more `logger.info()` calls and see what I can figure out...
Okay, adding a new log entry before line 595, I see that this block starts to return true for everything. So I get thousands of this log entry, and no packets are being forwarded.
I'm starting to suspect that something in the network is setting off a broadcast storm, to such an extent that the relay gets so bogged down handling traffic it doesn't care about that it never gets to the traffic it should care about. Can you share the CLI arguments you start the relay with?
@alsmith It could be; it seems like it receives thousands of packets a second, which I guess is not usual?
`sudo python3 multicast-relay.py --interfaces "vlan2" "vlan3" --foreground --noMDNS --k8sport 2000 --verbose | tee -a mr_$(dr).log`
That's a copy and paste of the exact startup command. The two interfaces `vlan2` and `vlan3` are on the same NIC, but each is on a different subnet and VLAN. Could it be that something is messed up with how I created the VLAN interfaces, and maybe they're on the same VLAN or something? They definitely have different IPs on different subnets, but I haven't done much with VLANs on Ubuntu in the past, so I could easily have messed something up.
Thanks for the effort you've put into this issue, I know it's hard work producing, releasing, and most of all, supporting open source software, so I appreciate that you're so responsive. 🧡
Another thought, this machine is also running docker, and so there's a ton of virtual docker interfaces on the machine. I'm not sure if that's relevant or not. I know that docker messes with iptables such that the ufw firewall has no effect when run on the same host as docker by default. Maybe there's something similar going on here? I could always try running this on a separate machine, or even on my UDM-SE if docker might be the issue.
One further thought is to take a tcpdump on vlan2 and/or vlan3 to see just what's going on when it gets into the stuck state. `tcpdump -eni vlan2` would do the trick, so you can also see the Ethernet MACs. Maybe it will be clear just what packet(s) are on the network at that time; I'm expecting that there are just many, many broadcast packets on one or maybe even both interfaces.
Tried running tcpdump alongside the script. I got excited thinking that my VPN was flooding the script because it was generating a ton of UDP traffic, but I switched that off, and no change. I don't see any uptick in the number of packets per minute (somewhere around 100 per minute) when the script stops working. I'm not sure what a broadcast storm looks like, but I assume it would result in more than 100 packets per minute?
Somehow it seems that all received packets start to get skipped here:
I've added the `dstAddr` and `dstPort` to my log output, so maybe that will shed some light on why the packets are getting skipped. From checking this against the tcpdump output, it looks like the script no longer receives packets on port 1900. Could the listener have somehow died without the script noticing?
That's what my investigation suggests. Somehow the script is not receiving incoming packets on 1900. I'm also not getting any kind of error message in the logs. My tcpdump shows packets on port 1900 like so:
12:15:07.208962 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:07.209035 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:07.209221 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:08.214935 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:08.214940 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:08.214943 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:09.217594 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:09.217599 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:09.217601 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:10.221634 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:10.221639 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:10.221642 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:11.225694 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:11.225769 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
Those all seem to be generated when I start the Sonos app, but none of them are seen by the script. Does this shed any light on what might be going on?
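One way to separate the script from the OS here: while the relay is in the stuck state, bind a fresh UDP socket on port 1900 and see whether the broadcasts tcpdump shows actually reach userspace. A rough sketch under those assumptions (port and bind address are guesses at the setup; it may be necessary to stop the relay first so the bind succeeds):

```python
# Standalone check: does a brand-new socket see the SSDP broadcasts?
import socket

def make_listener(port=1900, bind_addr="0.0.0.0"):
    """Bind a fresh UDP socket on the given port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((bind_addr, port))
    return sock

def drain(sock, timeout=10.0):
    """Print datagrams until `timeout` seconds pass with none; return the count."""
    sock.settimeout(timeout)
    seen = 0
    try:
        while True:
            data, (addr, src_port) = sock.recvfrom(65535)
            seen += 1
            print(f"{len(data)} bytes from {addr}:{src_port}")
    except socket.timeout:
        pass
    return seen

if __name__ == "__main__":
    print(drain(make_listener()), "datagrams seen")
```

If this fresh socket sees the port-1900 traffic while the relay's own socket stays silent, that points at the relay's listener rather than the OS or the network.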
Have you given up on this issue?
I have since run the script on the unifi dream machine se and it runs without issue, so I guess I'll just do that and restart it after reboots and so on.
It's hard to know where to go from here.
What did you find out from adding `dstAddr` and `dstPort` to the logs, as well as the output of `self.match()`?
If `select()` starts not returning the packets that we're interested in, especially after a period of time, then it could be an issue elsewhere, such as within Python or libc.
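If the `select()` hypothesis holds, a blunt workaround is a watchdog around the receive loop that rebuilds the socket after a period with no traffic. This is only a sketch of the idea, not the relay's actual loop; the callbacks and the staleness threshold are made up:

```python
# Watchdog sketch: recreate the receive socket after `stale_after` idle seconds.
import select
import socket
import time

def relay_loop(make_socket, handle, stale_after=60.0):
    """Receive via select(); handle() returning False stops the loop."""
    sock = make_socket()
    last_seen = time.monotonic()
    while True:
        readable, _, _ = select.select([sock], [], [], 1.0)
        if readable:
            data, addr = sock.recvfrom(65535)
            last_seen = time.monotonic()
            if handle(data, addr) is False:
                sock.close()
                return
        elif time.monotonic() - last_seen > stale_after:
            sock.close()
            sock = make_socket()  # fresh fd, fresh kernel state
            last_seen = time.monotonic()
```

It doesn't explain the root cause, but if a rebuilt socket starts receiving again, that would be strong evidence the old file descriptor went quietly dead.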
Firstly, thank you kindly for this script. After weeks of fighting with Sonos across VLANs, I ran the script and it all worked immediately! 🧡🧡🧡
I saw this comment about having to restart the script regularly. But it suggests every few minutes. My experience is that the script works great for some period of time, maybe 24h or so. Then Sonos stops working, and if I restart the script, Sonos immediately works.
I was running the script with `--foreground` in a terminal, and no errors were generated. I've added `--verbose` and I'm logging to a file now. I will try to see what's going on when Sonos stops working. Any other tips on how to debug this?