alsmith / multicast-relay

Relay multicast and broadcast packets between interfaces.
GNU General Public License v3.0

Script needs to be restarted regularly #88

Open · chmac opened this issue 1 month ago

chmac commented 1 month ago

Firstly, thank you kindly for this script. After weeks of fighting with Sonos across VLANs, I ran the script and everything worked immediately! 🧡🧡🧡

I saw this comment about having to restart the script regularly, but that one suggests restarting every few minutes. My experience is that the script works great for some period of time, maybe 24h or so, then Sonos stops working; if I restart the script, Sonos immediately works again.

I was running the script with --foreground in a terminal, and no errors were generated. I've added --verbose and I'm logging to a file now. I will try to see what's going on when Sonos stops working.

Any other tips on how to debug this?

chmac commented 1 month ago

Okay, so by now the relay has stopped working. The last packet relayed was May-01 14:06:32. There's no error that I can see. Here's the last piece of the logs.

Logs:

```
May-01 14:04:41 multicast-relay.py INFO: [SSDP] Relayed 807 bytes from 192.168.3.63:49456 on vlan3 [ttl 1] to 239.255.255.250:1900 via vlan2/192.168.2.2
May-01 14:06:30 multicast-relay.py INFO: [SSDP] Relayed 203 bytes from 192.168.2.209:56675 on vlan2 [ttl 1] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-01 14:06:30 multicast-relay.py INFO: [SSDP] Relayed 203 bytes from 192.168.2.242:59990 on vlan2 [ttl 1] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-01 14:06:31 multicast-relay.py INFO: [SSDP] Relayed 203 bytes from 192.168.2.209:56675 on vlan2 [ttl 1] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-01 14:06:31 multicast-relay.py INFO: [SSDP] Relayed 203 bytes from 192.168.2.242:59990 on vlan2 [ttl 1] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-01 14:06:31 multicast-relay.py INFO: [SSDP] Relayed 756 bytes from 192.168.3.63:47674 on vlan3 [ttl 1] to 239.255.255.250:1900 via vlan2/192.168.2.2
May-01 14:06:31 multicast-relay.py INFO: [SSDP] Relayed 809 bytes from 192.168.3.63:47674 on vlan3 [ttl 1] to 239.255.255.250:1900 via vlan2/192.168.2.2
May-01 14:06:32 multicast-relay.py INFO: [SSDP] Relayed 759 bytes from 192.168.3.63:47674 on vlan3 [ttl 1] to 239.255.255.250:1900 via vlan2/192.168.2.2
May-01 14:06:32 multicast-relay.py INFO: [SSDP] Relayed 203 bytes from 192.168.2.209:56675 on vlan2 [ttl 1] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-01 14:06:32 multicast-relay.py INFO: [SSDP] Relayed 203 bytes from 192.168.2.242:59990 on vlan2 [ttl 1] to 239.255.255.250:1900 via vlan3/192.168.3.2
```

The first entry was May-01 07:32, so it ran for ~7h. Since it's a stateless process, a cron job restarting it every 3h would be an easy workaround. I'm just wondering if I've misconfigured something that could result in this behaviour.

alsmith commented 1 month ago

I am puzzled. My relay instance runs for months if not years; I never have to restart it. Maybe you can run it with strace and see what it's up to right before it gives up?

chmac commented 1 month ago

Yeah, it seems odd. Given the number of people running this script successfully, I imagine there must be an issue with my setup. I've started the script again under strace -o file.log, so hopefully when it stops in ~7h I'll learn something...

chmac commented 1 month ago

The last log entry from the script was May-03 17:32:43, but tail -f on the strace log file is still streaming output. I don't see any timestamps in the strace output, so I'm not sure how to figure out what it was doing when it stopped working. Any recommendations on that?

Maybe there's a different way to run strace that produces more useful output?

chmac commented 1 month ago

This morning the script stopped after ~3 minutes. So something is happening that triggers the error condition, but I'm not sure how to investigate what that is!

Is there any kind of debug environment variable or something I can use to generate more detailed logs from the program? Or can I upload the 130MB strace file somewhere? :-)

chmac commented 1 month ago

I've just added --k8sport, so I'll see whether that still responds OK when the relay is in the broken state.

chmac commented 1 month ago

Well that was quick. The status report currently returns OK, and I see that logged by the script, but nothing else. No packets are being forwarded, and Sonos doesn't work. This time the script only ran for ~5 minutes before getting into this state.
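
I'm polling the status with a trivial probe like this (my assumption is that the --k8sport listener answers a plain HTTP GET; if it's actually a bare TCP-connect check, the successful connect alone is the signal):

```python
# Hypothetical liveness probe against the relay's --k8sport listener.
# Port 2000 matches my startup command; the plain-HTTP assumption is mine.
import urllib.request

resp = urllib.request.urlopen('http://127.0.0.1:2000/', timeout=2)
print(resp.status)
```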

alsmith commented 1 month ago

I guess it's not stuck entirely, although it has to be said that the k8s check does run in its own separate thread.

It would be interesting to see what it's doing after it gets stuck, especially if it's just repeating the same thing over and over again. What it was up to for 30s before and after the last packet it did forward could also be telling. I'm curious to see if we can find out exactly what's happening here.

chmac commented 1 month ago

@alsmith Any suggestions on how I can figure that out? Can I post some logs?

I've added -t to the strace command. It's generating around 30 lines per second, but once the script stops again it should be possible to correlate the strace entries with the script's log output.

chmac commented 1 month ago

Okay, got some logs which are potentially easier to correlate.

Script log:

```
May-06 12:01:30 multicast-relay.py INFO: [SSDP] Relayed 204 bytes from 192.168.2.147:62479 on vlan2 [ttl 1] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-06 12:01:30 multicast-relay.py INFO: [SSDP] Relayed 204 bytes from 192.168.2.209:56926 on vlan2 [ttl 1] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-06 12:01:30 multicast-relay.py INFO: [SSDP] Relayed 204 bytes from 192.168.2.242:50383 on vlan2 [ttl 1] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-06 12:01:31 multicast-relay.py INFO: [SSDP] Relayed 204 bytes from 192.168.2.147:62479 on vlan2 [ttl 1] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-06 12:01:40 multicast-relay.py INFO: [SSDP] Relayed 275 bytes from 192.168.2.209:56791 on vlan2 [ttl 4] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-06 12:01:40 multicast-relay.py INFO: [SSDP] Relayed 275 bytes from 192.168.2.242:56791 on vlan2 [ttl 4] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-06 12:01:41 multicast-relay.py INFO: [SSDP] Relayed 275 bytes from 192.168.2.209:56791 on vlan2 [ttl 4] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-06 12:01:41 multicast-relay.py INFO: [SSDP] Relayed 275 bytes from 192.168.2.242:56791 on vlan2 [ttl 4] to 239.255.255.250:1900 via vlan3/192.168.3.2
May-06 12:02:00 multicast-relay.py INFO: [SSDP] Relayed 806 bytes from 192.168.3.63:49456 on vlan3 [ttl 1] to 239.255.255.250:1900 via vlan2/192.168.2.2
```

As the excerpt from the strace log is 135K, I pasted it here instead of in the issue.

chmac commented 1 month ago

I've tried adding some more log entries to the code to see if I can figure out where it's getting stuck. There doesn't seem to be any kind of debug-level logging option, so I'll add a bunch more logger.info() calls and see what I can figure out...

chmac commented 1 month ago

Okay, after adding a new log entry before line 595, I can see that this condition starts to evaluate true for everything. I get thousands of this log entry, and no packets are being forwarded.

https://github.com/alsmith/multicast-relay/blob/461d1c97f3aaf0ab9f8996e1a8f24c6b8204d6b4/multicast-relay.py#L593-L595

alsmith commented 1 month ago

I'm starting to suspect that something in the network is setting off a broadcast storm, to the point that the relay gets so bogged down handling traffic it doesn't care about that it never gets to the traffic it should care about. Can you share the CLI arguments you start the relay with?

chmac commented 1 month ago

@alsmith It could be; it seems like it receives thousands of packets a second, which I guess is unusual?

```
sudo python3 multicast-relay.py --interfaces "vlan2" "vlan3" --foreground --noMDNS --k8sport 2000 --verbose | tee -a mr_$(dr).log
```

That's a copy and paste of the exact startup command. The two interfaces, vlan2 and vlan3, are on the same NIC but each on a different subnet and VLAN. Could something be messed up with how I created the VLAN interfaces, so that they ended up on the same VLAN or similar? They definitely have different IPs on different subnets, but I haven't done much with VLANs on Ubuntu in the past, so I could easily have messed something up.
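
One quick check I can do: if the interfaces were created via the standard 8021q module, the kernel exposes the VLAN-to-device mapping, so something like this should confirm the IDs (a small sketch; the file is absent if 8021q isn't in use):

```python
# Print the kernel's 802.1Q VLAN table: each row maps an interface name
# to its VLAN ID and parent device. vlan2 and vlan3 should show two
# different VLAN IDs. Assumes the interfaces were created via 8021q.
with open('/proc/net/vlan/config') as f:
    print(f.read())
```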

Thanks for the effort you've put into this issue. I know it's hard work producing, releasing, and most of all supporting open source software, so I appreciate that you're so responsive. 🧡

chmac commented 1 month ago

Another thought: this machine is also running Docker, so there are a ton of virtual Docker interfaces on it. I'm not sure if that's relevant. I know Docker manipulates iptables such that, by default, the ufw firewall has no effect when run on the same host as Docker; maybe something similar is going on here? I could always try running this on a separate machine, or even on my UDM-SE, if Docker turns out to be the issue.

alsmith commented 1 month ago

One further thought is to take a tcpdump on vlan2 and/or vlan3 to see just what's going on when it gets into the stuck state. tcpdump -eni vlan2 would do the trick, and lets you see the Ethernet MACs too. Maybe it will be clear exactly what packet(s) are on the network at that time; I'm expecting many, many broadcast packets on one or maybe even both interfaces.

chmac commented 1 month ago

I tried running tcpdump alongside the script. I got excited thinking my VPN was flooding the script, because it was generating a ton of UDP traffic, but I switched it off and nothing changed. I don't see any uptick in packets per minute (somewhere around 100 per minute) when the script stops working. I'm not sure what a broadcast storm looks like, but I assume it would be more than 100 packets per minute?

Somehow it seems that all received packets start to get skipped here:

https://github.com/alsmith/multicast-relay/blob/461d1c97f3aaf0ab9f8996e1a8f24c6b8204d6b4/multicast-relay.py#L593-L595

I've added dstAddr and dstPort to my log output, so maybe that will shed some light on why the packets are getting skipped. Checking this against the tcpdump output, it looks like the script no longer receives packets on port 1900. Could the listener have somehow died without the script noticing?

That's what my investigation suggests: somehow the script is not receiving incoming packets on 1900, and I'm not getting any kind of error message in the logs. My tcpdump shows packets on port 1900 like so:

```
12:15:07.208962 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:07.209035 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:07.209221 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:08.214935 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:08.214940 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:08.214943 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:09.217594 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:09.217599 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:09.217601 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:10.221634 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:10.221639 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:10.221642 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:11.225694 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
12:15:11.225769 00:la:pt:op:ma:ca > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 235: 192.168.2.209.60362 > 255.255.255.255.1900: UDP, length 193
```

Those all seem to be generated when I start the Sonos app, but none of them are seen by the script. Does this shed any light on what might be going on?
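
Next I'll try a standalone listener to check whether those datagrams reach userspace at all, something like this (a quick sketch; the IPs are my vlan2/vlan3 addresses, and the relay needs to be stopped while it runs so the two don't fight over the traffic):

```python
# Standalone check: do SSDP datagrams on port 1900 reach userspace at
# all, independent of multicast-relay? Binds 0.0.0.0:1900 (which also
# receives broadcasts to 255.255.255.255:1900), joins the SSDP multicast
# group on each interface address, and prints every datagram it sees.
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('', 1900))

for local_ip in ('192.168.2.2', '192.168.3.2'):  # vlan2 / vlan3
    mreq = struct.pack('4s4s',
                       socket.inet_aton('239.255.255.250'),
                       socket.inet_aton(local_ip))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    data, addr = sock.recvfrom(65535)
    print('%s:%d -> %d bytes' % (addr[0], addr[1], len(data)))
```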

chmac commented 1 week ago

Have you given up on this issue?

I have since run the script on the UniFi Dream Machine SE, where it runs without issue, so I guess I'll just do that and restart it after reboots and so on.

alsmith commented 1 week ago

It's hard to know where to go from here.

What did you find out from adding dstAddr and dstPort to the logs, as well as the output of self.match()?

If select() stops returning the packets that we're interested in, especially only after a period of time, then the issue could be elsewhere, such as within Python or libc.
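
For context, the receive path is conceptually a select() loop over the listening sockets, along these lines (a simplified sketch, not the verbatim multicast-relay source):

```python
# Simplified sketch of a select()-based receive loop. If select() stops
# reporting a socket as readable, this loop just blocks quietly: no
# exception, no log output, no forwarding, which matches the symptom.
import select

def relay_loop(receiver_sockets, handle_packet):
    while True:
        readable, _, _ = select.select(receiver_sockets, [], [])
        for sock in readable:
            data, addr = sock.recvfrom(10240)
            handle_packet(sock, data, addr)
```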