mdns-repeater causes ChromeCast Audio devices to cease broadcasting mDNS responses when interface MTU exceeded

gjbadros commented 4 years ago

Because I'm not sure of the right set of fixes, I'm filing this issue with each of: Google Chromecast/Home Team (via email), https://github.com/jstasiak/python-zeroconf, https://github.com/kennylevinsen/mdns-repeater, https://github.com/home-assistant/plugin-multicast, and https://github.com/home-assistant/core

I just spent the weekend tracking this down as I apparently started using mdns-repeater unwittingly due to a change in HomeAssistant's hass.io and its new multicast plugin (https://github.com/home-assistant/plugin-multicast). That resulted in a revival of some terrible instability in my 30+ Google ChromeCast Audio (CCA) devices in my home -- the problem had gone away a couple months ago and I'd attributed it to IGMP issues, but I peeled the onion and this is what I found.

BUG #1 CHROMECAST AUDIO DEVICE PROCESS CRASHES ENDING mDNS:

The mDNS announcing process of Google Chromecast Audios (a discontinued product, unfortunately) dies when triggered by the steps below using mdns-repeater and python zeroconf (via Home Assistant in my case).

The CCA crash manifests as the multicast messages announcing the CCA and its audio groups no longer appearing. E.g., if you run

tcpdump -npi eno1 port 5353 and host [CCA_IP]

you'll see a couple of PTR responses coming from the devices every 10 seconds, announcing something like:

Chromecast-Audio-dc.a.................f._googlecast._tcp.local: type TXT, class IN, cache flush

When the set of steps below happens, these announcements stop until either:

1) the CCA reboots (power cycle); or

2) the CCA is forced to switch to a new WAP; e.g., I have a script that forces a client to reconnect to the WAP in order to kick this mDNS announcing process back on.
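To catch the stall without watching tcpdump by hand, something like the following could work. It is only a rough sketch: the 30-second threshold (three missed 10-second announcements) and the names `silent_hosts` and `watch` are my own assumptions, and it only looks at IPv4 mDNS traffic on 224.0.0.251:5353.

```python
import socket
import struct
import time

MDNS_GROUP = "224.0.0.251"  # IPv4 mDNS multicast group
MDNS_PORT = 5353


def silent_hosts(last_seen, now, timeout=30.0):
    """Return the hosts whose last mDNS packet is older than `timeout` seconds."""
    return sorted(ip for ip, ts in last_seen.items() if now - ts > timeout)


def watch(hosts, timeout=30.0):
    """Join the mDNS group and report watched hosts that have gone quiet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Port 5353 is usually already in use by another mDNS responder.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MDNS_PORT))
    mreq = struct.pack("4s4s", socket.inet_aton(MDNS_GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(1.0)

    last_seen = {ip: time.monotonic() for ip in hosts}
    while True:
        try:
            _data, (src, _port) = sock.recvfrom(9000)
            if src in last_seen:
                last_seen[src] = time.monotonic()
        except socket.timeout:
            pass
        for ip in silent_hosts(last_seen, time.monotonic(), timeout):
            print("no mDNS traffic from", ip)
```

Calling `watch(["192.168.1.50"])` with the CCA's address (a placeholder here) would keep printing a warning once that device has been quiet for longer than the timeout, which is the point at which my reconnect script would need to kick it.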

When those announcements cease, the Google Home app on Android stops showing "Play Music" links under that device in the display. HOWEVER, Google Home keeps a per-physical-WAP cache (keyed on the MAC of the WAP, not the SSID, so it is not shared across multiple mesh-networked WAPs on the same SSID), so you won't see the problem immediately. Instead, you have to go to another room, ensure the phone is connected to a different WAP, and then check that "Play Music" no longer shows up for that device.

It's worth noting that the TCP socket interface to each ChromeCast Audio device is still working after the mDNS announcing process has died. E.g., you can still play music and control the device via TCP APIs, you just can't discover the device via mDNS.

BUG #1 SUMMARY - HIGH SEVERITY but probably NO FIX: Google ChromeCast Audio must not crash due to bad network data. (But this probably won't get fixed since Google Home/Mini do not have the bug and the CCA is a discontinued product.)

BUG #2 PYTHON ZEROCONF SHOULD NOT SEND HUGE PACKETS

I have 30+ ChromeCast Audio devices and 80+ Google casting devices. A query for _googlecast._tcp.local. results in a response that's almost 4KB, far larger than the 1500-byte MTU on most ethernet switches. E.g., if I modify examples/browser.py to interrogate like so:

browser = ServiceBrowser(zeroconf, "_googlecast._tcp.local.", handlers=[on_service_state_change])

zeroconf will then publish those 4KB mDNS responses. They, of course, get IP fragmented, and that seems to be fine when multicasting directly to the CCAs and other devices. However, RFC 6762 (https://tools.ietf.org/html/rfc6762) section 17 states some requirements for Multicast DNS Message Size, and the fourth paragraph reads:

"A Multicast DNS packet larger than the interface MTU, which is sent using fragments, MUST NOT contain more than one resource record."

Larger than the interface MTU seems to me to mean that these Responses must limit themselves to no more than 1500 octets (except in the special case of a long single record that's too big). That's not the issue here -- the responses causing the crash are, e.g., 59 Resource Records (RR) in the answer (not a single long one).
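A captured UDP payload can be checked against that sentence mechanically. The sketch below is my reading of the rule, not anything taken from python-zeroconf: it assumes IPv4 with no IP options (20 bytes of IP header plus 8 bytes of UDP header), and the function name `violates_section_17` is made up for illustration.

```python
import struct


def violates_section_17(payload, mtu=1500, ip_udp_overhead=28):
    """True if the message would need IP fragmentation and carries more than one RR.

    `payload` is the raw UDP payload of an mDNS message. The DNS header is
    six 16-bit big-endian fields: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
    """
    if len(payload) < 12:
        raise ValueError("truncated DNS header")
    _id, _flags, _qdcount, ancount, nscount, arcount = struct.unpack("!6H", payload[:12])
    records = ancount + nscount + arcount  # questions are not resource records
    needs_fragments = len(payload) + ip_udp_overhead > mtu
    return needs_fragments and records > 1
```

A roughly 4KB response with 59 answer records fails this check, while a single oversized record (the special case the RFC allows) passes.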

For whatever reason, that problem alone is not causing the ChromeCast Audios to crash, but I strongly suspect that fixing this problem would fix the stack. I believe these MUST be broken up into separate UDP packets of length <= 1500 (the interface MTU) at the application layer (rather than using IP fragmentation).
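Splitting at the application layer could look roughly like this. It is a sketch of the general idea only, not the actual python-zeroconf code or its eventual fix: `pack_records` is a hypothetical helper, records are assumed to be already encoded to bytes, name compression is ignored (so the sizes are upper bounds), and the 1472-byte budget assumes a 1500-byte MTU minus 20 bytes of IPv4 header and 8 bytes of UDP header.

```python
def pack_records(records, max_payload=1472, header_len=12):
    """Greedily group encoded records so each DNS message fits in one datagram.

    Returns a list of groups; each group would be sent as its own mDNS
    message with its own 12-byte DNS header. A record that is too large on
    its own ends up alone in a group, which is the one case where RFC 6762
    permits fragmentation.
    """
    packets = []
    current = []
    size = header_len
    for rec in records:
        # Start a new message if this record would push us past the budget.
        if current and size + len(rec) > max_payload:
            packets.append(current)
            current = []
            size = header_len
        current.append(rec)
        size += len(rec)
    if current:
        packets.append(current)
    return packets
```

With 59 records of 68 bytes each (about 4KB in total), this yields three messages of 21, 21, and 17 records, each small enough to go out unfragmented.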

You can reproduce this using avahi-publish to create lots of records in a subdomain and then browsing that subdomain. The total length of the DNS records should exceed 2KB (for good measure to be sure it's big enough).

BUG #2 SUMMARY - MEDIUM SEVERITY AND STRAIGHTFORWARD FIX: python zeroconf MUST adhere to RFC 6762.

BUG #3 MDNS-REPEATER SOMEHOW TICKLES BUG #1 WHEN PRESENTED WITH MDNS IP FRAGMENTS

I've not investigated this thoroughly, but I suspect it's either some kind of UDP storm caused by a forwarding cycle that crashes the CCAs because of the fragmentation, or some kind of packet rewriting.

The only other open issue on home-assistant/plugin-multicast seems possibly relevant (https://github.com/home-assistant/plugin-multicast/issues/1) and jesserockz's note at the end is worth understanding/trying. I don't think the mdns-repeater code should be mirroring all the interfaces, so if it is, that's a bug.

Note that in the configuration where I can reproduce this, mdns-repeater is running inside Home Assistant's hass.io plugin, home-assistant/plugin-multicast.

I work around it by running a shell inside that docker environment and changing the run command to comment out the running of mdns-repeater (since just docker stop-ping that container results in the hassio supervisor restarting the container).

It may be worth noting that the machine on which hassio runs has many network interfaces:

$ ifconfig # output follows
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
enp2s0f0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
enp2s0f1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
hassio: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
veth0ac5dfe: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth0ff2059: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth1b05ec4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth3003e6f: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth347b241: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth54a968f: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth748acbe: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
vethc5fab43: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
vethedd7c47: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500

BUG #3 SUMMARY - UNCERTAIN SEVERITY AND FIX. Confirm it's behaving as expected when there are multiple interfaces and in the presence of UDP packets undergoing IP fragmentation.

Summary: I propose fixing python-zeroconf as the quickest and easiest change, and ideally someone more in tune with what homeassistant is trying to do with mdns-repeater could figure out the right fix to mdns-repeater and/or the way the multicast plugin is configured.

Let me know what more information you need.

kennylevinsen commented 4 years ago

This is quite the wall of text. :/

A good starting point would be using wireshark to monitor both the source and destination subnets for the mdns broadcast that crashes your ChromeCast Audio, that is, the subnet that the broadcast originated on and the subnet that mdns-repeater repeated it to. I don't have any devices that crash from bad mDNS packets, so you're a bit on your own with regards to finding the fault.

Note that mdns-repeater just blindly copies UDP packets targeting the mdns address from one interface to another. It forwards to all interfaces that have been specified by name on the command-line. That all interfaces are monitored is not a bug, it's just the current behavior. Easiest way to filter is using the blacklists.
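As a rough model of that behavior (an illustration only, not the actual mdns-repeater code): the helper name `forward_targets` is made up, and treating the blacklist as a list of source subnets is an assumption here.

```python
import ipaddress


def forward_targets(src_ip, recv_iface, ifaces, blacklist=()):
    """Return the interface names a packet from `src_ip` would be repeated to.

    `ifaces` are the interface names given on the command line and
    `blacklist` is an iterable of source subnets (assumed semantics) whose
    packets are dropped instead of repeated.
    """
    addr = ipaddress.ip_address(src_ip)
    if any(addr in ipaddress.ip_network(net) for net in blacklist):
        return []
    # Copy to every configured interface except the one it arrived on.
    return [name for name in ifaces if name != recv_iface]
```

The payload is never inspected in this model, so an oversized response that arrives on one interface is simply re-sent as-is on every other one.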

gjbadros commented 4 years ago

Yes, I think it's just the python-zeroconf that homeassistant embeds being exposed back onto the whole subnet. I no longer think mdns-repeater is mangling packets; it's just exposing the bad python-zeroconf-generated mdns responses from homeassistant in a docker container back onto the network where the Chromecast audios are dying.

Thanks for reading the wall :)

Greg

sbeckeriv commented 4 years ago

I have been running this in a docker container on unraid to support my split-up network so a chromecast ultra can stream plex. Often enough, I have seen that the audio for a video stream will not work. I thought it was an issue between the chromecast and the speaker, but this wall of text might also be my issue. Unlike gjbadros, I have just been restarting my chromecast and blaming sonos. I look forward to anything that comes out of this issue, or to testing any potential solutions.

Thanks again! Becker

kennylevinsen commented 4 years ago

See https://github.com/jstasiak/python-zeroconf/issues/245 and https://github.com/jstasiak/python-zeroconf/pull/248.

Closing this as there were no mdns-repeater issues identified.

gjbadros commented 4 years ago

The fix was isolated to python-zeroconf so Stephen you're welcome to try the latest version of that and see if it makes a difference. I suspect it doesn't affect your scenario since the problem was about Google ChromeCast Audio (CCAs not Ultras) having mdns announcement failures under certain relatively unusual situations.
