DoESLiverpool / somebody-should

A place to document practices on the wiki and collect issues/suggestions/to-do items for the physical space at DoES Liverpool

mqtt.local went down and didn't come back up cleanly after reboot #1210

Open amcewen opened 5 years ago

amcewen commented 5 years ago

Since yesterday the Liverbird hasn't been showing our energy usage.

Doing a bit of poking into it, I found that mqtt.local was offline. @ajlennon power-cycled it, which has brought it back up, but it's failing to connect to its influxdb instance.

amcewen commented 5 years ago

+1 for .local, it's fewer characters to type :-)

skos-ninja commented 5 years ago

Consider it done

ajlennon commented 5 years ago

So I've now taken a look.

I've remoted into the box through the OpenBalena instance it is connected to and it's running the avahi-daemon (mDNS) on the wrong hostname. We can ping this hostname.

I've posted this to the balena forum

Hi,

I’m running multiple containers on a RPIv3 on our network here.

The hostname is set to “mqtt” and when we power up I can ping mqtt.local

After a while (days+) we can no longer access mqtt.local

I’ve remoted in through the OpenBalena instance it is connected to and checked what is running.

I see the avahi-daemon is on a different hostname

754 avahi 5924 S avahi-daemon: running [mqtt-39.local]
770 avahi 5136 S avahi-daemon: chroot helper
4892 redsocks 4008 S avahi-daemon: registering [b356345-80720.local]
4893 redsocks 3472 S avahi-daemon: chroot helper

Sure enough if I ping mqtt-39.local I get a response.

Can you point me in the right direction to understand why the daemon is configured with the extra -39 instead of the hostname?

Thanks!

Alex

https://forums.balena.io/t/mdns-local-access-to-device-failing-after-a-bit/27784
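
The check described in the forum post can be sketched in a few lines of Python. This is only an illustration: the function names `advertised_names` and `unexpected_names` are made up here, and the sample text assumes one process per line, as in the `ps` output quoted above.

```python
import re

# Matches the name inside the brackets of an avahi-daemon status line, e.g.
# "avahi-daemon: running [mqtt-39.local]".
AVAHI_NAME = re.compile(r"avahi-daemon: (?:running|registering) \[([^\]]+)\]")


def advertised_names(ps_output: str) -> list[str]:
    """Return every mDNS name that avahi-daemon reports in the ps output."""
    return AVAHI_NAME.findall(ps_output)


def unexpected_names(ps_output: str, hostname: str) -> list[str]:
    """Return advertised names that are not '<hostname>.local'."""
    expected = f"{hostname}.local"
    return [name for name in advertised_names(ps_output) if name != expected]


# Sample lines copied from the forum post above.
SAMPLE = """\
754 avahi 5924 S avahi-daemon: running [mqtt-39.local]
770 avahi 5136 S avahi-daemon: chroot helper
4892 redsocks 4008 S avahi-daemon: registering [b356345-80720.local]
4893 redsocks 3472 S avahi-daemon: chroot helper
"""

print(unexpected_names(SAMPLE, "mqtt"))
```

A suffix such as -39 is what you would expect to see if avahi believed the name mqtt was already taken on the network, so an unexpected name here is a hint to look for a conflict rather than a misconfigured hostname.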

ajlennon commented 5 years ago

Did you make that change @skos-ninja ?

More specifically do we have a DNS configured .local domain?

I am reading now that the mDNS support gets upset if we do!

(And the mqtt.local box is going bonkers as it thinks there are hostname conflicts everywhere)

ajlennon commented 5 years ago

NB. I know I asked you to do this! I think I made a mistake...

ajlennon commented 5 years ago

Thread of conversation on Avahi behaviour here

https://forums.balena.io/t/mdns-local-access-to-device-failing-after-a-bit/27784/17

goatchurchprime commented 5 years ago

I've found that the mDNS implementation for the espurna sonoff plugs is complete, since this works:

ping ESPURNA-547CD9.local

It's a shame that mDNS works for the IoT ESP8266-based endpoint devices, but not for the main RPi broker.

johnmckerrell commented 4 years ago

Having been asked to debug this I had a poke around on the router and found a setting:

Register client hostname from DHCP requests in USG DNS forwarder: ON/OFF

Which I found in Settings -> Services -> DHCP -> DHCP Server

That appears to be taking the hostname that was passed in the DHCP requests and returning it in DNS requests, and doing this for a long time after that device has disappeared. I've turned this off and Sams-iPhone.local now seems to have stopped working (which is correct) at least if I clear my cache.

johnmckerrell commented 4 years ago

Also:

WiFi b8:27:eb:cb:96:8c - was configured to be 10.0.100.1 in the DHCP - now configured to 10.0.100.2
Wired b8:27:eb:9e:c3:d9 - wasn't configured, now configured to be 10.0.100.1

johnmckerrell commented 4 years ago

And.. on further discussion I've removed those IP addresses from the DHCP configuration, but we can say that those two IP addresses are allocated to this purpose, so should be manually assigned on the box itself (doesn't matter to me if you don't use both of them but I'll record them as being for this purpose on the network documentation).

MatthewCroughan commented 4 years ago

Things that previously used .localdomain are essentially not able to be pinged by their hostname on the network it seems, as I can no longer reach them. This includes Alex's Octoprint instance. They used to coexist, which is quite strange in and of itself and shouldn't be possible, but now they do not. Devices that used localdomain are now unresponsive on anything but their ip.

This device was previously accessible at octopi.localdomain but is now only accessible at its IP at 10.0.39.51

# Generated by resolvconf
domain local
nameserver 10.0.0.1
nameserver 1.1.1.1
nameserver 1.0.0.1

My resolv.conf now shows domain local rather than domain localdomain which is default on a lot of Linux/FreeBSD systems. It may be true that however @ajlennon has his Pi setup with Balena or otherwise is permanently configured to use localdomain which is something that the network is no longer respecting.

I have no idea how the router could have anything to do with this other than DROPPING the packets that are related to .localdomain. I've reconfigured a bunch of my devices and they mostly all changed to local on their own.

MatthewCroughan commented 4 years ago

image

If we've put local in the domain field of whatever the equivalent of this setting is in our router software, we have definitely made a big mistake, as PFSense outlines in its general settings page.

Do not use 'local' as a domain name. It will cause local hosts running mDNS (avahi, bonjour, etc.) to be unable to resolve local hosts not running mDNS.

This is definitely the problem I'm observing, as I've had to install avahi-daemon on a bunch of machines that I did not previously.

Now, if this is true, we are in a situation where every device must install something equivalent to avahi-daemon despite the fact that the DNS Server on the router can resolve these just fine, without clients needing to have their own instance of avahi-daemon

Somewhere in base networking protocols, without avahi or mDNS hostnames are transferred to the router. If our domain is set to .local rather than something else like .localdomain or .lan it means we can't resolve hosts that aren't running mDNS.

Suppose we set the domain to .local and have a device with a hostname of foo that is not running mDNS. It has obtained a DHCP lease from the router, so the router now knows its hostname as configured on the device via some part of DHCP. If the router receives a lookup for foo.local then it will return the IP address of foo.local successfully.

However, if you try to look up foo.local from a separate device that is running mDNS via the avahi-daemon then the lookup will fail, because the mDNS daemon is preferred. It will not be able to return foo.local's IP address, because foo is not running mDNS.

Not using .local as the router's domain avoids this scenario. It lets all devices look up each other's hostnames, including devices that aren't running mDNS daemons, which would not be possible at all if we enforced .local as the router's domain, something PFSense, OpenWRT and more warn against.
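
The scenario above can be written down as a toy model. This is a sketch under stated assumptions, not a description of how any particular resolver is implemented: the rule that an mDNS-enabled client never falls back to the router for .local names is taken from the comment above, the address for foo.local is invented, and the other two addresses are the ones quoted elsewhere in this thread.

```python
from typing import Optional

# Names the router has learned from DHCP leases (foo.local's address is
# hypothetical), and names answered by devices running an mDNS responder
# such as avahi-daemon.
ROUTER_DNS = {"foo.local": "10.0.0.50", "octopi.localdomain": "10.0.39.51"}
MDNS_RESPONDERS = {"mqtt.local": "10.0.30.194"}


def resolve(name: str, client_uses_mdns: bool) -> Optional[str]:
    """Resolve a name following the rule described in the comment above.

    Assumption: a client with an mDNS stack sends every .local name to mDNS
    and stops there. Anything else, and every name on a client without mDNS,
    goes to the router's DNS.
    """
    if client_uses_mdns and name.endswith(".local"):
        return MDNS_RESPONDERS.get(name)  # no fallback to the router
    return ROUTER_DNS.get(name)


print(resolve("foo.local", client_uses_mdns=False))
print(resolve("foo.local", client_uses_mdns=True))
print(resolve("octopi.localdomain", client_uses_mdns=True))
```

In this model the same name, foo.local, resolves on a client without mDNS and fails on a client with it, while a .localdomain name resolves on both.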

MatthewCroughan commented 4 years ago

Now, what has occurred is that you cannot ping hostnames unless you have an mDNS daemon installed on your system, and vice-versa. This is not the way it should be done and explains why all the devices that had .localdomain are no longer visible to even the router itself. All we have done is invalidate the utility of the router's DNS, as it can no longer report back a lookup to a hostname at all.

When you run dig and specify .local, it makes sure to make you aware that .local is reserved for Multicast DNS. mDNS is not supposed to be implemented or enforced by the router's domain.

; <<>> DiG 9.14.5 <<>> matt-octoprint.local @10.0.0.1
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 59060
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;matt-octoprint.local.          IN      A

;; Query time: 1 msec
;; SERVER: 10.0.0.1#53(10.0.0.1)
;; WHEN: Fri Sep 27 03:54:24 BST 2019
;; MSG SIZE  rcvd: 49

@goatchurchprime This is why .localdomain is a thing, or exists at all. So that devices without mDNS can still look up hostnames without mDNS.
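
If you want to run the dig check above against several hostnames, the interesting fields can be pulled out of the text. The helper below is a minimal sketch: `dig_summary` is a made-up name, and it relies only on the header wording visible in the output quoted above, so it may need adjusting for other dig versions.

```python
import re


def dig_summary(output: str) -> dict:
    """Extract status, answer count and the mDNS warning from dig output."""
    status = re.search(r"status: (\w+)", output)
    answers = re.search(r"ANSWER: (\d+)", output)
    return {
        "status": status.group(1) if status else None,
        "answers": int(answers.group(1)) if answers else 0,
        "mdns_warning": ".local is reserved for Multicast DNS" in output,
    }


# Header lines copied from the dig output above.
SAMPLE = """\
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 59060
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
"""

print(dig_summary(SAMPLE))
```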

skos-ninja commented 4 years ago

I can have a look at this later today, but UniFi supports rebroadcasting mDNS responses so that they still work in this case.

johnmckerrell commented 4 years ago

"you cannot ping [local] hostnames unless you have an mDNS daemon"

Given that's the whole point of mDNS I don't think there's anything particularly non-standard going on here. Having the router's internal hidden DNS proxy also happen to return results for things random people have told it on DHCP sounds a bit more non-standard but what do I know?

I've turned back on the DHCP results showing in DNS and set the network's domain to localdomain, I also tried does.localdomain but that didn't seem to work either. I'll leave it as is for now, maybe it'll only work for things when they renew their DHCP leases.

MatthewCroughan commented 4 years ago

@johnmckerrell What I meant to say is that you can't ping DNS (hostnames over DHCP leasing, was a thing before mDNS existed) if you use mDNS on your system. Which is a problem, since whatever has been changed means you can't:

ping mqtt
ping mqtt.localdomain
ping mqtt.local

UNLESS you have an mDNS daemon on your computer. And that will only respect .local, since it's mDNS. And if you are not running an mDNS daemon the network doesn't respond if the router domain is .local, because that's reserved for mDNS. When using a .local router domain, mDNS is all that can be used which means it will no longer respect, lookup or return non-mDNS hostnames for reasons I'm not 100% aware of, but something about conflicts.

mDNS is not the only way of getting a hostname, it's fairly modern and it just makes things easier when it's added onto a network. By using .local for the router domain it makes it impossible to use regular DNS for hostnames. This is why localdomain exists as a convention.

A person was trying to use Alex's printer earlier but couldn't because octopi.localdomain is no longer accessible, because he's running an mDNS daemon, and mDNS doesn't see .localdomain, but if the router domain was anything else, be it .localdomain or .lan, it would return the address of that machine regardless since the mDNS would failover to the gateway's DNS resolver, which of course knows about it because of its DHCP lease.

Devices that do not have an mDNS daemon cannot participate their hostnames on the network in this configuration.

No mDNS daemon on your system = can't see anything mDNS = Can only see .local

octopi.localdomain works even if the device is not running avahi, because hostnames are transferred via DHCP without any mDNS functionality, which is great. This functionality is made impossible when the router domain is .local as PFSense and OpenWRT outline.

johnmckerrell commented 4 years ago

@MatthewCroughan given you were talking about ARP records yesterday it seems like this is new knowledge to you too. I have already made the changes to mostly re-enable what we had previously just with a network domain of localdomain rather than the conflicting local and did so before your recent comments. Can you maybe now wait until you've been able to test before trying to teach me about this?

MatthewCroughan commented 4 years ago

@johnmckerrell I'm not trying to teach you about anything. I've just been discussing it all night with a friend online and am coming to realise why localdomain is a thing. I'll curb the enthusiasm, sorry :)

The ARP record comment yesterday was made before reading into any of this, or looking at my own PFSense and reading their documentation on how mDNS, caching options and more work. The Ubiquiti firmware looks like it has way more niche and non-standard features though, so there's probably a million things that are going on that I have no idea about.

amcewen commented 4 years ago

@johnmckerrell, you said:

WiFi b8:27:eb:cb:96:8c - was configured to be 10.0.100.1 in the DHCP - now configured to 10.0.100.2
Wired b8:27:eb:9e:c3:d9 - wasn't configured, now configured to be 10.0.100.1

Does that mean that mqtt.local should be resolving to one (or both) of those IP addresses? At present neither of those IP addresses is responding to pings, and it seems to be resolving to 10.0.30.194 at the moment?!?

$ ping mqtt.local
PING mqtt.local (10.0.30.194) 56(84) bytes of data.
64 bytes from 10.0.30.194 (10.0.30.194): icmp_seq=1 ttl=64 time=3.86 ms
64 bytes from 10.0.30.194 (10.0.30.194): icmp_seq=2 ttl=64 time=6.38 ms
64 bytes from 10.0.30.194 (10.0.30.194): icmp_seq=3 ttl=64 time=2.50 ms
64 bytes from 10.0.30.194 (10.0.30.194): icmp_seq=4 ttl=64 time=5.46 ms

johnmckerrell commented 4 years ago

@amcewen I also said "And.. on further discussion I've removed those IP addresses from the DHCP configuration"

It seemed like the box was statically configured and to help with portability elsewhere we thought that would be best, but it seems like it might not be the case.

ajlennon commented 4 years ago

It seemed like the box was statically configured and to help with portability elsewhere we thought that would be best, but it seems like it might not be the case.

Not by me. @goatchurchprime? @MatthewCroughan ?

MatthewCroughan commented 4 years ago

@ajlennon @johnmckerrell Are we saying that there's a box somewhere with an mDNS Daemon mqtt.local that is statically configured, that is not @ajlennon's balena pi that we're otherwise not aware of?

johnmckerrell commented 4 years ago

No, I don't think so.

WiFi b8:27:eb:cb:96:8c - was configured to be 10.0.100.1 in the DHCP - now configured to 10.0.100.2
Wired b8:27:eb:9e:c3:d9 - wasn't configured, now configured to be 10.0.100.1

I think when we looked, the wired interface had 10.0.100.1, and the WiFi one was trying to get it and having issues so we figured that the wired one was manually configured. It seems like that might not be the case?

ajlennon commented 4 years ago

Historically mqtt.local has changed its IP address - I think you found this @amcewen

My understanding is that it's changed its IP address again.

My belief is that it is picking up an IP address from the DHCP server on the network unless somebody else has been in there and changed things around.

I can double check this tomorrow.

johnmckerrell commented 4 years ago

The 10.0.100.1 & 2 ones have been allocated for this use (stuck on a wiki) so please do use them. Or I can put them back into the dhcp settings if we prefer.

-- Sent from my mobile phone hence brevity and errors

On 1 Oct 2019, at 22:23, Alex Lennon notifications@github.com wrote:

 Historically mqtt.local has changed its IP address - I think you found this @amcewen

My understanding is that it's changed its IP address again.

My belief is that it is picking up an IP address from the DHCP server locally unless somebody else has been in there and changed things around.

I can double check this tomorrow.

MatthewCroughan commented 4 years ago

@johnmckerrell this caching issue is happening again. image

MatthewCroughan commented 4 years ago

image

Despite the fact that the Pi is running avahi-daemon, it is not returning .local

I believe this is because whatever this feature is, it prevents mDNS discovery when a .localdomain addr is cached. I really hope this can get solved.

Whatever the case, not providing .local or .localdomain when interacting and only providing the hostname seems to work. ping ender3-octoprint will still work, which is all that matters.

amcewen commented 4 years ago

An additional datapoint...

I haven't had any problems talking to a number of Pis with my Museum in a Box stuff over the past week or two. They're all configured with a hostname of box - there's the one on the bookcase by the main door, which has been up for 9 days now - and then three more Pis which have been on and off repeatedly while I've been testing things (although only one of them on at any one time, but I've been switching between them lots)

I haven't had any problems talking to them with ssh pi@box.local and ssh pi@box-2.local, and similarly talking to the Node RED instances in a browser. The one on the bookcase has been both box.local and box-2.local at various points, but the other Pi (the one I've been trying to contact during the testing) has always responded at the other name. I basically run uptime when I've logged in to double-check I'm on the right Pi.

I don't ever try connecting to them without the .local bit, and haven't ever tried .localdomain until just now, when it worked fine.

ajlennon commented 4 years ago

OK so I have restarted mqtt.local with only the wired interface supported. It appears to be responding to mqtt.local on the expected IP address

johnmckerrell commented 4 years ago

I can ping ender3-octoprint.local, also .localdomain, I can't ping it without those because then it tries to resolve to my work vpn network.

MatthewCroughan commented 4 years ago

@johnmckerrell My understanding is that if you have an avahi-daemon running, /etc/resolv.conf is going to be pointing to some sort of private network which is the avahi-daemon. If that fails it'll then query the router DNS to see if the machine exists (the default if you don't have an avahi-daemon). The problem, I think, is that the Ubiquiti feature is masking .local some of the time, for the same reason it sometimes provides the wrong hostname.

johnmckerrell commented 4 years ago

3.14. Host Name Option

This option specifies the name of the client. The name may or may not be qualified with the local domain name

Well all I'm wondering is if the device is telling the router that it is foo.local and the router is then reporting this back, but I'm unclear on whether the documentation above (from the RFC) just means "when you later try to use this hostname it may or may not be qualified with the local domain name" or does it mean "you can pass a domain name in with the hostname". I would expect the former really.

Just to confirm, the router has its domain set to localdomain so it "shouldn't" be trying to do anything with the .local domain, unless as it says it is being told this by things requesting DHCP leases and then reporting that back out.
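
For what it's worth, the wire format of that option does not distinguish between the two readings. A rough sketch of how a DHCP option is laid out (one byte of option code, one byte of length, then the value), using code 12 for Host Name as given in RFC 2132 section 3.14, is below. The function names are invented for illustration, and nothing here says what the USG actually does with the value it receives.

```python
HOST_NAME = 12  # option code for Host Name, RFC 2132 section 3.14


def encode_option(code: int, value: bytes) -> bytes:
    """Encode one DHCP option as code, length, value."""
    if not 0 < code < 255:
        raise ValueError("code must be 1..254 (0 is pad, 255 is end)")
    if not 1 <= len(value) <= 255:
        raise ValueError("value must be 1..255 bytes long")
    return bytes([code, len(value)]) + value


def host_name_option(name: str) -> bytes:
    """Encode a client hostname, qualified or not, as DHCP option 12."""
    return encode_option(HOST_NAME, name.encode("ascii"))


# Both a bare and a qualified name are just bytes as far as the option goes.
print(host_name_option("foo"))
print(host_name_option("foo.local"))
```

So a client could send foo.local in the hostname option; whether the router then strips, keeps or re-serves the suffix is a separate question that would need testing on the router itself.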

MatthewCroughan commented 4 years ago

@johnmckerrell My understanding is that outside of mDNS the device requests an IP and gives a hostname. The hostname that is given is usually specified in /etc/hosts like so:

127.0.0.1       localhost.local localhost       thinkpad
::1             localhost.local localhost       thinkpad

If I chose to request localhost.lan then ping thinkpad.lan should respond with the ip of my machine.

My theory is that this is the first thing that the router's feature caches, in the same way that Sams-Iphone.localdomain was causing a problem, it is returning .localdomain some of the time rather than allowing mDNS responses all of the time if both parties have an mDNS daemon.

This might still come down to your personal machine's configuration too. Since theoretically the mDNS daemon should be the first query, then the router's dns, but this may not be happening everywhere.

MatthewCroughan commented 4 years ago

@johnmckerrell After following this, I've got it working on my laptop. For some reason mqtt.local now returns an ipv6 address, whereas I believe I saw on @amcewen's machine it returns an ipv4 address. It all comes down to one's client configuration, which is actually really disappointing since it seems to vary so much between even two installations of Ubuntu.

https://unix.stackexchange.com/questions/43762/how-do-i-get-to-use-local-hostnames-with-arch-linux

The configuration in question is in /etc/nsswitch.conf

Configuration before following the guide:
hosts: files mymachines myhostname resolve [!UNAVAIL=return] dns

Configuration after following the guide, fixes it:
hosts: files mdns_minimal [NOTFOUND=return] dns myhostname
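
To compare machines quickly, the hosts: line can be split into its services and checked for whether an mDNS module is consulted before plain dns. This is a minimal sketch: `parse_hosts_line` and `mdns_before_dns` are hypothetical helper names, and the parser assumes each bracketed action contains no spaces, which holds for the two lines quoted above.

```python
def parse_hosts_line(line: str) -> list[tuple[str, list[str]]]:
    """Split an nsswitch.conf 'hosts:' line into (service, actions) pairs."""
    database, _, rest = line.partition(":")
    if database.strip() != "hosts":
        raise ValueError("not a hosts: line")
    entries: list[tuple[str, list[str]]] = []
    for token in rest.split():
        if token.startswith("[") and token.endswith("]"):
            if not entries:
                raise ValueError("action before any service")
            # An action such as [NOTFOUND=return] applies to the service
            # immediately before it.
            entries[-1][1].append(token[1:-1])
        else:
            entries.append((token, []))
    return entries


def mdns_before_dns(line: str) -> bool:
    """True if an mdns* service is listed, and listed before plain dns."""
    services = [service for service, _ in parse_hosts_line(line)]
    mdns = [i for i, s in enumerate(services) if s.startswith("mdns")]
    if not mdns:
        return False
    return "dns" not in services or mdns[0] < services.index("dns")


BEFORE = "hosts: files mymachines myhostname resolve [!UNAVAIL=return] dns"
AFTER = "hosts: files mdns_minimal [NOTFOUND=return] dns myhostname"

print(mdns_before_dns(BEFORE), mdns_before_dns(AFTER))
```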

MatthewCroughan commented 4 years ago

MDNS also works just fine on the Vinyl Cutter pc, though there is some strange behaviour that I think is related to the wifi.

Discovery of mDNS on the vinyl cutter pc is strangely intermittent. I can't recreate it exactly, but I did observe it.

If I execute ping mqtt.local it will take some time (around 5 seconds) to resolve it. This will sometimes fail. Though after succeeding once it has no issue resolving subsequently. It will fail to resolve if the system were brought out of hibernation, but will work if you probe it enough.

This failure to resolve and massive resolve delay is not true of pinging the IP address of the machine directly, so it's definitely an mDNS related issue, whether that's down to configuration or the wifi hardware being slow. I do notice that the system has a massively variant ping response time when pinging local addresses. Pinging the router will result in anywhere from 10ms to 262ms.
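
To put numbers on the intermittent failures and the delay, repeated lookups can be timed. The sketch below is one way to do it and makes some assumptions: `time_lookups` is a made-up name, it uses `socket.gethostbyname` by default (so IPv4 only, going through whatever the system's resolver configuration is), and the resolver is a parameter so that something else can be swapped in.

```python
import socket
import time
from typing import Callable


def time_lookups(
    name: str,
    attempts: int = 5,
    resolver: Callable[[str], object] = socket.gethostbyname,
) -> list[tuple[bool, float]]:
    """Resolve name repeatedly and return (succeeded, seconds) per attempt."""
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolver(name)
            ok = True
        except OSError:  # socket.gaierror is a subclass of OSError
            ok = False
        results.append((ok, time.perf_counter() - start))
    return results
```

Calling `time_lookups("mqtt.local")` straight after waking the machine from hibernation, and again a minute later, should show whether the first attempt really is the slow or failing one.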

The configuration of /etc/nsswitch.conf on that machine which is a fresh Ubuntu 19.04 is:
hosts: files mdns4_minimal [NOTFOUND=return] dns

and it returns ipv4 addresses for all .local addresses. This is due to mdns4_minimal, as I tried switching it to mdns_minimal. I later discovered that this obviously means ipv4 explicitly.

https://askubuntu.com/questions/843943/how-to-replace-mdns4-minimal-with-bind

This gives us all the details related to what the different possible configurations are.

MatthewCroughan commented 4 years ago

I've checked on Arthur's Win10 laptop, and it also seems to work. It returns IPv6 addresses. The same was not true however of my Win10 virtual machine until I enabled the avahi-daemon on the host machine, which is very interesting to me. I'm not sure I understand what's happening there.