Default interface might not be useful

hedss commented 7 years ago

There is currently an issue where, if the default interface is not the one a resinOS device is connected to, the device will never show up.

We can fix this by ensuring that every interface has a Bonjour instance to bind against.
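As a rough sketch of the first step (finding the interfaces to create instances for), the helper below filters the map returned by Node's os.networkInterfaces() down to external IPv4 addresses. The name externalIPv4Addresses and the IfaceInfo type are hypothetical, introduced only for illustration; this is not code from the module.

```typescript
// Minimal shape of one entry in the map returned by os.networkInterfaces().
interface IfaceInfo {
  address: string;
  family: string | number; // usually 'IPv4'/'IPv6'; some Node 18.x releases report 4/6
  internal: boolean;
}

// Hypothetical helper: collect every non-loopback IPv4 address, i.e. one
// candidate `interface` value per mDNS instance.
function externalIPv4Addresses(
  ifaces: { [name: string]: IfaceInfo[] | undefined }
): string[] {
  const addresses: string[] = [];
  for (const name of Object.keys(ifaces)) {
    for (const info of ifaces[name] || []) {
      const isIPv4 = info.family === 'IPv4' || info.family === 4;
      if (isIPv4 && !info.internal) {
        addresses.push(info.address);
      }
    }
  }
  return addresses;
}
```

Under Node this would be called as externalIPv4Addresses(require('os').networkInterfaces()), and each returned address could then be tried as the interface option of a separate instance.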

hedss commented 7 years ago

I've spent a considerable amount of time looking into this, and there are several issues.

MDNS (obviously) uses Multicast groups to enable the listening/transmission of traffic for all nodes in a local network. We use what is now the fork of bonjour because it appeared to be a full implementation of MDNS and DNS-SD under Node. Several issues have arisen since that have required patching, but ultimately this module itself relies on the multicast-dns module for all MDNS operations. Until now, I've not really looked at it.

The comment for the interface parameter for the constructor of multicast-dns is: interface: '192.168.0.2' // explicitly specify a network interface. defaults to all

This appears to be a misunderstanding of the documentation for Node's socket.bind() call (which specifies that if no address is given, it will listen on all addresses). Whilst true for Unicast binds, this is not true for Multicast group membership: if no specific interface address is given, INADDR_ANY is used (see the uv__udp_set_membership4 function in libuv, which is the underlying platform library NodeJS uses). It is then essentially up to the OS to pick the first interface it deems suitable; see here, which specifies: You can always fill this last member with the wildcard address (INADDR_ANY) and then the kernel will deal with the task of choosing the interface.
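To make the membership point concrete: Node's dgram documentation says that socket.addMembership(multicastAddress[, multicastInterface]) lets the OS choose one interface when the second argument is omitted, and that joining on every interface means calling it once per interface. A minimal sketch of that loop follows. The MembershipSocket type and the joinOnEachInterface name are hypothetical, and whether this alone is enough for the receive path on every platform is an assumption, not something verified here.

```typescript
// The only part of Node's dgram.Socket this sketch needs; a real UDP socket
// created with dgram.createSocket() satisfies it.
interface MembershipSocket {
  addMembership(multicastAddress: string, multicastInterface?: string): void;
}

const MDNS_GROUP = '224.0.0.251'; // IPv4 mDNS multicast group

// Hypothetical helper: join the mDNS group explicitly on each interface
// address rather than leaving the choice to the kernel via INADDR_ANY.
// Returns the addresses for which the join threw.
function joinOnEachInterface(
  socket: MembershipSocket,
  addresses: string[]
): string[] {
  const failed: string[] = [];
  for (const address of addresses) {
    try {
      socket.addMembership(MDNS_GROUP, address);
    } catch {
      // e.g. the interface went away or membership already exists
      failed.push(address);
    }
  }
  return failed;
}
```

With a real socket this would run after the socket has been bound, passing the list of local IPv4 addresses to join on.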

The ultimate upshot is that, if multiple interfaces are being used on a host, rdt scan will only scan the primary interface (as this is the one the OS will use for Multicast group membership).

So whilst this is not ideal, it initially does not seem a massive problem, as we could create multiple instances of the bonjour module with specific interfaces assigned. Unfortunately, it gets trickier here depending on which operating system you're using.

Under OSX (which I primarily develop on), we have Apple's mDNSResponder process which is launched on startup and immediately binds to the MDNS multicast port. Whilst it correctly uses the SO_REUSEPORT socket option to allow other processes to bind to port 5353 (the MDNS port), it's essentially run as a root process, so the only way we can share the port in a specific interface case is to also run as root (non-ideal) or use INADDR_ANY (the default for multicast-dns) which gets the kernel to take care of it by assigning to the primary interface.

Under Linux, there's a similar issue with the avahi-daemon, should it be running.

However, even with mDNSResponder/avahi-daemon disabled, there still appears to be an issue in the underlying socket code: a socket can bind to an interface and send data to the correct Multicast group, but it does not receive any data back from it.

To test this, I created a very small test programme (CoffeeScript incoming...):

mcast = require('multicast-dns')

mdns = mcast({ interface: '192.168.2.101' }) # A secondary NIC

mdns.on 'query', (packet, rinfo) ->
    console.log(packet)
    console.log(rinfo)

mdns.on 'response', (packet, rinfo) ->
    console.log(packet)
    console.log(rinfo)

mdns.on 'error', (error) ->
    console.log(error)

mdns.query([ { name: '_resin-device._sub._ssh._tcp.local', type: 'PTR' } ])

This uses the multicast-dns module to send a query for any resinOS devices on the local network of which '192.168.2.101' is a member (this is a secondary interface on the Linux machine I'm experimenting on).

I used tcpdump on the Linux machine, and 'Wireshark' on an independent machine also connected to that network, to monitor for MDNS traffic. When running the test programme, I see the following from both machines:

12:04:43.302900 IP 192.168.2.101.mdns > 224.0.0.251.mdns: 0 PTR (QM)? _resin-device._sub._ssh._tcp.local. (52)
12:04:43.383737 IP 192.168.2.102.mdns > 224.0.0.251.mdns: 0*- [0q] 5/0/0 PTR resin._ssh._tcp.local., (Cache flush) TXT "", (Cache flush) SRV resin.local.:22222 0 0, (Cache flush) AAAA fe80::fb40:7c6f:6e5f:90f9, (Cache flush) A 192.168.2.102 (149)

So, the MDNS query is getting out onto the wire, and the local resinOS device is responding correctly. What isn't happening is the multicast-dns module seeing either the response or the original query (which it should).
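As a cross-check on the capture, the query packet can be rebuilt by hand. The sketch below encodes a single-question PTR query in standard DNS wire format (12-byte header, length-prefixed name labels, QTYPE, QCLASS); for the service name used in the test programme it comes out at 52 bytes, matching the (52) tcpdump reports for the outgoing query. The function names are hypothetical, and the encoder assumes ASCII-only labels.

```typescript
// Encode a DNS name as length-prefixed labels ending in a zero byte.
// Assumes ASCII labels of at most 63 characters.
function encodeName(name: string): number[] {
  const bytes: number[] = [];
  for (const label of name.split('.')) {
    if (label.length === 0) {
      continue; // tolerate a trailing dot
    }
    bytes.push(label.length);
    for (let i = 0; i < label.length; i++) {
      bytes.push(label.charCodeAt(i));
    }
  }
  bytes.push(0); // root label terminates the name
  return bytes;
}

// Build a one-question query: ID 0, flags 0, QDCOUNT 1, other counts 0.
function buildPtrQuery(name: string): number[] {
  const header = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0];
  const QTYPE_PTR = 12;
  const QCLASS_IN = 1; // top bit clear, i.e. a multicast (QM) question
  return header.concat(encodeName(name), [0, QTYPE_PTR, 0, QCLASS_IN]);
}

const query = buildPtrQuery('_resin-device._sub._ssh._tcp.local');
// query.length is 52: 12 (header) + 36 (name) + 4 (type and class)
```

Since the bytes on the wire are what a hand-built query would contain, this supports the reading that the sending side is fine and the problem lies in receiving.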

I can't see anything wrong with the event handler in the module, so I'm coming to the conclusion that there's some issue further down, possibly in the way libuv handles multicast traffic.

Verification with netstat seems to suggest that multicast responses are being dropped by the kernel (which suggests that the socket is indeed sending to the group but not correctly listening on it).

Unfortunately, the current situation is that this is going to take considerably more effort to solve, and is probably going to get into the realms of writing some C using the raw BSD sockets library (and although I've experience with this, it is also going to mean poking around in libuv) to track down what's going on.

A short-term, though not ideal, answer here is to ensure that the interface that a resinOS device is on is the primary network interface. I have tested this under both OSX and Ubuntu and can confirm that prioritising an interface in this manner will allow rdt scan to find them correctly.

hedss commented 7 years ago

Further to this, I now have some C using the sockets library to successfully join and send/receive data from the MDNS multicast group on two separate interfaces, both correctly sending a query and a response. I have an idea as to how we might proceed forward, but it's probably going to involve patching the multicast-dns client.

hedss commented 7 years ago

The resin-io-modules/resin-discoverable-services#find-on-all-interfaces branch now includes the forked resin-io-modules/multicast-dns branch that includes experimental code to perform this.

In fact, it probably just needs to be cleaned up.

balena-io-modules / resin-discoverable-services

Default interface might not be useful #21