TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server

https://www.thethingsindustries.com/stack/

Apache License 2.0


Cluster and Service discovery #138

Closed htdvisser closed 4 years ago

htdvisser commented 5 years ago

Summary:

It would be really nice if we could do some kind of cluster/service discovery.

Why do we need this?

It would be really helpful for users if they could get a list of public clusters to choose from.
Every cluster may have been deployed in a different way. There may be different domains, ports, TLS/non-TLS connections may or may not be available, etc.

What is already there? What do you see now?

Clusters are currently not registered, but there are ideas/plans to register clusters in the Identity Server: #143
We currently have default ports, and default configuration.

What is missing? What do you want to see?

Assuming that clusters will be registered in the Identity Server, we either need to put service information in there, or we need a different mechanism to discover service information.

How do you propose to implement this?

In https://github.com/TheThingsIndustries/lorawan-stack/issues/1131 I suggested using DNS SRV records for service discovery.

A TTN cluster eu-west.thethings.network could be discovered through SRV records in DNS:

Field	Value
_service._proto.name.	`_ttn-v3-{gs,ns,as,js}-{grpc,http,mqtt}._{tcp,tls}.eu-west.thethings.network.`
TTL	`<ttl>`
class	`IN`
priority	`<priority>`
weight	`<weight>`
port	`8884`/`1884`/`443`/`80`/`8883`/`1883`/...
target	domain

_ttn-v3-gs-grpc._tls.eu-west.thethings.network. <ttl> IN SRV <priority> <weight> 8884 eu-west.thethings.network.
_ttn-v3-ns-grpc._tls.eu-west.thethings.network. <ttl> IN SRV <priority> <weight> 8884 eu-west.thethings.network.
_ttn-v3-as-grpc._tls.eu-west.thethings.network. <ttl> IN SRV <priority> <weight> 8884 eu-west.thethings.network.
_ttn-v3-js-grpc._tls.eu-west.thethings.network. <ttl> IN SRV <priority> <weight> 8884 eu-west.thethings.network.
_ttn-v3-gs-grpc._tcp.eu-west.thethings.network. <ttl> IN SRV <priority> <weight> 1884 eu-west.thethings.network.
_ttn-v3-ns-grpc._tcp.eu-west.thethings.network. <ttl> IN SRV <priority> <weight> 1884 eu-west.thethings.network.
_ttn-v3-as-grpc._tcp.eu-west.thethings.network. <ttl> IN SRV <priority> <weight> 1884 eu-west.thethings.network.
_ttn-v3-js-grpc._tcp.eu-west.thethings.network. <ttl> IN SRV <priority> <weight> 1884 eu-west.thethings.network.
...

Note that in this example, the SRV records only indicate a port mapping for the services exposed by the cluster (so these records will only have to be set once for a deployment). The target eu-west.thethings.network is assumed to be a load balancer in front of the cluster. Alternatively we could use the SRV records for load balancing.
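As a rough illustration of how a client might consume the records proposed above, here is a minimal Go sketch built on the standard library's `net.Resolver.LookupSRV`. The helper names `srvService`, `srvName` and `discover` are hypothetical and not part of the stack, and the `_tls` label follows the naming proposed in this issue rather than any registered protocol label.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strconv"
	"strings"
)

// srvService returns the service label from the proposal without the leading
// underscore, for example "ttn-v3-gs-grpc". (Hypothetical helper.)
func srvService(component, protocol string) string {
	return fmt.Sprintf("ttn-v3-%s-%s", component, protocol)
}

// srvName returns the fully qualified record name that would be queried.
func srvName(component, protocol, transport, cluster string) string {
	return fmt.Sprintf("_%s._%s.%s.", srvService(component, protocol), transport, strings.TrimSuffix(cluster, "."))
}

// discover looks up the SRV records for one component and returns the
// "host:port" of the first record. LookupSRV already orders the records by
// priority and randomizes them by weight.
func discover(ctx context.Context, r *net.Resolver, component, protocol, transport, cluster string) (string, error) {
	_, records, err := r.LookupSRV(ctx, srvService(component, protocol), transport, cluster)
	if err != nil {
		return "", err
	}
	if len(records) == 0 {
		return "", fmt.Errorf("no SRV records for %s", srvName(component, protocol, transport, cluster))
	}
	first := records[0]
	host := strings.TrimSuffix(first.Target, ".")
	return net.JoinHostPort(host, strconv.Itoa(int(first.Port))), nil
}

func main() {
	// Only the name is printed here, so the example runs without network access.
	fmt.Println(srvName("gs", "grpc", "tls", "eu-west.thethings.network"))
}
```

A caller would pass `net.DefaultResolver` to `discover` and could fall back to default ports when the lookup returns an error.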

We may have to expose some extra information if the GS has multiple UDP endpoints that may have different behavior (different frequency plan for example). This can be discussed later, as I think this is a pretty advanced use case.

What can you do yourself and what do you need help with?

Let's first think about cluster registration in the Identity Server: #143

johanstokking commented 5 years ago

> We may have to expose some extra information if the GS has multiple UDP endpoints that may have different behavior (different frequency plan for example). This can be discussed later, as I think this is a pretty advanced use case.

I think we should have an idea for this because we'll use this quite a lot in PCN. But do we need those endpoints as SRV records in the first place?

htdvisser commented 5 years ago

> But do we need those endpoints as SRV records in the first place?

Do we need those endpoints to be registered in the first place? We do need a port mapping if non-default ports are used. Sure, we could specify that default port numbers are used if SRV records aren't present, but this would enable (encourage/not punish) client/SDK developers who take shortcuts by assuming that default ports are always used, which would result in those clients/SDKs not supporting clusters that use non-default ports.

Do we need those endpoints as SRV records? No, we can also just register all endpoints in the Identity Server's Entity Registry, but then it becomes a CPOF like v2's Discovery Server. Or we could define a gRPC or HTTP endpoint (that is always on 80/443) for listing the ports for the services.

htdvisser commented 5 years ago

@johanstokking Let's revive this issue. As discussed, we should determine to what extent we can and want to align this with the LoRa Alliance DNS for JoinEUIs and NetIDs.

The LoRaWAN Backend Interfaces 1.0 specification (which I assume we still plan to support) specifies that for a JoinEUI of 00005E100000002F a server shall do a NAPTR lookup on f.2.0.0.0.0.0.0.0.1.e.5.0.0.0.0.joineuis.lora-alliance.org.

The result of this lookup is:

class	type	order	pref	flags	service	regexp	replacement
IN	NAPTR	50	50	S	LWN	(empty)	_lwn.operator.com
IN	NAPTR	90	50	S	LWNS	(empty)	_lwn.operator.com

The flags indicate the next lookup to perform:

flag	next action
S	SRV lookup of _lwn.operator.com
A	A, AAAA or A6 lookup of _lwn.operator.com
U	Use _lwn.operator.com as URI
P	Additional NAPTR lookup

The service means:

service	meaning
LWN	LoRaWAN server using HTTP
LWNS	LoRaWAN server using HTTPS
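The lookup name in the example above is the JoinEUI written as hex digits in reverse order, one digit per DNS label, followed by the zone. A small Go sketch of that construction, where the function name `joinEUILookupName` is made up for illustration:

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// joinEUILookupName builds the DNS name for a JoinEUI by reversing its 16 hex
// digits, lowercasing them, and appending the zone. (Hypothetical helper.)
func joinEUILookupName(joinEUI, zone string) (string, error) {
	if len(joinEUI) != 16 {
		return "", fmt.Errorf("JoinEUI must be 16 hex digits, got %d", len(joinEUI))
	}
	if _, err := hex.DecodeString(joinEUI); err != nil {
		return "", fmt.Errorf("invalid JoinEUI: %w", err)
	}
	digits := strings.ToLower(joinEUI)
	labels := make([]string, 0, len(digits)+1)
	// Walk the digits from last to first so the least significant nibble
	// becomes the leftmost label.
	for i := len(digits) - 1; i >= 0; i-- {
		labels = append(labels, string(digits[i]))
	}
	labels = append(labels, zone)
	return strings.Join(labels, "."), nil
}

func main() {
	name, err := joinEUILookupName("00005E100000002F", "joineuis.lora-alliance.org")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Prints f.2.0.0.0.0.0.0.0.1.e.5.0.0.0.0.joineuis.lora-alliance.org
	fmt.Println(name)
}
```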

I'm assuming that the LoRa Alliance isn't going to operate/administer all of those DNS records, so I think we'll get an NS record for our EUI prefix. This likely means that we will control 5.d.3.b.0.7.joineuis.lora-alliance.org and have to operate/administer it. I don't think it's a good idea to do this in our existing DNS server, so I looked into self-hosting authoritative DNS servers. It turns out to be relatively easy to implement with github.com/miekg/dns, which is also used by coredns (k8s) and skydns/consul (which we can also look into).

Since we're already planning to register clusters in the identity server (#143) I think it would be a good idea to add a list of JoinEUI prefixes to this registration. Our DNS servers could then periodically fetch the full list of clusters, their addresses and JoinEUI prefixes, and serve DNS records for these. We could even consider skipping this extra component and exposing DNS directly on the Identity Servers.

The DNS lookup described in the LoRaWAN Backend Interfaces specification ends with an IP address of the Join Server for a JoinEUI. I don't think this is sufficient for us, since we (1) want to use gRPC instead of the Backend Interfaces API and (2) need to know a port number to connect to. In my opinion the perfect way to publish this kind of information using DNS is to use SRV records as described in the original issue above.

johanstokking commented 5 years ago

> The LoRaWAN Backend Interfaces 1.0 specification (which I assume we still plan to support) specifies that for a JoinEUI of 00005E100000002F a server shall do a NAPTR lookup on f.2.0.0.0.0.0.0.0.1.e.5.0.0.0.0.joineuis.lora-alliance.org.

NAPTR records are dropped in 1.1. That version is still in draft until members start implementing it (fully, including hand-over roaming) and confirm that they did not encounter issues. So that's going to take a while. Until then, we should focus on 1.1 and not spend time on functionality that we know becomes obsolete.

On top of that, as far as I know, members make little to no use of DNS lookup at the moment, partly because of this complexity, partly because they have out-of-band agreements anyway and partly because of little technical DNS support from the LoRa Alliance.

> I'm assuming that the LoRa Alliance isn't going to operate/administer all of those DNS records, so I think we'll get an NS record for our EUI prefix. This likely means that we will control 5.d.3.b.0.7.joineuis.lora-alliance.org and have to operate/administer it.

DNS delegation is one of the topics that is continuously being pushed forward. So far, what's in draft 1.1, is CNAME and A records only. Let's not bring any assumptions to the equation at this moment.

> Since we're already planning to register clusters in the identity server (#143) I think it would be a good idea to add a list of JoinEUI prefixes to this registration. Our DNS servers could then periodically fetch the full list of clusters, their addresses and JoinEUI prefixes, and serve DNS records for these. We could even consider skipping this extra component and exposing DNS directly on the Identity Servers.
>
> The DNS lookup described in the LoRaWAN Backend Interfaces specification ends with an IP address of the Join Server for a JoinEUI. I don't think this is sufficient for us, since we (1) want to use gRPC instead of the Backend Interfaces API and (2) need to know a port number to connect to. In my opinion the perfect way to publish this kind of information using DNS is to use SRV records as described in the original issue above.

I suggest going with the following phases:

1. LoRaWAN Backend Interfaces 1.1 messages and A/CNAME DNS lookup. It's ugly but it's not broken; today, there is no technical reason to prefer our gRPC API (reason 1 above). According to the 1.1 specification, there is no room for ports (reason 2 above); it's port 443 and the root path. The server is done (#117) and the client is planned (#833 will get a standard spec client with options).
2. The out-of-band configuration we need with phase 1 is authentication. We're pushing this in the TC to standardize and there are a few solution directions. Until then, this unfortunately has to be manual: clients keep a server CA and a client certificate and key per JoinEUI prefix.
3. If phase 2 takes too long, we may skip it and add DNS records for service discovery (SRV) and authentication (DANE), which gives us discovery of our gRPC endpoints and authentication for Backend Interfaces and gRPC.

htdvisser commented 5 years ago

> NAPTR records are dropped in 1.1. [...] Until then, we should focus on 1.1 and not spend time on functionality that we know becomes obsolete.

So does this mean we will not implement 1.0 at all?

> DNS delegation is one of the topics that is continuously being pushed forward. So far, what's in draft 1.1, is CNAME and A records only. Let's not bring any assumptions to the equation at this moment.

Then I guess I was confused by the spec mentioning NS records:

> The NetID will be provisioned in the zone “NETIDS.lorawan.net”. The resource corresponding to the NetID could be provisioned in different DNS resource record formats (such as NS, CNAME, A, AAAA).
>
> [...]
>
> Similarly, the JoinEUI could be provisioned in the zone “JOINEUIS.lorawan.net” with different DNS resource record formats based on the requirements as follows:

> I suggest going with the following phases [...]

I thought we previously already concluded that the first phase for the Backend Interfaces Join flow would be a configuration file or repository. I came up with something like this:

- name: Name of the Join Configuration
  prefixes:
  - 0000000000000000/00
  - 0000000000000000/00
  # in case of DNS lookup:
  dns:
    resolver: 1.1.1.1
    records: CNAME
  # in case of static config:
  static:
    host: hostname.tld
    port: 1234
  protocol: ttn.lorawan-stack.v3 # or backend-interfaces-1.0, backend-interfaces-1.1, ...
  # in case of basic auth:
  basic_auth:
    username: username
    password: password
  # in case of token auth:
  bearer_token: XXX
  # in case of TLS:
  tls_config:
    ca_file: ...
    cert_file: ...
    key_file: ...
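To illustrate how the `prefixes` entries in such a configuration could be matched against a JoinEUI, here is a minimal Go sketch. It assumes the `/NN` suffix is a prefix length in bits, as in CIDR notation; the type `euiPrefix`, its methods, and the example prefix `70B3D57ED0000000/36` are hypothetical and only serve the illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// euiPrefix is a 64-bit EUI value with a prefix length in bits.
// (Hypothetical type; the bit-length interpretation is an assumption.)
type euiPrefix struct {
	value uint64
	bits  uint
}

// parsePrefix parses strings of the form "0000000000000000/00".
func parsePrefix(s string) (euiPrefix, error) {
	hexPart, lenPart, ok := strings.Cut(s, "/")
	if !ok || len(hexPart) != 16 {
		return euiPrefix{}, fmt.Errorf("invalid prefix %q", s)
	}
	value, err := strconv.ParseUint(hexPart, 16, 64)
	if err != nil {
		return euiPrefix{}, fmt.Errorf("invalid EUI in %q: %w", s, err)
	}
	bits, err := strconv.ParseUint(lenPart, 10, 8)
	if err != nil || bits > 64 {
		return euiPrefix{}, fmt.Errorf("invalid prefix length in %q", s)
	}
	return euiPrefix{value: value, bits: uint(bits)}, nil
}

// matches reports whether the first p.bits bits of eui equal the prefix.
func (p euiPrefix) matches(eui uint64) bool {
	if p.bits == 0 {
		return true // a zero-length prefix matches every JoinEUI
	}
	shift := 64 - p.bits
	return eui>>shift == p.value>>shift
}

func main() {
	p, err := parsePrefix("70B3D57ED0000000/36")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(p.matches(0x70B3D57ED0001234))
}
```

A Join Server client could walk its configured entries in order and use the first entry with a matching prefix, or the longest match; which of the two is preferable is not settled in this discussion.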

We are however getting a bit off-topic for this issue. The goal of this issue is to come up with a mechanism for cluster discovery and for getting port+protocol configuration from domain name of a cluster deployment, so that (for example) the network_server_address of an end device registration can be resolved to the gRPC or HTTP endpoint of the Network Server.

I think it would be nice if we can do this in DNS and if it can be aligned with the DNS mechanism that is described in the Backend Interfaces spec. But I can also just start implementing all of this as RPCs in the Identity Server while we figure out if and how we want to expose this through DNS.

johanstokking commented 5 years ago

> So does this mean we will not implement 1.0 at all?

We cherry pick from Backend Interfaces like other members. We don't do hand over roaming, we do (stateless) passive roaming in Packet Broker, we don't do 1.0 NAPTR records, we do 1.1 DNS lookup, we do support 1.0 and 1.1 messages for the flows that we implement, etc.

> Then I guess I was confused by the spec mentioning NS records:
>
> > The NetID will be provisioned in the zone “NETIDS.lorawan.net”. The resource corresponding to the NetID could be provisioned in different DNS resource record formats (such as NS, CNAME, A, AAAA).
> >
> > [...]
> >
> > Similarly, the JoinEUI could be provisioned in the zone “JOINEUIS.lorawan.net” with different DNS resource record formats based on the requirements as follows:

DNS delegation is certainly on the roadmap and 1.1 opens the door for it, but we don't have to operate/administer it (for now) so we don't have to set that all up. In practice, there's no support for it from LoRa Alliance nor Afnic (yet), so even if we would have that in place, we can't use it.

> I thought we previously already concluded that the first phase for the Backend Interfaces Join flow would be a configuration file or repository. I came up with something like this [...]

Yes, that fits nicely with my phases 1 and 2 and should be part of #833 (cc @rvolosatovs)

> We are however getting a bit off-topic for this issue. The goal of this issue is to come up with a mechanism for cluster discovery and for getting port+protocol configuration from domain name of a cluster deployment, so that (for example) the network_server_address of an end device registration can be resolved to the gRPC or HTTP endpoint of the Network Server.
>
> I think it would be nice if we can do this in DNS and if it can be aligned with the DNS mechanism that is described in the Backend Interfaces spec. But I can also just start implementing all of this as RPCs in the Identity Server while we figure out if and how we want to expose this through DNS.

I just don't think it should be aligned to Backend Interfaces, if we go for the DNS approach.

Also, in (private) networks we need this cluster discovery as well, and it's going to be pretty hard to impose DNS there knowing some of their enterprise environments. So making this part of IS (and potentially keeping DNS records from there) may be the best way to go.

htdvisser commented 5 years ago

Blocked on #143

johanstokking commented 5 years ago

There's some groundwork in #1392 to at least fall back to the default ports if the target doesn't contain any.

The middleware introduced in pkg/rpcmiddleware/discover is intended to contain the implementation for this issue.

The proposal here contains SRV records per component, which means that discover.WithTransportCredentials() and discover.WithInsecure() should take a ttnpb.ClusterRole. I did not account for this yet, but this is not hard to add. It's just that callers may not reuse connections anymore and keep them separate per component. As long as we don't have a final solution with service discovery per component in place and we know exactly what we want, let's not prematurely account for that.
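The default-port fallback mentioned above could look roughly like the following Go sketch. It is not the code from #1392; the function `withDefaultPort` is hypothetical, and the defaults 8884 (TLS) and 1884 (non-TLS) are taken from the gRPC examples earlier in this issue.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// Default gRPC ports, as used in the SRV examples in this issue.
const (
	defaultGRPCPortTLS = "8884"
	defaultGRPCPort    = "1884"
)

// withDefaultPort returns target unchanged if it already contains a port, and
// appends the default gRPC port otherwise. (Hypothetical helper.)
func withDefaultPort(target string, useTLS bool) string {
	host, port, err := net.SplitHostPort(target)
	if err == nil && port != "" {
		return target
	}
	if err != nil {
		// No port present: treat the whole target as the host and strip the
		// brackets of a bare IPv6 literal so JoinHostPort can re-add them.
		host = strings.TrimSuffix(strings.TrimPrefix(target, "["), "]")
	}
	port = defaultGRPCPort
	if useTLS {
		port = defaultGRPCPortTLS
	}
	return net.JoinHostPort(host, port)
}

func main() {
	fmt.Println(withDefaultPort("eu-west.thethings.network", true))
	fmt.Println(withDefaultPort("eu-west.thethings.network:443", true))
}
```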

johanstokking commented 5 years ago

Additional groundwork in #1442 is the pkg/rpcmiddleware/discover.DialContext() that is going to discover services on the target and dial the right address with the right dial options.

We may want to consider adding all default dial options there, instead of requiring callers to set them. This also makes it easier to vary them based on the discovered result.

johanstokking commented 4 years ago

#2763 works, but the TLS server certificate is validated with the initial gRPC dial target, not the target dialed by the dialer via the discover dial option. So dialing SRV targets works, but TLS server certificate validation fails:

WARNING: 2020/06/19 21:59:32 grpc: addrConn.createTransport failed to connect to {0.0.0.0.0.0.0.d.e.7.5.d.3.b.0.7.join.thethings.industries  <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for join.cloud.thethings.industries, *.join.cloud.thethings.industries, not 0.0.0.0.0.0.0.d.e.7.5.d.3.b.0.7.join.thethings.industries". Reconnecting...

I'll try figuring out a way to do the SRV lookup not in the dialer but before that, hopefully as an alternative dial option, otherwise before dialing. Validating the peer certificate using another SRV lookup doesn't seem like a good option.

johanstokking commented 4 years ago

Closed by #2779