d2iq-archive / mesos-dns

DNS-based service discovery for Mesos.
https://mesosphere.github.com/mesos-dns
Apache License 2.0

Fault Domains awareness #518

Open vixns opened 6 years ago

vixns commented 6 years ago

Mesos 1.5.0 introduced regions and zones.

http://mesos.apache.org/documentation/latest/fault-domains/

mesos-dns could return them, when defined, as subdomains in SRV responses.

As an example,

_chronos._tcp.marathon.mesos. 10 IN SRV 0 0 8080 chronos-n7z8k-s1.marathon.mesos.

could become

_chronos._tcp.marathon.mesos. 10 IN SRV 0 0 8080 chronos-n7z8k-s1.marathon.zone.region.mesos.

or, to preserve the framework.domain naming convention,

_chronos._tcp.marathon.mesos. 10 IN SRV 0 0 8080 chronos-n7z8k-s1.zone.region.marathon.mesos.
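To make the two layouts concrete, here is a minimal sketch of how the SRV target name could be assembled in each case. The function name `srv_target` and the layout keys are made up for illustration and are not part of mesos-dns; the zone and region values are passed in as plain placeholder labels.

```python
def srv_target(task_host, framework, domain, zone, region, layout):
    """Build an SRV target name under one of the two layouts proposed above.

    Hypothetical helper: the name and the layout keys are illustrative only.
    """
    if layout == "after-framework":
        # chronos-n7z8k-s1.marathon.zone.region.mesos.
        labels = [task_host, framework, zone, region, domain]
    elif layout == "before-framework":
        # chronos-n7z8k-s1.zone.region.marathon.mesos.
        # (keeps the framework.domain suffix intact)
        labels = [task_host, zone, region, framework, domain]
    else:
        raise ValueError(f"unknown layout: {layout}")
    # Trailing dot marks the name as fully qualified.
    return ".".join(labels) + "."


first = srv_target("chronos-n7z8k-s1", "marathon", "mesos",
                   "zone", "region", "after-framework")
second = srv_target("chronos-n7z8k-s1", "marathon", "mesos",
                    "zone", "region", "before-framework")
```

With the second layout the name still ends in `marathon.mesos.`, which is the point of preserving the framework.domain convention.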

It may also be useful to introduce regions and zones in the mesos-dns configuration, so that a mesos-dns instance can be configured to restrict its responses to tasks running in the configured region(s)/zone(s).
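A rough sketch of that filtering idea, assuming a configuration that carries optional allow-lists of regions and zones. The `Task` type, the `visible_tasks` function, and the sample region/zone strings are all hypothetical and do not correspond to existing mesos-dns code or config keys.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical view of a task with its fault domain (None if undefined)."""
    name: str
    region: Optional[str]
    zone: Optional[str]


def visible_tasks(tasks, regions=None, zones=None):
    """Keep only tasks whose fault domain matches the configured allow-lists.

    A value of None for `regions` or `zones` means "no restriction".
    """
    return [
        t for t in tasks
        if (regions is None or t.region in regions)
        and (zones is None or t.zone in zones)
    ]


tasks = [
    Task("chronos-a", "region-1", "zone-a"),
    Task("chronos-b", "region-1", "zone-b"),
    Task("chronos-c", "region-2", "zone-a"),
    Task("legacy", None, None),  # agent without a fault domain
]
only_region_1 = visible_tasks(tasks, regions={"region-1"})
only_r1_zone_a = visible_tasks(tasks, regions={"region-1"}, zones={"zone-a"})
```

Note that in this sketch a task without a fault domain is hidden as soon as any restriction is configured; whether that is the desired behavior would need to be decided.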

jdef commented 6 years ago

Thanks for filing this issue. TL;DR this is easier said than done.

The Mesos design spec makes specific recommendations w/ respect to conventions for sub-regions and sub-zones. In order to accommodate such, mesos-dns would probably need to transform the region/zone strings. This has already proved to be problematic: mesos-dns already transforms task name strings based on some (old) assumptions that tasks are mostly launched by Marathon. We basically can't change it now because we'll break people that depend on the transformation. I'd hate to introduce another transform-dependency like that into the code base.
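To illustrate the kind of trouble a string transformation brings, here is a deliberately simple, hypothetical mapping from an arbitrary string to a DNS label. This is not the transformation mesos-dns actually applies to task names, and the input strings are invented; the sketch only shows that any lossy mapping can make two distinct inputs collide, and that clients end up depending on its exact output.

```python
import re


def as_dns_label(value):
    """Hypothetical lossy mapping of an arbitrary string to a DNS label.

    Lowercases, replaces runs of disallowed characters with '-', trims
    leading/trailing '-', and truncates to the 63-octet label limit.
    """
    label = re.sub(r"[^a-z0-9-]+", "-", value.lower()).strip("-")
    return label[:63]


# Two distinct (made-up) zone strings map to the same label.
a = as_dns_label("Rack_12.Row/3")
b = as_dns_label("rack-12-row-3")
collision = (a == b)
```

Once scripts rely on the output of such a function, changing it (for example to avoid the collision) breaks them, which is the dependency described above.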

Also, it's easier to avoid name collisions when "growing" a string by prefixing/suffixing labels vs. inserting new labels to the middle. We'd probably need to come up with a coding scheme that avoids breaking people, upon upgrades, that want to make use of this but don't want to rewrite all of their existing scripts that depend upon current naming conventions. Solutions for this will probably run into another form of the "transformation problem" described in the first paragraph.

Another, completely different approach would be to serve TXT records that present additional metadata to interested clients. It's probably less convenient (in some respects) than the proposal originally submitted by the OP, but it has the advantage of being completely opt-in and backwards compatible. It wouldn't tell you which tasks are in which zone/region but it WOULD tell you which zones/regions the service is running in.
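A minimal sketch of what the TXT approach could aggregate, assuming the zone/region of each task is known. The function name, the input shape, and the `region=... zone=...` string format are all assumptions for illustration; no TXT format has been agreed on in this thread.

```python
def fault_domain_txt(tasks):
    """Collect the distinct fault domains each service runs in.

    `tasks` is an iterable of (service, region, zone) tuples. Returns a dict
    mapping service name to a sorted list of TXT strings. The string format
    is hypothetical.
    """
    domains = {}
    for service, region, zone in tasks:
        domains.setdefault(service, set()).add(f"region={region} zone={zone}")
    # Per-task detail is intentionally dropped: only the set of
    # zones/regions per service is exposed.
    return {service: sorted(values) for service, values in domains.items()}


txt = fault_domain_txt([
    ("chronos", "region-1", "zone-a"),
    ("chronos", "region-1", "zone-b"),
    ("chronos", "region-1", "zone-a"),  # second task in the same zone
])
```

Existing A and SRV names stay untouched, so clients that do not query the TXT record see no change.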

For guidance here it would probably help to enumerate a concrete list of use cases / workflows that justify exposing the fault zone/region via DNS records.