Open nihn opened 7 years ago
Thanks for reporting this. It looks like other systems have run into this nifty behavior as well. It's unclear to me if it's also possible to work around the problem by hacking /etc/gai.conf.
https://github.com/weaveworks/weave/issues/1245 https://github.com/hashicorp/consul/issues/1481
Comments suggest that the latest RFC fixes the problems with the sorting as per the spec but that getaddrinfo implementations have been slow to adopt the latest RFC.
https://tools.ietf.org/html/rfc6724 (the latest spec on record sorting)
Consul, in particular, implemented the workaround as suggested by the OP here. It would be useful to understand which Linux distributions are affected by this.
On Wed, Aug 2, 2017 at 5:28 AM, Mateusz Moneta notifications@github.com wrote:
Hello,
we use MesosDNS for load balancing some of our traffic (e.g. between real load balancers) inside the Mesos/Marathon cluster. We ran into an issue where one of the load balancers often gets much more traffic than the others. This is caused by the behavior of the getaddrinfo library call, which sorts the records received from MesosDNS and always returns the same record first. It would be nice to have a feature that restricts the number of answers returned by MesosDNS to one random record. Without it, services started at the same time (which usually happens with Marathon when you restart all of your application tasks at once) always use the same IP for the other services they communicate with.
xref #485
@jdef any news on your side?
Not yet, stay tuned...
additional commentary re: libc implementations here http://www.zytrax.com/books/dns/ch9/rr.html
Hello,
we use `MesosDNS` for load balancing some of our traffic (e.g. between real load balancers) inside the Mesos/Marathon cluster. We ran into an issue where one of the load balancers often gets much more traffic than the others. This is caused by the behavior of the `getaddrinfo` library call (forced by RFC3484), which sorts the records received from `MesosDNS` and always returns the same record first. From documentation:
It would be nice to have a feature that restricts the number of answers returned by `MesosDNS` to one random record. Without it, services started at the same time (which usually happens with Marathon when you restart all of your application tasks at once) always use the same IP for the other services they communicate with.