getdnsapi / stubby

Stubby is the name given to a mode of using getdns which enables it to act as a local DNS Privacy stub resolver (using DNS-over-TLS).
https://dnsprivacy.org/dns_privacy_daemon_-_stubby/
BSD 3-Clause "New" or "Revised" License

round_robin_upstreams: 0, improve algorithm to handle server idle timeouts #105

Open abelbeck opened 6 years ago

abelbeck commented 6 years ago

Our project has added getdns/stubby and it is working quite well, though we discovered an issue.

The issue occurs with round_robin_upstreams: 0: a benign lookup that results in NXDOMAIN causes the logic to switch to the next (secondary) server.

As I understand the docs, with round_robin_upstreams: 0 a server switch should only occur when the current server is "unavailable".

The server switch causes a performance "hiccup", and Stubby then sits on the next (secondary) server, which is not always desirable.

pbx4 ~ # cat /etc/stubby/stubby.yml

resolution_type: GETDNS_RESOLUTION_STUB
dns_transport_list:
  - GETDNS_TRANSPORT_TLS
tls_authentication: GETDNS_AUTHENTICATION_REQUIRED
tls_query_padding_blocksize: 128
edns_client_subnet_private: 1
idle_timeout: 10000
listen_addresses:
  - 127.0.0.1@2853
round_robin_upstreams: 0
upstream_recursive_servers:
  - address_data: 2606:4700:4700::1111
    tls_port: 853
    tls_auth_name: "cloudflare-dns.com"
  - address_data: 1.1.1.1
    tls_port: 853
    tls_auth_name: "cloudflare-dns.com"

pbx4 ~ # stubby -l (lines prefixed with --> are commands run from another terminal)

[15:18:45.799452] STUBBY: Read config from file /etc/stubby/stubby.yml
[15:18:45.801691] STUBBY: DNSSEC Validation is OFF
[15:18:45.801772] STUBBY: Transport list is:
[15:18:45.801837] STUBBY:   - TLS
[15:18:45.801916] STUBBY: Privacy Usage Profile is Strict (Authentication required)
[15:18:45.801951] STUBBY: (NOTE a Strict Profile only applies when TLS is the ONLY transport!!)
[15:18:45.802015] STUBBY: Starting DAEMON....

--> pbx4 ~ # host www0.dnsprivacy.org
--> Host www0.dnsprivacy.org not found: 3(NXDOMAIN).

[15:18:58.437104] STUBBY: 2606:4700:4700::1111                     : Conn opened: TLS - Strict Profile
[15:18:58.514237] STUBBY: 2606:4700:4700::1111                     : Verify passed : TLS
[15:19:07.615742] STUBBY: 2606:4700:4700::1111                     : Conn closed: TLS - Resps=     3, Timeouts  =     0, Curr_auth =Success, Keepalive(ms)= 10000
[15:19:07.615861] STUBBY: 2606:4700:4700::1111                     : Upstream   : TLS - Resps=     3, Timeouts  =     0, Best_auth =Success
[15:19:07.615880] STUBBY: 2606:4700:4700::1111                     : Upstream   : TLS - Conns=     1, Conn_fails=     0, Conn_shuts=      1, Backoffs     =     0

--> pbx4 ~ # host www0.dnsprivacy.org
--> Host www0.dnsprivacy.org not found: 3(NXDOMAIN).

[15:19:13.982237] STUBBY: 1.1.1.1                                  : Conn opened: TLS - Strict Profile
[15:19:14.080672] STUBBY: 1.1.1.1                                  : Verify passed : TLS
[15:19:21.231409] STUBBY: 1.1.1.1                                  : Conn closed: TLS - Resps=     1, Timeouts  =     0, Curr_auth =Success, Keepalive(ms)= 10000
[15:19:21.231495] STUBBY: 1.1.1.1                                  : Upstream   : TLS - Resps=     1, Timeouts  =     0, Best_auth =Success
[15:19:21.231537] STUBBY: 1.1.1.1                                  : Upstream   : TLS - Conns=     1, Conn_fails=     0, Conn_shuts=      1, Backoffs     =     0

This is quite reproducible, but if successful lookups occur in the mix then the switch does not always occur. So testing on a "quiet" box is best.

Possibly an initialization issue in the "next server" state machine?

saradickinson commented 6 years ago

What I notice in the log is that the far end appears to be shutting the connection. A quick test shows that Cloudflare seems to use a 7s idle timer and shuts the connection if it is idle for that long (at least for me). The number of remote connection shutdowns is used in the server selection algorithm, which I believe is what triggers this given the low number of total responses from the server.

I use a 5s idle timer (idle_timeout: 5000) to ensure Stubby always closes the connection first, which keeps Stubby using the same server. Please test this and see if it works for you. (Note, I didn't see any difference between queries that got answers and ones that got NXDOMAIN; it could just be the response times affecting whether the 7s limit is hit or not.)
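
For reference, the change boils down to a single line in the stubby.yml shown above (a sketch of the relevant excerpt only; all other settings stay as posted):

round_robin_upstreams: 0
# Close idle TLS connections after 5s, before Cloudflare's ~7s server-side
# idle timer fires, so the close is client-initiated and the Conn_shuts=
# counter used by the server selection algorithm does not increase.
idle_timeout: 5000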

abelbeck commented 6 years ago

@saradickinson Yes, idle_timeout: 5000 keeps Conn_shuts= from happening and the first (primary) server remains selected. Thanks!

BTW, Quad9 appears to use a 12-13 second idle timeout at their server end. I can reproduce the same issue with Quad9 by using idle_timeout: 15000. Additionally, I seem to recall Cloudflare used a 1-2 second idle timeout at the server end a week or two ago.

IMHO, reducing to idle_timeout: 5000 is not ideal; our users would like to use 10 seconds or more if the server supports it, and the end-user should not have to tweak idle_timeout based on the provider. Possibly getdns's heuristic for non-zero Conn_shuts= with round_robin_upstreams: 0 could be tweaked?

Final question: with round_robin_upstreams: 0, once the next server is selected, is there any sort of configurable timer for stubby to retry the first server, or is it stuck on the 2nd server until a "next server" event occurs again?

saradickinson commented 6 years ago

@abelbeck I've seen that with Quad9 too, but I didn't know Cloudflare had used such a small value recently.

I agree that the current algorithm is rather rigid, uses some arbitrary parameters, and could be improved. I know that @wtoorop is completely re-working the way upstreams are managed for the next release, so it probably makes sense to look at what (if anything) has changed there for this logic and implement improvements there.

As for the current behaviour, I don't believe there is a timer; the intention of that option is to use one server until it is unavailable. I guess I'm interested why you don't want to use round_robin_upstreams: 1, which is much more flexible and more robust about retrying failed servers.

abelbeck commented 6 years ago

I'm interested why you don't want to use round_robin_upstreams: 1, which is much more flexible and more robust about retrying failed servers.

@saradickinson In my testing I found round_robin_upstreams: 0 offers the best performance with anycast servers like Quad9 and Cloudflare. With round_robin_upstreams: 1 the number of TLS connection setups is multiplied by the number of upstream_recursive_servers compared to a value of 0. Also, anecdotally, the resolver caching of these anycast providers seems better with round_robin_upstreams: 0.

Our project is a PBX/router/firewall at a fixed point, so typical mobile/roaming issues do not apply, and failed anycast servers are not a general problem.

Additionally, with a native IPv6 endpoint, the IPv6 servers of either Quad9 or Cloudflare offer the lowest latency for my location, so I want to prefer a primary IPv6 DNS-TLS server and fall back to a secondary IPv4 DNS-TLS server if the upstream IPv6 path has a hiccup, then return to the primary server after a period of time.
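
For illustration, the ordering already shown in the stubby.yml above expresses this preference when round_robin_upstreams: 0 is set: the first entry is the primary and the second the fallback (excerpt only; addresses as in the original config):

round_robin_upstreams: 0
upstream_recursive_servers:
  # Primary: IPv6 DNS-over-TLS server (lowest latency from a native IPv6 endpoint)
  - address_data: 2606:4700:4700::1111
    tls_port: 853
    tls_auth_name: "cloudflare-dns.com"
  # Fallback: IPv4 DNS-over-TLS server, used if the IPv6 path has a hiccup
  - address_data: 1.1.1.1
    tls_port: 853
    tls_auth_name: "cloudflare-dns.com"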

IMHO, a mobile device should probably use round_robin_upstreams: 1, but a fixed-point edge router device offers the best performance for its clients with round_robin_upstreams: 0. As such, the getdns/stubby heuristic for round_robin_upstreams: 0 could be improved for the scenario described above.

saradickinson commented 6 years ago

@abelbeck Thanks very much for the feedback. I've updated the title to reflect what needs doing on this issue. We might want to consider an explicit option for preferring IPv6 over IPv4....

abelbeck commented 6 years ago

Thanks @saradickinson

We might want to consider an explicit option for preferring IPv6 over IPv4....

I would suggest the order of the servers is the only preference needed ... the user can choose what is best for them. That said, adding an option, say reselect_primary_server: 3600, that resets the next-server list to the first server every 3600 seconds when round_robin_upstreams: 0 is set would be good.
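
To make the suggestion concrete, a sketch of how such a hypothetical option could sit in stubby.yml (reselect_primary_server is not an existing Stubby setting, just the name proposed above):

round_robin_upstreams: 0
# Hypothetical (not implemented): every 3600 seconds, reset the next-server
# list so the first (primary) upstream is tried again.
reselect_primary_server: 3600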

@wtoorop As a test I hacked at the stub.c code and removed upstream->conn_shutdowns++; from upstream_failed(), and the specific problem discussed here went away, though that is not a solution. From reading the code, it seems a client-initiated connection close is "cleaner" than an upstream server timeout close.

@saradickinson Possibly you can encourage DNS-TLS providers to use longer idle-timeouts at the server end so a default stubby idle_timeout: 10000 or even idle_timeout: 20000 would still allow client-side idle connection closing.

john9527 commented 6 years ago

One of my users reported this issue after I added stubby support to my Asuswrt-Merlin LTS router fork. Just adding a note that I am willing to test early releases, and I have 'early release' users of my router who would also be willing to help out.

earthsojourner commented 5 years ago

I thought I would chime in: at the moment I am seeing Quad9 use an idle timeout of 1500-2000ms, which is much shorter than previously noted. I was seeing the same behavior of stubby cycling servers until I set my idle_timeout value below 2000ms.

I am wondering if it would be worthwhile adding a note in stubby.yml.example explaining that stubby will cycle servers when round_robin_upstreams: 0 is set and idle_timeout is set to a value longer than a given upstream server has configured on the backend.

I haven't played around with this too much, but I am also seeing DNS lookups fail for the duration of the configured tls_backoff_time when the above condition is hit with only one upstream configured. Again, I believe this is all expected behavior, but it might be worthwhile noting it in stubby.yml.example so folks understand what to expect.
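
For example, the note could look something like this next to idle_timeout in stubby.yml.example (the wording is only a sketch, not the text actually proposed for the example file):

# NOTE: With round_robin_upstreams: 0, if idle_timeout is longer than the
# idle timeout an upstream server enforces on its end, the server will close
# idle connections first. Those remote shutdowns count against the upstream
# in the server selection logic, so Stubby may switch to the next server.
# With only one upstream configured, lookups can then fail for the duration
# of tls_backoff_time.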

Anyway, just thought I would pass along the suggestion.

saradickinson commented 5 years ago

@earthsojourner Thanks for the note - I will add something into the configuration to warn folks of this issue!

saradickinson commented 5 years ago

@earthsojourner I've added some text in https://github.com/getdnsapi/stubby/pull/180

earthsojourner commented 5 years ago

@saradickinson Cool, thanks!

a7i commented 4 years ago

Hoping to follow up on @abelbeck's last question:

Final question: with round_robin_upstreams: 0, once the next server is selected, is there any sort of configurable timer for stubby to retry the first server, or is it stuck on the 2nd server until a "next server" event occurs again?

Is there a way to configure this so Stubby would go back to the top of the list to re-attempt connectivity with the healthy server(s)?

I am observing that if the 2nd server also fails, Stubby never attempts to reselect a healthy server (in this case, the first one).

Kaan88 commented 3 years ago

There should be a way to make stubby use the first server again. In my case the second server is just a backup; once the first server comes back online, stubby should switch to it. This is how Windows primary/secondary DNS works, by the way.

saradickinson commented 3 years ago

@Kaan88 agreed - this would be a good optimisation....