[bug] DoT - TCP reset on egress

ghost commented 4 years ago

Version 1.9.6
Configure line: --target=arm-openwrt-linux --host=arm-openwrt-linux --build=x86_64-pc-linux-gnu --program-prefix= --program-suffix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --libexecdir=/usr/lib --sysconfdir=/etc --datadir=/usr/share --localstatedir=/var --mandir=/usr/man --infodir=/usr/info --disable-gost --enable-allsymbols --enable-subnet --with-ldns=/home/beast/beast/workspace/turris-os-packages-master-omnia/build/staging_dir/target-arm_cortex-a9+vfpv3_musl_eabi/usr --with-libexpat=/home/beast/beast/workspace/turris-os-packages-master-omnia/build/staging_dir/target-arm_cortex-a9+vfpv3_musl_eabi/usr --with-ssl=/home/beast/beast/workspace/turris-os-packages-master-omnia/build/staging_dir/target-arm_cortex-a9+vfpv3_musl_eabi/usr --without-pthreads --enable-tfo-server --enable-tfo-client
Linked libs: pluggable-event internal (it uses select), OpenSSL 1.1.1d 10 Sep 2019
Linked modules: dns64 subnetcache respip validator iterator
TCP Fastopen feature available

Whilst DNS queries with DoT are being resolved a tcpdump exhibits a plenitude of unhealthy egress connection resets

Flags: 0x004 (RST)

.... .... = Reserved: Not set ...0 .... .... = Nonce: Not set .... 0... .... = Congestion Window Reduced (CWR): Not set .... .0.. .... = ECN-Echo: Not set .... ..0. .... = Urgent: Not set .... ...0 .... = Acknowledgment: Not set .... .... 0... = Push: Not set .... .... .1.. = Reset: Set [Expert Info (Warning/Sequence): Connection reset (RST)] [Connection reset (RST)] [Severity level: Warning] [Group: Sequence] .... .... ..0. = Syn: Not set .... .... ...0 = Fin: Not set [TCP Flags: ·········R··]

for each upstream node being set in

forward-zone: name: "." forward-addr:

tcpdump

The upstream node's TCP port does not matter, 443 or 853, and neither the TLS version, 1.2 or 1.3.

gthess commented 4 years ago

I tried to reproduce this (on a linux machine) to no avail. It seems to be openwrt and/or target system related but I don't have one available at the moment to check.

Is the plenitude of TCP resets you are getting on openwrt only for unbound traffic? Did you compile unbound yourself? If so can you try again without the TCP fastopen feature? If it is TCP fastopen related, then the kernel (older version) maybe doesn't handle the TCP connection correctly; unbound just passes the option to the kernel.

ghost commented 4 years ago

I tried to reproduce this (on a linux machine) to no avail.

what is the net.ipv4.tcp_fastopen setting on that machine? On this node it is

TCP Fast Open 0 - disabled 1 - enabled on outgoing connections (client) 2 - available on listening sockets (server) 3 - enables both client & server)

net.ipv4.tcp_fastopen = 3

Is the plenitude of TCP resets you are getting on openwrt only for unbound traffic?

with the tcp.flags.reset == 1 filter set for analysing the dump the unbound generated traffic is the sole application exhibiting the issue - as depicted in the screenshot from the previous post.

Did you compile unbound yourself?

No

If so can you try again without the TCP fastopen feature?

Subsequently I am afraid not.

If it is TCP fastopen related, then the kernel (older version) maybe doesn't handle the TCP connection correctly; unbound just passes the option to the kernel.

I would reckon this not being an issue, though cannot be ruled out of course; whilst kernel 4.19.93 is not fresh off the mainline it still not that sort of legacy.

Not sure what/how else to contribute as info for a potential debugging effort. Got a few bandwidth testing apps on the node that generate egress traffic and could only speculate whether such apps leveraging FastOpen TCP but being bandwidth testing apps one would reckon.

None of those bandwidth testing apps exhibiting RST on egress however.

gthess commented 4 years ago

what is the net.ipv4.tcp_fastopen setting on that machine?

It is a personal computer so it was set at 1. I also tried with 3 but no change on the results.

whilst kernel 4.19.93 is not fresh off the mainline it still not that sort of legacy.

I could rule this out as I am currenly on 4.15.0.

The above (or similar) tcpdump could help (not only the RST packets). If you don't want to share it on github you can find my contact information here.

Otherwise I could also go for a virtualised openwrt environment when I have the time.

ghost commented 4 years ago

Played net.ipv4.tcp_fastopen values through from 0 to 3, restarting unbound in between, but that did not change the outcome, thus would appear irrelevant to the matter. Or, even to the issue at all since with net.ipv4.tcp_fastopen=0 (TFO disabled) the issue would not be expected to exhibit if it were related to TFO?

Looked into ^[1] for potential cause of RST but that references merely a TFO SYN flood scenario.

Unless there is a related bug in the kernel on this node and the node you tested that would explain the different outcome here are some settings particular to this node that may impact the outcome.

net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_rfc1337 = 1
net.ipv4.tcp_ecn = 1
net.ipv4.tcp_fastopen = 3
net.ipv4.tcp_sack = 1
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_window_scaling = 1
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 2
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0

Having turned off IPv4 for unbound the same issue exhibits with IPv6

screen

^[1] https://tools.ietf.org/html/rfc7413

gthess commented 4 years ago

I think I got something, I can reproduce it but I don't get that many RST packets. I concluded for now that it is just TCP related and it may manifest more easily on your node based on the settings (or a combination of) you shared.

What I see is that every TCP stream where unbound sends the RST packet the following happens:

TCP is established;
Unbound queries for a domain;
The upstream resolver sends a [PSH, ACK] packet back with 2 length (the length of the packet that will follow);
Unbound sends and [ACK] and then a [FIN, ACK] packet closing the connection;
When the upstream resolver sends the DNS reply back, unbound replies with the [RST] packet because the connection is already closed by unbound.

Could you verify that you see the same in your dumps? Right-clicking on the [RST] packet and selecting "Follow->TCP Stream" would make it easier.

ghost commented 4 years ago

I have got a dump ready, just trying to figure out whether Wireshark provides a way to obfuscate ip in the dump...

For the moment the filtered stream looks like this

screen

ghost commented 4 years ago

Here is the dump, replaced the GUA for the unbound instance with its ULA dp_anon.pcapng.gz

gthess commented 4 years ago

My observation above although correct at the moment does not seem to be a pattern for subsequent dumps. Still looking into this.

@n8v8R it would help if you could share the pcap file of the dump (not just one stream) as it seems you are hitting the issue far more frequently than I am. I am now leaning that this is just unbound's behavior based on the resolution (closing connections) and not a TCP issue.

ghost commented 4 years ago

Sure, except hope that is understandable that I would prefer not to provide a dump exhibiting the full network activity. For that purpose the dump is filtered to traffic (in/out) with upstream resolver instances.

As previously the dump been sanitised to the effect of having

replaced the GUA for the unbound instance with its ULA

The archive dump and list.zip

also contains a list of upstream resolver instances being leveraged by unbound - which may account for

it seems you are hitting the issue far more frequently than I am.

since it would appear that the issue does not exhibit with each upstream instance equally, i.e. it exhibits with some of the upstream instances but not with all. Plus with an increase of upstream instance leveraged by unbound the frequency of the issue being exhibited may increase as well.

ghost commented 4 years ago

it would appear that the issue does not exhibit with each upstream instance equally, i.e. it exhibits with some of the upstream instances but not with all.

Out of the list of 25 upstream instance set for unbound the dump exhibits the issue for 16 instances - list of upstrema instance exhibiting the issue.txt, but I am not sure whether unbound went through all the 25 instances during the period the dump been captured

ghost commented 4 years ago

looking at my dumps it would seem being a recurring scheme

unbound transmits FIN ACK via TCP
upstream node responds PSH ACK via TLS
unbound responds RST

It is as if unbound closed (tore down) the socket after having transmitted FIN ACK and prior receiving the upstream response and thus rejects the response from the upstream node.

Searching the subject I came across these

[2] https://serverfault.com/questions/854692/client-sends-rst-after-fin-ack [3] https://osqa-ask.wireshark.org/questions/13986/why-tcp-reset-sent-after-receive-finack-packet [4] http://deepix.github.io/2016/10/21/tcprst.html

ghost commented 4 years ago

According to this draft document ^[5] the RST transmission may not be a bug after all but rather a design to clear the socket out of TIME_WAIT so the source port can be reused more quickly. Why it happens with some upstream nodes but not others seems a bit of enigma though.

In that case I probably have to apologise for having wasted everyone's time but hope for understanding that the traffic looked somehow incongruous, a waste of bandwidth and CPU cycles. __ ^[5] https://tools.ietf.org/html/draft-faber-time-wait-avoidance-00

gthess commented 4 years ago

No time waste at all, I would be interested if something was amiss on the TCP/TLS layer. Upon further investigation I can confirm what I hinted at with my previous comment; this is normal unbound behavior.

[FIN, ACK] is sent because the upstream resolver took longer than expected to reply. unbound has an internal mechanism to record RTT times for upstream servers. If for some reason an upstream server is slower than usual to reply, unbound closes the connection, notes the variance in RTT and retries again with the query (simplified).

As you also noticed consequent TCP packets from upstream for the same (now closed) stream will get an RST response back because unbound is not expecting anything more on that stream.

All in all I don't see an issue but thanks for reporting!

ghost commented 4 years ago

Thank you for the feedback.

[FIN, ACK] is sent because the upstream resolver took longer than expected to reply. unbound has an internal mechanism to record RTT times for upstream servers. If for some reason an upstream server is slower than usual to reply, unbound closes the connection, notes the variance in RTT and retries again with the query (simplified).

That probably explains why it is exhibited with some remote nodes but not others.

What is the RTT expectation for unbound? Is there a knob to adjust that (tight) value? Because it seems a waste of bandwidth and CPU cycles (both, local instance and upstream node) for

RST being send
unbound retrying again with a simplified query

What does the simplified query entail, as opposed to full query?

gthess commented 4 years ago

What is the RTT expectation for unbound? Is there a knob to adjust that (tight) value?

You could use infra-cache-min-rtt:; from the manpage: "Lower limit for dynamic retransmit timeout calculation in infrastructure cache. Default is 50 milliseconds. Increase this value if using forwarders needing more time to do recursive name resolution."

What does the simplified query entail, as opposed to full query?

That's on me. I meant the explanation of unbound's logic was simplified, not the query.

ghost commented 4 years ago

Thanks for the pointer. But now your analysis

[FIN, ACK] is sent because the upstream resolver took longer than expected to reply.

does not seem to hold - having looked into the last dump submitted here the upstream response appears to be well within the 50 ms cap (wireshark -> Time Display Format ->Seconds Since Previous Captured Packet)

Annotation 2020-02-11 130156

Having introduced infra-cache-min-rtt: 5000 and restarted unbound the issue still exhibits and thus seems unrelated to RTT

I meant the explanation of unbound's logic was simplified, not the query.

What entails the simplified logic then?

ghost commented 4 years ago

Is the RTT value for unbound pertinent to single packet round trip or accumulating the time from the initial packet egress to the last packet ingress?

Looking at this debug log unbound debug log.txt the RTT values printed are not matching the RTT of single packets as shown in the dump.

gthess commented 4 years ago

Is the RTT value for unbound pertinent to single packet round trip or accumulating the time from the initial packet egress to the last packet ingress?

The calculated RTT for TCP connections is indeed per DNS query-response cycle, not for the individual TCP packets themselves.

Having introduced infra-cache-min-rtt: 5000 and restarted unbound the issue still exhibits and thus seems unrelated to RTT

infra-cache-min-rtt influences the timeout value that unbound will use for TCP. By using 5000 here the RST packets go away. Maybe you still see them but less frequently?

What entails the simplified logic then?

I used simplified to refer to my explanation of what unbound does behind the scenes wrt RTT and timeouts. I used the word simplified because unbound does a lot more with the recorded RTT values eg. server selection. If you want you can read more here.

ghost commented 4 years ago

Thanks for the response and the publication, helped to clear up things.

Is the RTT value printed with unbound-control dump_infra representative/relevant to the matter discussed here? Because:

1) none of it is near 50 ms and thus that value seems not matching real world traffic, least not if being removed mutiple hops / BGP downstream peers from the upstream resolver. 2) trying to figure out a sensible baseline value for infra-cache-min-rtt and wondering whether the output from the dump can assist with that

By using 5000 here the RST packets go away. Maybe you still see them but less frequently?

Got ~50 upstream instance set in unbound - will take a bit of time to wade through them one by one - will get back once done.

N.B. Glad to see RPZ support (finally) arriving!

ghost commented 4 years ago

Got ~50 upstream instance set in unbound - will take a bit of time to wade through them one by one - will get back once done.

Went through a couple of only but even with infra-cache-min-rtt: 5000 the RST packets are still showing in the amount/frequency

gthess commented 4 years ago

none of it is near 50 ms and thus that value seems not matching real world traffic, least not if being removed mutiple hops / BGP downstream peers from the upstream resolver.

This value is used as the minimum for timing out connections, anything less during calculations it will be set to that value.

Is the RTT value printed with unbound-control dump_infra representative/relevant to the matter discussed here? Because: trying to figure out a sensible baseline value for infra-cache-min-rtt and wondering whether the output from the dump can assist with that

For that I would use unbound-control lookup <domain-name>. It prints out the same values that unbound will use when deciding which server to use. Important values there are the rto and rtt. rtt is the value that will be used for the timeout. rto is the value that records the responsiveness of the server and the one that decides which server to pickup (based on the link I provided earlier).

Went through a couple of only but even with infra-cache-min-rtt: 5000 the RST packets are still showing in the amount/frequency

You can run the above lookup command and check how each forwarder is performing from unbound's point of view. If rto goes higher than the configured 5000 value that means a timeout occurred and you will see the RST packet.

ghost commented 4 years ago

Despite infra-cache-min-rtt: 5000 and the rto being well below that value the RST packet is still emitted by unbound

unbound-control lookup gold.com

The following name servers are used for lookup of gold.com. forwarding request: Delegation with 0 names, of which 0 can be examined to query further addresses. It provides 1 IP addresses. 2620:fe::fe rto 476 msec, ttl 867, ping 124 var 88 rtt 5000, tA 0, tAAAA 0, tother 0, EDNS 0 assumed.

gthess commented 4 years ago

Is that an isolated query? RTO calculations keep happening with every DNS packet; it may be the case that by the time you saw the RST and the time you issued the unbound-control command the RTO was updated to the above value. IIRC you use a forwarder for ".".

ghost commented 4 years ago

For that lookup to gold.com it is an isolated query, here is another isolated one unbound-control lookup tiger.com with all 53 forwarders specified in unbound (the previous sample was with a single forwarder) lookup.log

IIRC you use a forwarder for "."

yes, that is from the start of this thread. Does it have a particular impact?

ghost commented 4 years ago

tried with kernel paramaters

net.core.netdev_tstamp_prequeue = 0
net.core.tstamp_allow_data = 0
net.ipv4.tcp_timestamps = 0

but that did not change the outcome.

The concern is that the forwarded upstream resolvers will notice the excessive amount of RST packets and block my node from connecting.

gthess commented 4 years ago

yes, that is from the start of this thread. Does it have a particular impact?

I was only referring to testing because you forward all queries to all forwarders, and makes testing less trivial to correlate/examine.

In your latest log I don't believe you used infra-cache-min-rtt: 5000.

The concern is that the forwarded upstream resolvers will notice the excessive amount of RST packets and block my node from connecting.

Have you experienced such thing or is it just an assumption?

ghost commented 4 years ago

I was only referring to testing because you forward all queries to all forwarders, and makes testing less trivial to correlate/examine.

Sure, for that purpose also tested with only a single forward instance in place.

In your latest log I don't believe you used infra-cache-min-rtt: 5000.

Correct, that been opposed to previous output (which had only single forwarding instance and that value in place) for comparison.

Have you experienced such thing or is it just an assumption?

Did not see it happen yet, also considering that I got a larger number of forwarding instances and thus spreading queries. Just if a forwarding instance monitors (firewall) they might start wondering about those RST packages from a client.

Unfortunately the OpenWrt repo does not provide a TCP tracepoint utility, which might have been a better way than a tcpdump to get to the bottom of it.

Anyway, for entire isolation:

restricted unbound to listen on the localhost only (eliminating client queries)
disabled all auth zones
set a single forwarder instance forward-addr: 2620:fe::fe@853#dns.quad9.net
restarted unbound
started packet capture
on the unbound instance ran dig elephant.com which produced the below output and this packet dump dump.gz :

; <<>> DiG 9.14.8 <<>> elephant.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31965
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;elephant.com.                  IN      A

;; ANSWER SECTION:
elephant.com.           60      IN      A       52.20.58.194
elephant.com.           60      IN      A       52.22.149.59

;; Query time: 463 msec
;; SERVER: ::1#53(::1)
;; WHEN: Wed Mar 25 14:07:22 GMT 2020
;; MSG SIZE  rcvd: 73

gthess commented 4 years ago

Could you provide the same dump with 1.1.1.1: forward-addr: 2606:4700:4700::1111@853#cloudflare-dns.com ?

ghost commented 4 years ago

Sure, here we go various_dumps.zip

Since the CF query did not exhibit the RST issue I picked randomly two other forwarders from my list, each in the same isolation mode, of which:

forward-addr: 2a03:b0c0:0:1010::e9a:3001@853#dot.securedns.eu exhibits the RST issue
forward-addr: 2001:610:1:40ba:145:100:185:17@853#dnsovertls2.sinodun.com does not exhibit the RST issue

gthess commented 4 years ago

Thanks, that verifies my suspicions now. As I was previously testing with the cloudflare resolver. the random RST packets I was seeing in the dumps were related with timeouts. Now with the quad9 resolver I get consistent RST packets that seem to be related to the closing of the TLS communication. Not sure if it is a bug yet, will look into it.

ghost commented 4 years ago

Appreciated. Just in case here is the entire list of forwarder instances deployed on this node fw instances.txt

I was wondering whether some tcp session/connection reuse configuration on the upstream instances could be the issue.

ghost commented 4 years ago

Been wondering whether perhaps the recent SNI patch would also remedy this issue?

gthess commented 4 years ago

Regarding the recent patch: SNI was always sent with DoT forwarders, it was mainly the zone transfers that lacked the feature (auth-zone and rpz).

I had some more time to look into this and tracked it down to TLS close-notify. The RST I always see with quad9 is because unbound sends a close-notify and closes (FIN) the connection (as it should do). Quad9 then replies with their own close-notify which is met by RST from unbound because the connection is already closed. Other TLS implementations (like CF), after they see the close-notify and the FIN they proceed to reply with FIN skipping their close-notify.

So these constant RST's you are seeing are from server implementations that send close-notify after the connection is already closed on the client side (unbound).

As these RST are different from the ones we were speculating before (timeouts, retries) and do not hinder the resolver's performance I don't see it as an issue on the TLS communication.

ghost commented 4 years ago

Thank you for the effort/feedback.

Basically then a misconfiguration on the upstream server(s), least to my understanding they should not reply with a close-notify after unbound's FIN, ACK. Too bad that bandwidth and CPU cycles being wasted that way.

NLnetLabs / unbound

[bug] DoT - TCP reset on egress #137