Closed · RobbieTT closed this issue 1 year ago
There seem to be several things going on, and some comments on each part could be useful. I'll take them one by one.
For Unbound, queries are normally processed in parallel, and the server selection algorithm treats IPv4 and IPv6 equally, so that should be fine. When a query gets an answer from the upstream, it is immediately provided to the client; it does not wait for other client queries. In some cases it may wait for internally generated queries needed to validate DNSSEC security, but it does not wait for another client query to be answered first.
The pcap seems to show, from what I can tell from that screenshot, first traffic over IPv4 and later traffic over IPv6. This would be a normal sequence of events if, for example, the following happens. The client asks a query; the IPv4 upstream is chosen at random (roughly a one-in-four chance) and provides an answer, which is returned to the client. The client then asks another query; this time the IPv6 upstream is chosen at random, provides an answer, and that is also returned to the client. The reason it is not in parallel is that the client, the querier, is making its queries in sequence, for example asking for an A record and later for a AAAA record, or for other query types. This is the standard behaviour of some queriers, notably resolvers driven by resolv.conf and systemd, but there may also be options to have them make parallel queries.
Unbound has options to prefer the fastest server from the set, but normally this is not needed, and I think not here either. The default behaviour mixes randomness with preferring faster targets over unresponsive ones, and also filters out targets with unusual response times, and is probably fine. Perhaps the response-time issue that is seen is not ping-time responsiveness but the DNS resolution time: when the queried data is not in the upstream's cache, that server has to look it up, and this takes much longer than its fast ping time. If this is considered an issue, you could stop forwarding and instead have Unbound run as a full resolver, making those lookups itself. Unbound is then the one taking the time to look up the data, but the behaviour becomes visible in Unbound's logs instead of being hidden behind the upstream forwarder. That separates the concern of upstream forwarder speed from resolution speed, or at least makes it visible in the logs. That said, there are options to make Unbound prefer the fastest ping time: `fast-server-permil: 900`, for example, selects from the fastest servers 90% of the time, and the `fast-server-num` option can fine-tune how many servers count as the fast set. The `prefer-ip6` and `prefer-ip4` options can make Unbound's server selection algorithm prefer that type of address; lookups would still run in parallel, but those addresses would be selected by preference.
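As a sketch, the tuning options mentioned above would go in the `server:` section of unbound.conf; the values below are illustrative, not recommendations:

```
server:
    # Pick from the set of fastest servers 90% of the time (permil units)
    fast-server-permil: 900
    # How many of the fastest servers form that set
    fast-server-num: 3
    # Optionally bias server selection toward one address family
    prefer-ip4: yes
```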
It is possible to get information from Unbound that could be easier to read, or carry more information, than the pcap, for example with `verbosity: 5`. Unbound has extensive logs and also prints the querier address for incoming queries. It would be useful to have this information, because it can reveal what is going on inside Unbound. Can you get these logs? They could then be used to debug the sequential-processing issue.
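A minimal logging setup for this kind of debugging might look like the following; the logfile path is only an example and would need to match the local layout:

```
server:
    # Maximum debug detail; very noisy, use only while diagnosing
    verbosity: 5
    # Log each incoming query together with the client address
    log-queries: yes
    logfile: "/var/log/unbound.log"
```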
It could also be useful to have more control over the incoming queries in some way. I am not sure what is used now; perhaps use a command-line lookup tool, and then control whether the lookups are sequential or parallel by typing them one after the other, or at the same time in a different terminal?
Also, the Unbound logs would show what is happening inside the TLS channel, but that content does not seem to be a problem right now.
Thanks to all.
> The pcap seems to show, from what I can tell on that screenshot, there is first traffic over IPv4 and later traffic over IPv6. This would be a normal sequence of events if the following happens, for example. The client asks a query, randomly the IPv4 is chosen, a one in 4 chance, roughly, that provides an answer. The answer is returned to the client. The client then asks another query. Randomly the IPv6 upstream is chosen to answer it, and this then provides an answer. That is then also returned to the client. The reason it is not in parallel, is that the client, the querier, is making queries in sequence. Like for example asking for an A and then later for a AAAA address, or other query types than IPv4 and IPv6 addresses.
In the pcap example above it was a single client asking a single query, with the pcap taken on the WAN side of the router/firewall. The domain name was kia.com, as I knew this would not be in the cache:
Regarding `fast-server-permil`, the pfSense documentation states that a value of 900 is set by them as standard. This could be the case, but I did not see it explicitly set in the `unbound.conf` file. I have already noted a mismatch between an unrelated setting in the pfSense GUI and the unbound config, which works in the opposite sense to what is stated, so this may need an expert eye on my actual unbound config, which uses pfSense's default settings for forwarding:
> It is possible to get information from unbound, and that could be easier to read, or have more information than the pcap, with like verbosity: 5, unbound has extensive logs, and also prints the querier address for incoming queries. It would be useful to have this information, because it can reveal what is going on inside unbound. Can you get these logs? Then it could be used to debug the sequential processing issue.
Yes, that should be possible at the verbosity level requested.
> It could also be useful to have more control over the incoming queries, in some way. Not sure what is used now, perhaps use a commandline lookup tool
I have been using dig from either a LAN client or from pfSense/BSD CLI. I have a further option to perform a look-up via the pfSense GUI directly but I presume that offers nothing different, under the hood.
Is dig sufficient for testing, and is there an easy way to disable the cache temporarily (i.e. without killing the 'warm' cache contents) so I don't have to dream up unlikely domain names?
How surprising that it makes two queries to resolve the domain name. If it gets one query, I would expect only one upstream query. The logs could tell what is going on. Or maybe it is a CNAME, and then it resolves the target of the CNAME, in which case it looks very normal.
The config looks okay. Nothing I would note.
The unbound-control utility has a command to flush the cache for a name; it is then resolved afresh the next time it is asked for. `unbound-control flush example.com`, for example.
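For reference, a couple of related cache-flushing variants; this assumes remote control is set up, and the domain is just a placeholder:

```
# Flush a single name (common RR types) from the cache
unbound-control flush example.com

# Flush a name and everything below it
unbound-control flush_zone example.com
```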
It is not making two concurrent queries then, there is only one query coming in, and this is getting answered.
dig should be great for testing. It is also possible to use dig or another command-line tool to make queries directly to the upstream servers, at their IPv4 and IPv6 addresses, to see what they answer.
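For example, queries could be sent straight to the upstreams over both address families; the addresses here are the Quad9 forwarders from this thread and the domain is just a placeholder:

```
# Query the IPv4 upstream directly over UDP
dig @9.9.9.9 example.com A

# Query the IPv6 upstream directly
dig @2620:fe::fe example.com A

# Force TCP to port 53 (note: plain TCP, not DoT on 853)
dig @9.9.9.9 +tcp example.com A
```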
Is it just this one time, or is IPv6 a lot slower than IPv4? I would imagine that the upstream resolver is located in roughly the same place, so the time increase should be small. If the IPv6 connection has a lot of lag on it, something that can be attributed to tunnels, perhaps `prefer-ip4: yes` is an option that can increase the speed.
> ...dig should be great for testing. It is also possible to use dig or a commandline tool to make queries towards the upstream servers, at their IPv4 and IPv6 address, and then it shows what they answer.
Perfect.
> Is it just this one time, or is IPv6 a lot slower than IPv4? I would imagine that the upstream resolver is located roughly in the same place, so the time increase is a bit.
Typically IPv6 is slightly faster, and the raw returns from the forwarder are typically in the <12 ms range. Handshakes and DoT increase that by multiples; add in the use of a temporarily slower server and things can become quite protracted.
☕️
The screenshot of timers shows that the ping times are fine. But both of the IPv6 addresses have an RTO that is much larger than the RTT; this is an increase caused by recent timeouts, and it has exponentially backed off. The IPv4 addresses are fine and do not seem to have many timeouts, even though 1 is listed in the timeout column. The ping is fine when it connects, but the IPv6 addresses have timeouts; perhaps this is causing the slowdown.
> The ping is fine, when it connects, but the IPv6 addresses have timeouts, perhaps this is causing slowdown.
It could be exactly that, but I am not sure how these counters work. If they are cumulative, then over the ~12 days the router has been 'up' the errors look inconsequential.
Another thing I don't understand in the table is the reported ping time, which varies between 20 and 28 ms. If I ping any of those servers directly they return an average of 7.522 ms. I have run PingPlotter against them and the trace is reassuringly flat. Pinging from the router itself to 9.9.9.9 directly provides these figures:
[23.05-RELEASE][admin@Router-8.redacted.me]/root: ping 9.9.9.9
PING 9.9.9.9 (9.9.9.9): 56 data bytes
64 bytes from 9.9.9.9: icmp_seq=0 ttl=62 time=7.470 ms
64 bytes from 9.9.9.9: icmp_seq=1 ttl=62 time=7.590 ms
64 bytes from 9.9.9.9: icmp_seq=2 ttl=62 time=7.571 ms
64 bytes from 9.9.9.9: icmp_seq=3 ttl=62 time=7.529 ms
64 bytes from 9.9.9.9: icmp_seq=4 ttl=62 time=7.534 ms
64 bytes from 9.9.9.9: icmp_seq=5 ttl=62 time=7.392 ms
64 bytes from 9.9.9.9: icmp_seq=6 ttl=62 time=7.483 ms
64 bytes from 9.9.9.9: icmp_seq=7 ttl=62 time=7.589 ms
64 bytes from 9.9.9.9: icmp_seq=8 ttl=62 time=7.563 ms
64 bytes from 9.9.9.9: icmp_seq=9 ttl=62 time=7.504 ms
^C
--- 9.9.9.9 ping statistics ---
10 packets transmitted, 10 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 7.392/7.522/7.590/0.059 ms
[23.05-RELEASE][admin@Router-8.redacted.me]/root:
The values are those expected of my connection but not representative of the table extracted from unbound.
Pinging from a wired client behind the router is no different (aside from the TTL & added latency of the extra hop):
[snip]
64 bytes from 9.9.9.9: icmp_seq=9 ttl=61 time=7.888 ms
^C
--- 9.9.9.9 ping statistics ---
10 packets transmitted, 10 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 7.703/7.816/7.919/0.076 ms
rob@Smaug ~ %
Of course, the unbound data requested of me earlier can only help (I'm waiting for a suitable network opportunity), but any clarity on how to correctly interpret the data in the unbound table above would be helpful.
☕️
The higher ping value likely includes the TLS handshake time. If forward-tls-upstream were turned off, Unbound would use UDP and then see those low ping values too. Yes, the logs could be useful. The timeouts look like a problem to me, because the high value of 600+ msec, compared to the round-trip time of 70-80 msec, is about 8x, i.e. 3 timeouts that happened in sequence. So both of them have had 3 connections fail. Now the RTO is so large that Unbound is likely no longer choosing them and is using only the IPv4 addresses by preference.
Answering one of my own questions as just took a snapshot of the unbound table as it sits this morning:
So the timeout errors are not cumulative, but the ping values still look odd, especially achieving a value below the physical latency of my connection.
@wcawijngaards our responses crossed each other, but your comments make my table above look even weirder. I'll let you ponder them, but I suspect that you probably don't have enough data from me just yet to fully explain them.
☕️
The low value is very likely due to lack of information: there are almost no observations of a round trip, the estimate is mostly variance, and the average is simply uncertain. So that is not really a problem. The RTT value is the ping value and the variance combined, and the RTO value has timeout backoff applied to it. No, the timeouts are only tracked temporarily, and this table again shows timeouts that have happened.
@wcawijngaards Thanks for the subtle prod over TCP, I could have added +tcp to the ping command but, well, I didn't - Doh!
I'm now fighting the `unbound-control flush example.com` command, as it returns the following:
[23.05-RELEASE][admin@Router-8.redacted.me]/root: unbound-control flush example.com
[1687341019] unbound-control[54714:0] warning: control-enable is 'no' in the config file.
[1687341019] unbound-control[54714:0] error: connect: Connection refused for 127.0.0.1 port 8953
[23.05-RELEASE][admin@Router-8.redacted.me]/root:
This makes simple & repeatable testing somewhat challenging.
The pfSense config of unbound has this at the very end:
###
# Remote Control Config
###
include: /var/unbound/remotecontrol.conf
The file referenced contains this:
[23.05-RELEASE][admin@Router-8.redacted.me]/root: cat /var/unbound/remotecontrol.conf
remote-control:
control-enable: yes
control-interface: 127.0.0.1
control-port: 953
server-key-file: "/var/unbound/unbound_server.key"
server-cert-file: "/var/unbound/unbound_server.pem"
control-key-file: "/var/unbound/unbound_control.key"
control-cert-file: "/var/unbound/unbound_control.pem"
[23.05-RELEASE][admin@Router-8.redacted.me]/root:
So it does not reflect the example in the unbound documentation and, in this case, it does not work.
Any ideas on what needs fixing?
☕️
[I seem to be dragging you around with my unfamiliarity with unbound & pfSense; I do apologise and appreciate the help.]
The unbound-control tool seems to be reading from a different config file. The `-c` option can be used to specify one.
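Putting that together with the pfSense paths mentioned earlier in this thread, the invocation would be the config file via `-c` followed by the command to run:

```
# Point unbound-control at the pfSense-generated config, then give a command
unbound-control -c /var/unbound/unbound.conf status
unbound-control -c /var/unbound/unbound.conf flush example.com
```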
I tried the `-c` option to point at the actual `unbound.conf` file location, but it didn't accept it:
[23.05-RELEASE][admin@Router-8.redacted.me]/root: unbound-control -c /var/unbound/unbound.conf
Usage: unbound-control [options] command
Remote control utility for unbound server.
[...snip]
I also checked `unbound-control status` after the `-c` command:
[23.05-RELEASE][admin@Router-8.redacted.me]/root: unbound-control status
[1687349207] unbound-control[67259:0] warning: control-enable is 'no' in the config file.
[1687349207] unbound-control[67259:0] error: connect: Connection refused for 127.0.0.1 port 8953
unbound is stopped
[23.05-RELEASE][admin@Router-8.redacted.me]/root:
I'm not sure if this is relevant or not, but the `remotecontrol.conf` file has a control-port of 953 vs the 8953 shown above:
[23.05-RELEASE][admin@Router-8.redacted.me]/root: cat /var/unbound/remotecontrol.conf
remote-control:
control-enable: yes
control-interface: 127.0.0.1
control-port: 953
server-key-file: "/var/unbound/unbound_server.key"
server-cert-file: "/var/unbound/unbound_server.pem"
control-key-file: "/var/unbound/unbound_control.key"
control-cert-file: "/var/unbound/unbound_control.pem"
[23.05-RELEASE][admin@Router-8.redacted.me]/root:
I'm either lost in unbound syntax or failing to understand how pfSense arranges the `unbound.conf` file and the additional files it points to.
☕️ time
The first command is failing because there is no command on the command line; the second has a status command.
Aargh, I think my brain has failed and your patience is appreciated. I guess I should have been typing:
unbound-control -c /var/unbound/unbound.conf flush example.com
unbound set at verbosity 5, using just 2 Quad9 IPv4 server addresses, so I am presuming no major issues here:
I'll re-run with the Quad9 IPv6 name servers added to the forwarding list when network traffic allows.
☕️
@wcawijngaards
> It is possible to get information from unbound, and that could be easier to read, or have more information than the pcap, with like verbosity: 5, unbound has extensive logs, and also prints the querier address for incoming queries. It would be useful to have this information, because it can reveal what is going on inside unbound. Can you get these logs? Then it could be used to debug the sequential processing issue. Also the unbound logs would show what is happening inside the TLS channel, but that content does not seem to be a problem, right now.
As requested:
The DNS Resolver Infrastructure Cache Speed summaries can look quite wild:
I hope these help you shed some light as to why the query response time can become so protracted when using both IPv4 and IPv6 forwarders.
☕️
In the logs there is one oddity, it seems this happens:
Jun 27 17:12:18 Router-8 unbound[71149]: [71149:3] debug: comm point listen_for_rw 44 0
Jun 27 17:14:32 Router-8 unbound[71284]: [71284:0] debug: chdir to /var/unbound
That looks like unbound PID 71149 ceases to exist and then, about two minutes and 14 seconds later, it starts again. If intentional, that is harmless, but there would normally be logs about the shutdown sequence.
From the infra cache stats, it looks like a lot of timeouts are happening, also for an IPv4 address, I spot, with the other timeouts for IPv6 addresses. These timeouts cause unbound to pause and wait and, as I already mentioned earlier, I think they are the cause of the wait times. Something must be wrong with the connection; fixing the upstream connectivity would likely fix the issue.
Somehow this does not happen if only IPv4 is used? The presence of IPv6 traffic causes packet drops?
From the logs, there is no mention of dropped connections. Also, because the logs are fairly short, I guess they did not capture them. It would probably not look like anything in particular if this is some sort of loss of network connectivity. Normally, network connectivity does not have this kind of packet drop; zero drops would be expected, and to these forwarders in particular I would not expect packet drops.
Perhaps prefer-ip4 can help, if IPv6 network connectivity just does not work right, e.g. has packet drops. Or fix the IPv6 connection. But there are also some packet drops for IPv4. The TLS connections lag a lot when that happens, and it is the TLS stack that does that. If the forwarders are configured to use UDP, unbound chooses a timeout, which would likely be fairly short, 93 msec for the IPv4 addresses, and the retry is quick and easy. It may be easier to work around this lossy network connection that way, as a packet drop for IPv4 then costs only a couple hundred msec of delay, once in a while.
Unbound can configure retries; the default is 5 retries to a server: `outbound-msg-retry: 5`. The wait times for UDP responses can also be configured, with `infra-cache-min-rtt: 50` and `infra-cache-max-rtt: 120000`.
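As a sketch, these knobs sit in the `server:` section; the values below are the defaults just described, with the rtt bounds in milliseconds:

```
server:
    # Retries to a single upstream server before giving up on it
    outbound-msg-retry: 5
    # Lower and upper bounds for the retransmit timeout, in msec
    infra-cache-min-rtt: 50
    infra-cache-max-rtt: 120000
```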
But the defaults are fine, and unbound should retry after the IPv4 failures. I would then disable the IPv6 forwarder addresses because of the packet loss on IPv6, and use UDP to more easily recover from the packet loss on IPv4. Not sure if this is helpful; perhaps diagnosing the actual network connection fixes the underlying issue. It is possible to make queries from the command line to the upstream IP addresses, but seeing the packet loss would need a lot of repeats.
> In the logs there is one oddity, it seems this happens:
>
> Jun 27 17:12:18 Router-8 unbound[71149]: [71149:3] debug: comm point listen_for_rw 44 0
> Jun 27 17:14:32 Router-8 unbound[71284]: [71284:0] debug: chdir to /var/unbound
>
> That looks like unbound 71149 ceases to exist, and then about two minutes and 14 seconds later it starts again. That, if intentional is harmless, but there would normally be logs about the shutdown sequence.
I presume it was caused by this line:
Jun 27 17:12:18 Router-8 unbound[71149]: [71149:0] info: control cmd: stats_noreset
> From the infra cache stats, it looks like there are a lot of timeouts happening. Also for an IPv4 address, I spot. And the other timeouts for IPv6 addresses. These timeouts cause unbound to pause and wait, and I think, but I already mentioned earlier, are the cause of the wait times. Something must be wrong with the connection. I would think that fixing the upstream connectivity would likely be the issue.
The upstream connection to the forwarders is perfect, with hardly a ripple on PingPlotter and steady at 7.2 ms. It is a 1 Gbit fibre service via a 2.5 GbE ONT. Clearly I am blind to what happens behind Quad9's door, but I tried the same tests with the Cloudflare equivalents and there was no change in the symptoms or performance.
> Somehow this does not happen if only IPv4 is used? The presence of IPv6 traffic causes packet drops? From the logs, there is no mention of dropped connections in there. Also because they are fairly short, I guess they did not capture them. I guess it would not look like anything in particular, if this is some sort of loss of network connectivity. Normally, networks connectivity does not have this kind of packet drops, like 0 drops, would be expected. Also to these forwarders, would not expect packet drops.
The issue appears when you add IPv6 addresses to the forwarder list. When I added Cloudflare IPv6 addresses to the 4 current forwarders the problem increased further: more IPv6 addresses = more issues. I'm not seeing anything that suggests dropped packets on the WAN side, and the pcaps support this; only that the timings get longer and are cumulative, tripping a timer somewhere.
> Perhaps prefer-ip4 can help, if ip6 network connectivity just does not work right, eg. has packet drops. Or fix the IPv6 connection. But then there is also some packet drops for IPv4. The TLS connections lag a bunch, and this is the TLS stack that does that. If the forwarders are configured to use UDP, unbound chooses a timeout, and that would likely be fairly short, 93 msec, for the IPv4 addresses, and this retry is quick and easy. It may be easier to help around this packet lossy network connection, as a packet drop for IPv4 then is only a couple hundred msec of delay, once in a while.
If it comes to it I will have to remove the IPv6 addresses, but this is not ideal. I'd rather stick with DoT, so the TLS/TCP handshakes are just one of those things. If UDP is the answer, it kind of defeats the reason I moved away from Dnsmasq as my caching DNS service.
> Unbound can configure retries, the default is 5 retries to a server...
Where should I look for these retries (if present), as all I see in the pcaps is the sequential use of IPv6 after IPv4, with the total time of these queries driving up the answer time?
You should read the next observation with care, as I am far from sure of what I am seeing myself, but some of the expanded/delinquent timings seem to be associated with additional probing after the main request (prefetch activity?), with the client only receiving its actual answer once all the other queries are complete. Again, I am not sure what is going on under the hood, but there is a lot more activity when the query times go sideways. Timings get summed and multiplied, whilst the pcaps show nothing but protracted yet otherwise normal activity.
I included an `unbound-control stats_noreset` snapshot in the previous post, and to my eyes it looks normal. Would you expect to see failures in this data if packets were being dropped or malformed somewhere?
Apologies for the clipped data provided previously; I was limited by the character limit. I do have larger logs, so feel free to point me at things to look at or grep for.
Thanks again for looking at this.
☕️
So, if the upstream is working fine, the issue must be close to the server. If not actually the server itself. This happens when IPv6 is used, and more IPv6 causes more issues. The issues are packet drops, for IPv6, but also a packet drop for IPv4 is visible in the infra stats. This then causes slowdown.
The first-IPv4-then-IPv6 behaviour is caused by unbound selecting the best servers, and those are the IPv4 servers because they do not drop packets. Then unbound retries and attempts the IPv6 servers after that; that means the IPv4 attempt failed somehow. It could also be random selection, which should be evenly weighted, because that is what the unbound server selection code does.
The statistics output did not look problematic to me, apart from the long resolution times when timeouts must have been happening.
If the problem is close to the server, something must be wrong, if not with the server itself. If unbound is just creating a socket, then the system, network card, cable, network router or other network equipment up to the working WAN link is the likely cause, dropping packets once IPv6 gets enabled. The failure where the process ceases to exist is not explained by the stats_noreset command; that should not end the server process. The process seems to have been killed and then restarted two minutes later. If that is caused by a failure in hardware, like the mainboard or overheating, followed by a router restart, that could explain it, and may also explain the packet-drop behaviour. Or the problem could be in software: if the machine is out of memory, unbound should log an out-of-memory error, but the OS's OOM killer can kill the process without any further logs from Unbound. And the machine could run out of memory because the extra IPv6 sockets use buffer memory. Perhaps it drops packets because of lack of buffer space, causing the connection failures?
Unbound does not actually perform probing, there is a root key sentinel lookup in recent versions, but that is only once. And other queries only happen in line with client queries. Sometimes unbound sends another lookup because of a failure, or CNAME chasing.
Unbound is running on a Netgate 6100 Max (v23.05.1 on FreeBSD 14.0-Current). It has ix and igc interfaces, with the WAN on an igc (2.5 GbE) interface but running at 1 GbE. The LANs are SFP+ DACs and the 6100 comes with its standard 8GB RAM & Atom C3558 (with QAT active). The only non-standard hardware is the SSD, which is a pure Optane drive (64GB). It has physical resources to spare but I appreciate that may not always mean available, if there is a config issue:
For more granular details, top does not seem to show anything of note (at least to my eyes). NB that my WAN is via PPPoE so it places most of the actual traffic via a single core due to the BSD vs PPPoE issue:
I've not found any errors in my WAN traffic or with the simple cabling from the router to the ONT. For most of the testing I am not using the LAN side, to rule out the influence of switches, DACs, RJ45 etc. The LAN side errors recorded are non-zero (~460 total 'in' packets, so many orders of magnitude below 1%).
Looking a bit wider I did find 2 things of note. Running the icmpcheck.popcount.org 'Frag Test' it fully passed on IPv6 but did have issues with IPv4; somewhat the reverse of what I was expecting:
The other thing of note was simultaneously comparing IPv4 and IPv6 to Quad9 using PingPlotter. This is being run on a headless Mac acting as a server on a 10 GbE link, so it does remove the purity of testing from the router alone. I have zero packet loss from the server-switch-switch-router-WAN-ISP link on either IPv4 or IPv6 but there is significant packet loss on IPv6 further upstream, just 1 hop away from Quad9's servers (the PingPlotter 'timeout' is set to the default 3000ms):
Could this be the potential WAN-side issue you were looking for?
☕️
(If I can test in a better or more productive way just point me in the right direction.)
Looking at the IPv6 address that the packets are being dropped at it corresponds to this place:
LONAP is a "not for profit" Layer 2 Internet Exchange Point (IXP) based in London. Our data-centres host a network of interconnected switches providing free-flowing peering to help minimise interconnection costs. We provide exclusive connectivity between members, who are effectively LONAP stakeholders. This ensures that LONAP members enjoy excellent value and maximum benefits:
☕️
It is nice that LONAP delivers not-for-profit IXP connectivity. The 25% packet loss indication looks like the WAN-side issue we were looking for. The 0.2% for IPv4 is also important to note, because that means degraded performance. The cutout of the process is also worrying, in that the server process disappeared.
But the 25% packet loss for IPv6, apparently occurring some of the time, is certainly something that grinds connectivity to a halt. I do not think TCP or TLS is going to cope with that sort of number, and it seems not to be doing so in this issue.
So, one solution is to not list the IPv6 addresses. That still leaves the 0.2% IPv4 trouble and the process cutout issue, but it avoids the 25% packet loss.
Another is to use UDP instead of TLS. In that case, Unbound performs the retries itself, and they are much faster and more lightweight, comparatively, so that would be able to work. But since the IPv6 host is the same host as the IPv4 address, it is simply another way to contact the same upstream service, so perhaps this is not as useful as just using IPv4.
Also, it is possible to remove the forward altogether and have unbound run as a full resolver, contacting the authority servers. Because most lookups would then likely not traverse that lossy hop towards this particular upstream forwarder, it would likely work. It is then not using TLS.
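As a sketch, the UDP variant would keep the same forward-zone but drop TLS (addresses as used elsewhere in this thread; plain DNS goes to port 53), while running as a full resolver simply means removing the forward-zone block entirely:

```
forward-zone:
    name: "."
    # Plain UDP/TCP to port 53; retries are cheap compared to TLS
    forward-tls-upstream: no
    forward-addr: 9.9.9.9
    forward-addr: 149.112.112.112
```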
Unbound is configured to not use fragments if possible, something that is advocated for DNS servers. So the fragment failure is not really an issue at all.
@wcawijngaards - Thanks again and I have raised a support ticket (#32073) with Quad9 and I will report back here with any details they provide.
☕️
Quad9 are on the case as they can:
> ...replicate the packet loss to multiple destination IPv6 addresses on different networks, where the packet loss is occurring on the router of one of our upstream providers.
☕️
Quad9 appears to have resolved the issue with both IPv6 changes and a significant capacity upgrade that went live yesterday evening:
> Thanks for your patience. We've deployed considerably more capacity in London, and a lot of traffic shifted to our new "qlhr1" PoP. Can you tell us if your metrics have improved since yesterday evening (UTC)?
My unbound resolver stats now look very healthy:
Thanks to all and I am marking this topic as closed.
☕️
Forwarding to a mixed block of IPv4 and IPv6 name server addresses effectively doubles the query response time, as the query to the IPv6 server address does not start until the IPv4 query has completed (i.e. they run sequentially, not in parallel). Additionally, both the IPv4 and IPv6 queries have to fully resolve before any answer is provided to clients.
unbound Version 1.17.1 as bundled with pfSense Plus Version 23.05-Release
Desired behaviour:
With a list of forwarding name servers containing both IPv4 and IPv6 addresses (example below), the lookups should run in parallel, with the option of selecting the fastest response from the two servers chosen by unbound in the normal manner. This would also provide an element of fallback should either the IPv4 or IPv6 address fail to provide a response, as well as a faster 'first-past-the-post' response.
It is accepted that the number of queries sent would still be doubled (as it is now), but running in parallel would avoid a faster IPv4 response being masked from the client until the IPv6 query has started and run to completion (or vice versa). As a stretch target, it would be ideal if the normal unbound forwarder selection behaviour were IPv4/6 agnostic, allowing either address family to be used by the selection algorithm, as this would halve the traffic and mimic the current behaviour when only IPv4 or only IPv6 forwarders are in use.
Attachments PCAP overview showing sequential IPv4 query-response + IPv6 query-response-answer 6 & answer 4:
Forwarding addresses used in the above example:
forward-zone:
    name: "."
    forward-tls-upstream: yes
    forward-addr: 9.9.9.9@853#dns.quad9.net
    forward-addr: 149.112.112.112@853#dns.quad9.net
    forward-addr: 2620:fe::fe@853#dns.quad9.net
    forward-addr: 2620:fe::9@853#dns.quad9.net
☕️