DNS-over-QUIC upstream servers no longer work on v0.107.41

Freekers commented 1 year ago

Prerequisites

[X] I have checked the Wiki and Discussions and found no answer
[X] I have searched other issues and found no duplicates
[X] I want to report a bug and not ask a question or ask for help
[X] I have set up AdGuard Home correctly and configured clients to use it. (Use the Discussions for help with installing and configuring clients.)

Platform (OS and CPU architecture)

Custom (please mention in the description)

Installation

Docker

Setup

Other (please mention in the description)

AdGuard Home version

v0.107.41

Action

Click 'Test Upstreams'

Expected result

Confirmation that the upstream server is working correctly.

Actual result

Server "quic://XXXXX.dns.nextdns.io": could not be used, please check that you've written it correctly

Additional information and/or screenshots

I'm running two AGH instances. After updating both from v0.107.40 to v0.107.41, one instance works fine, but on the other, DNS-over-QUIC upstream servers no longer work. The error displayed is: Server "quic://XXXXX.dns.nextdns.io": could not be used, please check that you've written it correctly. I also tried AdGuard's QUIC server, but the result is the same.

Both instances run on Docker, but the host OS differs. The working instance runs on Ubuntu Server 22.04. The broken instance runs on a Synology NAS (x86_64 GNU/Linux synology_apollolake_918+). I've already deleted the container and re-pulled the image, but the problem persists. This DNS-over-QUIC upstream server worked on both instances on v0.107.40.

I enabled debug logging and found the following, which could be related:

2023/11/14 15:26:31.325506 1#55 [debug] bootstrap: dialing 45.11.106.155:853 (1/4)
2023/11/14 15:26:31.326218 1#55 [debug] bootstrap: connection to 45.11.106.155:853 succeeded in 114.467µs
2023/11/14 15:26:31.328239 1#55 [debug] dnsproxy: upstream quic://XXXXX.dns.nextdns.io:853 failed to exchange ;HsGH2CJwy_JPd3x0T.multi.surbl.org.   IN   A in 138.551848ms: opening quic connection to quic://XXXXX.dns.nextdns.io:853: INTERNAL_ERROR (local): write udp [::]:44035->45.11.106.155:853: sendmsg: invalid argument
2023/11/14 15:26:31.328496 1#55 [debug] proxy: replying from upstream: opening quic connection to quic://XXXXXX.dns.nextdns.io:853: INTERNAL_ERROR (local): write udp [::]:44035->45.11.106.155:853: sendmsg: invalid argument
2023/11/14 15:26:31.328663 1#55 [debug] dnsforward: finished processing upstream

This issue seems related to https://github.com/AdguardTeam/AdGuardHome/issues/6301 and https://github.com/AdguardTeam/AdGuardHome/issues/6335, which were resolved in v0.107.40.

ainar-g commented 1 year ago

Thanks for the report and the logs. I suspect that the kernel version may be the reason for the difference. Can you show the output of uname -a on both machines?

Also, does adding QUIC_GO_DISABLE_ECN=true on the machine with the issue fix it?

Freekers commented 1 year ago

Thanks for the report and the logs. I suspect that the kernel version may be the reason for the difference. Can you show the output of uname -a on both machines?

Also, does adding QUIC_GO_DISABLE_ECN=true on the machine with the issue fix it?

Output of uname -a on the working machine:

Linux raptor 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Output of uname -a on the broken machine:

Linux TurboPolyp 4.4.180+ #42962 SMP Mon May 29 14:38:23 CST 2023 x86_64 GNU/Linux synology_apollolake_918+

Where do I add QUIC_GO_DISABLE_ECN=true? Is this an environmental variable? If so, I suppose I would need to enter this inside the Docker container, correct?

Thanks

ainar-g commented 1 year ago

Thanks for the info.

Where do I add QUIC_GO_DISABLE_ECN=true? Is this an environmental variable? If so, I suppose I would need to enter this inside the Docker container, correct?

Yes, this should be set in the container's environment. The AdGuardHome binary should be able to observe the value of that environment variable.
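
To illustrate why the variable has to be present in the environment of the AdGuardHome process itself, here is a minimal Go sketch of how a program might read such a flag. The function name ecnDisabled and its parsing rules are assumptions made for illustration, not quic-go's actual code. A process only sees the environment it was started with, so a variable exported later in a separate shell is not visible to a binary that is already running.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// ecnDisabled reports whether QUIC_GO_DISABLE_ECN, as returned by getenv,
// holds a true-like value ("true", "1", "T", ...). Unset or unparsable
// values count as "not disabled". Hypothetical helper for illustration.
func ecnDisabled(getenv func(string) string) bool {
	disabled, err := strconv.ParseBool(getenv("QUIC_GO_DISABLE_ECN"))
	return err == nil && disabled
}

func main() {
	// os.Getenv reads the environment this process was started with.
	fmt.Println("ECN disabled:", ecnDisabled(os.Getenv))
}
```

Passing the lookup function as a parameter keeps the helper easy to exercise without touching the real environment.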

Freekers commented 1 year ago

Thanks for the info.

Where do I add QUIC_GO_DISABLE_ECN=true? Is this an environmental variable? If so, I suppose I would need to enter this inside the Docker container, correct?

Yes, this should be set in the container's environment. The AdGuardHome binary should be able to observe the value of that environment variable.

I have set the environment variable inside the container as follows:

docker exec -it adguard sh
/opt/adguardhome/work # echo $QUIC_GO_DISABLE_ECN

/opt/adguardhome/work # export QUIC_GO_DISABLE_ECN=true
/opt/adguardhome/work # echo $QUIC_GO_DISABLE_ECN
true

But sadly it does not fix the issue (same error message). I also tried using the edge image, same issue.

ainar-g commented 1 year ago

If you're running AGH with something like docker run, you should use the -e/--env.

Freekers commented 1 year ago

If you're running AGH with something like docker run, you should use the -e/--env.

Oops, my bad, you're right. I've now set the environmental variable in my docker-compose file as follows:

services:
  adguard:
   image: adguard/adguardhome:latest
   restart: always
   container_name: adguard
   network_mode: "host"
   environment:
    - TZ=Europe/Amsterdam
    - QUIC_GO_DISABLE_ECN=true
   volumes:
    - /volume1/docker/adguard/work:/opt/adguardhome/work
    - /volume1/docker/adguard/conf:/opt/adguardhome/conf

I can confirm that the issue is now resolved. The QUIC upstream DNS server now works again, thank you.

What does this setting QUIC_GO_DISABLE_ECN=true do exactly?

Thanks

ainar-g commented 1 year ago

What does this setting QUIC_GO_DISABLE_ECN=true do exactly?

It disables additional congestion-control features added to quic-go in v0.39.0.

It's good that the workaround works, but it's still weird, as AGH v0.107.41 uses quic-go v0.39.2, which should have fixed the sendmsg: invalid argument issue. Perhaps Synology has a weird kernel build.

@marten-seemann, is there any way we could debug this further?

jumpsmm7 commented 1 year ago

What does this setting QUIC_GO_DISABLE_ECN=true do exactly?

It disables additional congestion-control features added to quic-go in v0.39.0.

@ainar-g It turns out @marten-seemann only patched this for FreeBSD, AMD64 (aka x86_64) Environment. https://github.com/AdguardTeam/AdGuardHome/issues/6301

Users of Asuswrt-Merlin routers are also experiencing this issue: https://www.snbforums.com/threads/adguardhome-new-releases-2023.85191/post-875540. As a temporary fix, I plan to add the QUIC_GO_DISABLE_ECN=true option to the Env variable PREARGS until an adequate fix has been provided.

Here is an example of the environment of Asuswrt-Merlin Routers:

ASUSWRT-Merlin RT-AX88U_PRO 3004.388.4_0 Mon Aug 21 19:34:19 UTC 2023
admin@RT-AX88U_Pro-29B8:/tmp/home/root# uname -a
Linux RT-AX88U_Pro-29B8 4.19.183 #1 SMP PREEMPT Mon Aug 21 15:34:46 EDT 2023 aarch64 ASUSWRT-Merlin

HTH

marten-seemann commented 12 months ago

How would the cmsg look on other platforms? Would be good to fix this in quic-go, the env is just an escape hatch and shouldn’t be a permanent solution.

ainar-g commented 12 months ago

@jumpsmm7, you're pointing to the FreeBSD issue, but all Linux platforms should have been fixed in #6335. See quic-go/quic-go#4127.

As for the control message, I'm leaning towards this being a change in the Linux kernel somewhere around v5, since so far this seems to affect only those with kernels in the v4.x branch, but I don't have any solid proof just yet.

ainar-g commented 12 months ago

@marten-seemann, another theory I've had is that the issue could have something to do with how quic-go sets IP_TOS/IP6_TCLASS depending on whether or not an IP address is convertible to IPv4 rather than checking for the socket family. It could also be dependent on sysctl net.ipv6.bindv6only, although I cannot reproduce any errors either way on my Ubuntu with v5.15.0 kernel. I've seen some C code that just sets both, too, but I'm not sure if that's the correct solution.
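
The distinction in this theory can be made concrete with a small Go sketch. It uses the addresses from the debug log in the original report: the local socket is bound to [::] while the destination 45.11.106.155 is an IPv4 address. The function names are hypothetical, and the sketch only shows that the two selection strategies can disagree. It does not claim which choice a given kernel accepts.

```go
package main

import (
	"fmt"
	"net/netip"
)

// cmsgByDestination picks the option from the destination address alone:
// anything convertible to IPv4 (including IPv4-mapped IPv6) gets IP_TOS.
func cmsgByDestination(remote netip.Addr) string {
	if remote.Unmap().Is4() {
		return "IP_TOS"
	}
	return "IPV6_TCLASS"
}

// cmsgBySocket picks the option from the family of the local socket address.
func cmsgBySocket(local netip.Addr) string {
	if local.Is4() {
		return "IP_TOS"
	}
	return "IPV6_TCLASS"
}

// choicesDiffer reports whether the two strategies disagree for a given
// local and remote address, both given as strings.
func choicesDiffer(local, remote string) bool {
	l := netip.MustParseAddr(local)
	r := netip.MustParseAddr(remote)
	return cmsgByDestination(r) != cmsgBySocket(l)
}

func main() {
	// Local socket [::] and IPv4 destination, as in the reported log lines.
	fmt.Println("strategies differ:", choicesDiffer("::", "45.11.106.155"))
}
```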

FNsi commented 12 months ago

I cannot create an issue in the GitHub mobile client; everything redirects me to Discord.

That's the same thing I face on my old ARMv7 Android device (uname -a: Linux 3.4.39 armv7).

marten-seemann commented 12 months ago

So this might just be due to ancient kernels. Is anyone aware of a way to detect support for these cmsgs, ideally without parsing kernel version numbers?

ainar-g commented 12 months ago

@marten-seemann, my guess would be that getting this EINVAL is the way. Perhaps, the code should send the message with the ECN data, check if the error is EINVAL, and, if it is, retry sending without the ECN data. If that second send succeeds, ECN is likely not supported in the kernel.
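
A minimal Go sketch of the retry idea described above, under stated assumptions: sendFunc is a hypothetical stand-in for the real packet write, and oldKernelSend merely simulates a kernel that rejects the ECN control message. This is not quic-go's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// sendFunc is a hypothetical stand-in for the real packet write; withECN
// says whether the ECN control message should be attached.
type sendFunc func(withECN bool) error

// sendWithECNFallback sends with ECN first. On EINVAL it retries once
// without ECN. ecnOK is false only when that retry succeeds, which suggests
// that the kernel does not accept the ECN control message.
func sendWithECNFallback(send sendFunc) (ecnOK bool, err error) {
	err = send(true)
	if err == nil {
		return true, nil
	}
	if !errors.Is(err, syscall.EINVAL) {
		// Unrelated failure: report it and draw no conclusion about ECN.
		return true, err
	}
	if retryErr := send(false); retryErr != nil {
		// Still failing without ECN, so EINVAL had some other cause.
		return true, retryErr
	}
	return false, nil
}

// oldKernelSend simulates a kernel that returns EINVAL whenever the ECN
// control message is attached.
func oldKernelSend(withECN bool) error {
	if withECN {
		return fmt.Errorf("sendmsg: %w", syscall.EINVAL)
	}
	return nil
}

func main() {
	ecnOK, err := sendWithECNFallback(oldKernelSend)
	fmt.Println("ECN usable:", ecnOK, "error:", err)
}
```

A caller could remember a false ecnOK for the lifetime of the connection so that the failed first attempt is not repeated for every packet.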

Also, as a related question, are there any plans to allow library clients to disable ECN through the Config structure? Using setenv to configure a library isn't exactly ideal, and there may be some clients who want to disable the feature regardless of the support.

marten-seemann commented 12 months ago

@marten-seemann, my guess would be that getting this EINVAL is the way. Perhaps, the code should send the message with the ECN data, check if the error is EINVAL, and, if it is, retry sending without the ECN data. If that second send succeeds, ECN is likely not supported in the kernel.

We already have similar logic for GSO: https://github.com/quic-go/quic-go/blob/3bf2e19d0dc617135ec9d6f3c5191740a27097c7/send_conn.go#L62-L68. I assume we could build something similar for EINVAL, but it's a bit unfortunate to match on such an unspecific error code.

Also, as a related question, are there any plans to allow library clients to disable ECN through the Config structure? Using setenv to configure a library isn't exactly ideal, and there may be some clients who want to disable the feature regardless of the support.

What's the use case for that?

ainar-g commented 12 months ago

What's the use case for that?

Situations where the developers know that the software is likely to be run on older/modified kernels without proper ECN support.

ardel commented 11 months ago

I can confirm that HTTP/3 doesn't work in Synology Docker under v0.107.43. Setting the env variable as advised above resolved the issue.

$ uname -a
Linux DS920 4.4.59+ #25556 SMP PREEMPT Tue Mar 21 22:25:44 CST 2023 x86_64 GNU/Linux synology_geminilake_920+

2023/12/11 20:41:36.202227 1#47 [debug] dnsproxy: https://cloudflare-dns.com:443/dns-query: response received over udp: "requesting https://cloudflare-dns.com:443/dns-query: Get_0rtt \"https://cloudflare-dns.com:443/dns-query?dns=AAABAAABAAAAAAAABHRlc3QAAAEAAQ\": INTERNAL_ERROR (local): write udp [::]:40657->104.16.248.249:443: sendmsg: invalid argument"

marten-seemann commented 11 months ago

I'd need some more hints debugging this. It's really hard to make any fixes if I can't reproduce this locally.

I already installed Ubuntu 18.04 in a VM (4.15.0-213-generic on aarch64), but everything works fine here.

ardel commented 11 months ago

Can someone try reproducing it in Ubuntu 16.04 that has 4.4 kernel? https://wiki.ubuntu.com/XenialXerus/ReleaseNotes

FNsi commented 11 months ago

I guess the problem affects kernels earlier than 4.14. Mine is 3.4; others in this issue are 4.4 and 4.5.

marten-seemann commented 11 months ago

I'm unable to install Ubuntu 16.04 due to some weird virtualization errors, both in UTM and in Parallels. The earliest version I can install is 18.04.

marten-seemann commented 11 months ago

I managed to run Ubuntu 14.04 and the "sendmsg: invalid argument" reproduces there.

It looks like the change we introduced in response to https://github.com/AdguardTeam/AdGuardHome/issues/6335 is causing the issue: If I use a 4 byte value for the IP_TOS cmsg, it works on old kernels (despite man 7 ip claiming that IP_TOS is a byte and not a uint32).
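
To make the difference between the two payload sizes concrete, here is a Go sketch that builds an IP_TOS control message with either a 1-byte or a 4-byte value, using the standard syscall package on Linux. The helper names buildTOSCmsg and cmsgLen are hypothetical. The sketch only shows the buffer layout and does not verify what any particular kernel accepts.

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// buildTOSCmsg builds one IP_TOS control message carrying tos in a payload
// of dataLen bytes. Only 1 and 4 are meaningful; other values fall back to 1.
func buildTOSCmsg(tos byte, dataLen int) []byte {
	if dataLen != 4 {
		dataLen = 1
	}
	buf := make([]byte, syscall.CmsgSpace(dataLen))
	hdr := (*syscall.Cmsghdr)(unsafe.Pointer(&buf[0]))
	hdr.Level = syscall.IPPROTO_IP
	hdr.Type = syscall.IP_TOS
	// The header length covers the header plus the unpadded payload.
	hdr.SetLen(syscall.CmsgLen(dataLen))
	data := buf[syscall.CmsgLen(0):]
	if dataLen == 4 {
		// Store the value as a native-endian 32-bit integer.
		*(*uint32)(unsafe.Pointer(&data[0])) = uint32(tos)
	} else {
		data[0] = tos
	}
	return buf
}

// cmsgLen returns the length field of the control message header in buf.
func cmsgLen(buf []byte) int {
	return int((*syscall.Cmsghdr)(unsafe.Pointer(&buf[0])).Len)
}

func main() {
	const ect0 = 0x02 // ECT(0) codepoint in the two ECN bits of the TOS byte
	fmt.Println("1-byte payload, cmsg_len:", cmsgLen(buildTOSCmsg(ect0, 1)))
	fmt.Println("4-byte payload, cmsg_len:", cmsgLen(buildTOSCmsg(ect0, 4)))
}
```

Because of alignment padding, the two buffers can have the same total size. What differs is the length recorded in the header, which is the value the kernel inspects when it validates the message.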

Re-reading #6335 I'm not sure anymore why we reduced the cmsg value to 1, other than to be more conformant with what the man page says. Newer versions of Linux seem to accept both values. I'm planning to revert the change (https://github.com/quic-go/quic-go/pull/4127), unless someone has a better idea how to fix this problem.

ainar-g commented 11 months ago

@marten-seemann, what about this comment? The original reason wasn't just to follow the manual but also because the size was causing reproducible issues that went away after the change to 1.

marten-seemann commented 11 months ago

I wasn't able to reproduce this failure. Maybe it only occurs on MIPS? Frankly, properly supporting amd64 and arm64 on all kernel versions is more important than other architectures, and we could disable ECN on mips altogether.

ainar-g commented 11 months ago

It's definitely not MIPS-only, because I ran the test on a machine running AMD64, and so did a lot of people for whom size 1 fixed that issue.

Considering that 4 is the size of an IPV6_TCLASS message, are you sure that the issue isn't that an IPv6 socket is receiving mapped IPv4 queries and thus there is a protocol mismatch, as I've described previously? Judging by some questions (like this and this), it was one of the things that had changed between 16.04 and 18.04.

marten-seemann commented 11 months ago

Yes. Please try out 14.04, size 1 fails there reliably, whereas size 4 works reliably. Size 1 seems to continue causing problems, see https://github.com/quic-go/quic-go/issues/4178 for example.

ainar-g commented 11 months ago

I'm getting no errors with our dnsproxy (using quic-go@v0.39.1) and a QUIC upstream on qemu with Ubuntu 16.04 (kernel 4.4, like a few people here have). Can you post which code you're currently using to test this?

marten-seemann commented 11 months ago

I wasn't able to reproduce it with 16.04, only with 14.04.

You can use the example client in the quic-go repo: go run example/client/main.go https://google.com. That should be sufficient to trigger the error.

ToasterDEV commented 9 months ago

Hey there!

I'm trying to run a DoQ server with both the latest release and the latest beta (v0.108.0-b.52) on port 853 within OPNSense 24.1_1 (FreeBSD OPNsense.home 13.2-RELEASE-p9 FreeBSD 13.2-RELEASE-p9 stable/24.1-n254969-8659880248c SMP amd64), but I'm currently running into the same issue.

Trying to run with ECN disabled (sudo QUIC_GO_DISABLE_ECN=true /usr/local/AdGuardHome/AdGuardHome -c /usr/local/AdGuardHome/AdGuardHome.yaml -v) still gives the following output:

2024/02/04 14:08:53.187286 94054#4715 [error] accepting quic stream: INTERNAL_ERROR (local): write udp [::]:853->192.168.1.252:60887: sendmsg: invalid argument
2024/02/04 14:08:53.187354 94054#4715 [debug] closing quic conn 192.168.1.1:853 with code 0

I'm currently testing with kdig, using the command kdig -d +quic -p 853 -t A @192.168.3.1 gitlab.com, which gives the following output:

;; DEBUG: Querying for owner(gitlab.com.), class(1), type(1), server(192.168.3.1), port(853), protocol(UDP)
;; WARNING: QUIC, peer took too long to respond
;; DEBUG: retrying server 192.168.3.1@853(UDP)
;; WARNING: QUIC, peer took too long to respond
;; DEBUG: retrying server 192.168.3.1@853(UDP)
;; WARNING: QUIC, peer took too long to respond
;; ERROR: failed to query server 192.168.3.1@853(UDP)

If there's anything else I can provide to help with debugging, please let me know!

overwatch3560 commented 6 months ago

@Freekers is this still an issue with the most recent version?

Freekers commented 6 months ago

@Freekers is this still an issue with the most recent version?

AFAIK it is

ainar-g commented 6 months ago

As an update, the current upstream issue is quic-go/quic-go#4396.

ainar-g commented 6 months ago

@Freekers, can you check version v0.108.0-a.893+856cc40c on the Edge release channel?

FNsi commented 6 months ago

Any update?

Freekers commented 6 months ago

@Freekers, can you check version v0.108.0-a.893+856cc40c on the Edge release channel?

Thanks, I'll try tomorrow.

Any update?

Sorry, haven't had time yet to try it out...

Freekers commented 6 months ago

@Freekers, can you check version v0.108.0-a.893+856cc40c on the Edge release channel?

I've tested Docker image version v0.108.0-a.896+6dabfb46 because I could not find an older/previous edge image on Docker hub.

I'm happy to report that the issue seems to be resolved! I no longer receive an error during 'Test Upstreams' while using quic upstream servers. I've made sure to remove the QUIC_GO_DISABLE_ECN=true envar from my docker-compose.yml before starting the container. I've also checked the verbose logging and the previously reported errors are now absent there as well.

So yes; the issue seems to be resolved! Thanks so much for collaborating and solving this issue :) @ainar-g @marten-seemann

marten-seemann commented 6 months ago

That's great news! Thank you everyone!
