Closed Freekers closed 6 months ago
Thanks for the report and the logs. I suspect that the kernel version may be the reason for the difference. Can you show the output of uname -a on both machines?
Also, does adding QUIC_GO_DISABLE_ECN=true on the machine with the issue fix it?
Output of uname -a on the working machine:
Linux raptor 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Output of uname -a on the broken machine:
Linux TurboPolyp 4.4.180+ #42962 SMP Mon May 29 14:38:23 CST 2023 x86_64 GNU/Linux synology_apollolake_918+
Where do I add QUIC_GO_DISABLE_ECN=true? Is this an environment variable? If so, I suppose I would need to set it inside the Docker container, correct?
Thanks
Thanks for the info.
Where do I add QUIC_GO_DISABLE_ECN=true? Is this an environment variable? If so, I suppose I would need to set it inside the Docker container, correct?
Yes, this should be set in the container's environment. The AdGuardHome binary should be able to observe the value of that environment variable.
I have set the environment variable inside the container as follows:
docker exec -it adguard sh
/opt/adguardhome/work # echo $QUIC_GO_DISABLE_ECN
/opt/adguardhome/work # export QUIC_GO_DISABLE_ECN=true
/opt/adguardhome/work # echo $QUIC_GO_DISABLE_ECN
true
But sadly it does not fix the issue (same error message). I also tried using the edge image, same issue.
If you're running AGH with something like docker run, you should use the -e/--env option.
Oops, my bad, you're right. I've now set the environment variable in my docker-compose file as follows:
services:
  adguard:
    image: adguard/adguardhome:latest
    restart: always
    container_name: adguard
    network_mode: "host"
    environment:
      - TZ=Europe/Amsterdam
      - QUIC_GO_DISABLE_ECN=true
    volumes:
      - /volume1/docker/adguard/work:/opt/adguardhome/work
      - /volume1/docker/adguard/conf:/opt/adguardhome/conf
I can confirm that the issue is now resolved. The QUIC upstream DNS server now works again, thank you.
What does this setting QUIC_GO_DISABLE_ECN=true do exactly?
Thanks
What does this setting QUIC_GO_DISABLE_ECN=true do exactly?
It disables Explicit Congestion Notification (ECN), an additional congestion-control feature added to quic-go in v0.39.0.
It's good that the workaround works, but it's still weird, as AGH v0.107.41 uses quic-go v0.39.2, which should have fixed the sendmsg: invalid argument issue. Perhaps Synology has a weird kernel build.
@marten-seemann, is there any way we could debug this further?
@ainar-g It turns out @marten-seemann only patched this for the FreeBSD, AMD64 (aka x86_64) environment: https://github.com/AdguardTeam/AdGuardHome/issues/6301
Users of Asuswrt-Merlin routers are also experiencing this issue: https://www.snbforums.com/threads/adguardhome-new-releases-2023.85191/post-875540. As a temporary fix, I plan to add the QUIC_GO_DISABLE_ECN=true option to the Env variable PREARGS until an adequate fix is provided.
Here is an example of the environment of Asuswrt-Merlin Routers:
ASUSWRT-Merlin RT-AX88U_PRO 3004.388.4_0 Mon Aug 21 19:34:19 UTC 2023
admin@RT-AX88U_Pro-29B8:/tmp/home/root# uname -a
Linux RT-AX88U_Pro-29B8 4.19.183 #1 SMP PREEMPT Mon Aug 21 15:34:46 EDT 2023 aarch64 ASUSWRT-Merlin
HTH
How would the cmsg look on other platforms? Would be good to fix this in quic-go, the env is just an escape hatch and shouldn’t be a permanent solution.
@jumpsmm7, you're pointing to the FreeBSD issue, but all Linux platforms should have been fixed in #6335. See quic-go/quic-go#4127.
As for the control message, I'm leaning towards this being a change in the Linux kernel somewhere around v5, since so far this seems to affect only those with kernels in the v4.x branch, but I don't have any solid proof just yet.
@marten-seemann, another theory I've had is that the issue could have something to do with how quic-go sets IP_TOS/IPV6_TCLASS depending on whether or not an IP address is convertible to IPv4 rather than checking for the socket family. It could also be dependent on sysctl net.ipv6.bindv6only, although I cannot reproduce any errors either way on my Ubuntu machine with the v5.15.0 kernel. I've seen some C code that just sets both, too, but I'm not sure if that's the correct solution.
I cannot create an issue in the GitHub mobile client; everything redirects me to Discord.
I'm facing the same problem on my old ARMv7 Android device (uname -a: Linux 3.4.39 armv7).
So this might just be due to ancient kernels. Is anyone aware of a way to detect support for these cmsgs, ideally without parsing kernel version numbers?
@marten-seemann, my guess would be that getting this EINVAL is the way. Perhaps the code should send the message with the ECN data, check if the error is EINVAL, and, if it is, retry sending without the ECN data. If that second send succeeds, ECN is likely not supported in the kernel.
Also, as a related question, are there any plans to allow library clients to disable ECN through the Config structure? Using setenv to configure a library isn't exactly ideal, and there may be some clients who want to disable the feature regardless of the support.
@marten-seemann, my guess would be that getting this EINVAL is the way. Perhaps the code should send the message with the ECN data, check if the error is EINVAL, and, if it is, retry sending without the ECN data. If that second send succeeds, ECN is likely not supported in the kernel.
We already have similar logic for GSO: https://github.com/quic-go/quic-go/blob/3bf2e19d0dc617135ec9d6f3c5191740a27097c7/send_conn.go#L62-L68. I assume we could build something similar for EINVAL, but it's a bit unfortunate to match on such an unspecific error code.
Also, as a related question, are there any plans to allow library clients to disable ECN through the Config structure? Using setenv to configure a library isn't exactly ideal, and there may be some clients who want to disable the feature regardless of the support.
What's the use case for that?
What's the use case for that?
Situations where the developers know that the software is likely to be run on older/modified kernels without proper ECN support.
I can confirm that HTTP/3 doesn't work in Synology Docker under v0.107.43. Setting the env variable as advised above resolved the issue.
$ uname -a
Linux DS920 4.4.59+ #25556 SMP PREEMPT Tue Mar 21 22:25:44 CST 2023 x86_64 GNU/Linux synology_geminilake_920+
2023/12/11 20:41:36.202227 1#47 [debug] dnsproxy: https://cloudflare-dns.com:443/dns-query: response received over udp: "requesting https://cloudflare-dns.com:443/dns-query: Get_0rtt \"https://cloudflare-dns.com:443/dns-query?dns=AAABAAABAAAAAAAABHRlc3QAAAEAAQ\": INTERNAL_ERROR (local): write udp [::]:40657->104.16.248.249:443: sendmsg: invalid argument"
I'd need some more hints debugging this. It's really hard to make any fixes if I can't reproduce this locally.
I already installed Ubuntu 18.04 in a VM (4.15.0-213-generic on aarch64), but everything works fine here.
Can someone try reproducing it in Ubuntu 16.04 that has 4.4 kernel? https://wiki.ubuntu.com/XenialXerus/ReleaseNotes
I guess the problem affects kernels earlier than 4.14. Mine is 3.4; others in this issue are on 4.4 and 4.5.
I'm unable to install Ubuntu 16.04 due to some weird virtualization errors, both in UTM and in Parallels. The earliest version I can install is 18.04.
I managed to run Ubuntu 14.04 and the "sendmsg: invalid argument" reproduces there.
It looks like the change we introduced in response to https://github.com/AdguardTeam/AdGuardHome/issues/6335 is causing the issue: If I use a 4 byte value for the IP_TOS cmsg, it works on old kernels (despite man 7 ip claiming that IP_TOS is a byte and not a uint32).
Re-reading #6335 I'm not sure anymore why we reduced the size of the cmsg value to 1 byte, other than to be more conformant with what the man page says. Newer versions of Linux seem to accept both sizes. I'm planning to revert the change (https://github.com/quic-go/quic-go/pull/4127), unless someone has a better idea of how to fix this problem.
@marten-seemann, what about this comment? The original reason wasn't just to follow the manual but also because the size was causing reproducible issues that went away after the change to 1.
I wasn't able to reproduce this failure. Maybe it only occurs on MIPS? Frankly, properly supporting amd64 and arm64 on all kernel versions is more important than other architectures, and we could disable ECN on mips altogether.
It's definitely not MIPS-only, because I ran the test on a machine running AMD64, and so did a lot of people for whom size 1 fixed that issue.
Considering that 4 is the size of an IPV6_TCLASS message, are you sure that the issue isn't that an IPv6 socket is receiving mapped IPv4 queries and thus there is a protocol mismatch, as I've described previously? Judging by some questions (like this and this), it was one of the things that had changed between 16.04 and 18.04.
Yes. Please try out 14.04, size 1 fails there reliably, whereas size 4 works reliably. Size 1 seems to continue causing problems, see https://github.com/quic-go/quic-go/issues/4178 for example.
I'm getting no errors with our dnsproxy (using quic-go@v0.39.1) and a QUIC upstream on qemu with Ubuntu 16.04 (kernel 4.4, like a few people here have). Can you post which code you're currently using to test this?
I wasn't able to reproduce it with 16.04, only with 14.04.
You can use the example client in the quic-go repo: go run example/client/main.go https://google.com. That should be sufficient to trigger the error.
Hey there!
I'm trying to run a DoQ server with both the latest release and the latest beta (v0.108.0-b.52) on port 853 within OPNSense 24.1_1 (FreeBSD OPNsense.home 13.2-RELEASE-p9 FreeBSD 13.2-RELEASE-p9 stable/24.1-n254969-8659880248c SMP amd64), but I'm currently running into the same issue.
Running with ECN disabled (sudo QUIC_GO_DISABLE_ECN=true /usr/local/AdGuardHome/AdGuardHome -c /usr/local/AdGuardHome/AdGuardHome.yaml -v) still gives the following output:
2024/02/04 14:08:53.187286 94054#4715 [error] accepting quic stream: INTERNAL_ERROR (local): write udp [::]:853->192.168.1.252:60887: sendmsg: invalid argument
2024/02/04 14:08:53.187354 94054#4715 [debug] closing quic conn 192.168.1.1:853 with code 0
I'm currently testing with kdig, using the command kdig -d +quic -p 853 -t A @192.168.3.1 gitlab.com, which gives the following output:
;; DEBUG: Querying for owner(gitlab.com.), class(1), type(1), server(192.168.3.1), port(853), protocol(UDP)
;; WARNING: QUIC, peer took too long to respond
;; DEBUG: retrying server 192.168.3.1@853(UDP)
;; WARNING: QUIC, peer took too long to respond
;; DEBUG: retrying server 192.168.3.1@853(UDP)
;; WARNING: QUIC, peer took too long to respond
;; ERROR: failed to query server 192.168.3.1@853(UDP)
If there's anything else I can provide to help with debugging, please let me know!
@Freekers is this still an issue with the most recent version?
@Freekers is this still an issue with the most recent version?
AFAIK it is
As an update, the current upstream issue is quic-go/quic-go#4396.
@Freekers, can you check version v0.108.0-a.893+856cc40c on the Edge release channel?
Any update?
@Freekers, can you check version v0.108.0-a.893+856cc40c on the Edge release channel?
Thanks, I'll try tomorrow.
Any update?
Sorry, haven't had time yet to try it out...
@Freekers, can you check version v0.108.0-a.893+856cc40c on the Edge release channel?
I've tested Docker image version v0.108.0-a.896+6dabfb46 because I could not find an older/previous edge image on Docker hub.
I'm happy to report that the issue seems to be resolved! I no longer receive an error during 'Test Upstreams' while using quic upstream servers. I've made sure to remove the QUIC_GO_DISABLE_ECN=true environment variable from my docker-compose.yml before starting the container. I've also checked the verbose logging and the previously reported errors are now absent there as well.
So yes; the issue seems to be resolved! Thanks so much for collaborating and solving this issue :) @ainar-g @marten-seemann
That's great news! Thank you everyone!
Prerequisites
[X] I have checked the Wiki and Discussions and found no answer
[X] I have searched other issues and found no duplicates
[X] I want to report a bug and not ask a question or ask for help
[X] I have set up AdGuard Home correctly and configured clients to use it. (Use the Discussions for help with installing and configuring clients.)
Platform (OS and CPU architecture)
Custom (please mention in the description)
Installation
Docker
Setup
Other (please mention in the description)
AdGuard Home version
v0.107.41
Action
Click 'Test Upstreams'
Expected result
Confirmation that the upstream server is working correctly.
Actual result
Server "quic://XXXXX.dns.nextdns.io": could not be used, please check that you've written it correctly
Additional information and/or screenshots
I'm running two AGH instances. After updating both instances from v0.107.40 to v0.107.41, one instance works fine, but on the other one upstream DNS-over-QUIC servers no longer work. The error displayed is: Server "quic://XXXXX.dns.nextdns.io": could not be used, please check that you've written it correctly. I also tried using the QUIC server of AdGuard, but the issue is the same.
Both instances run on Docker. However, the host OS is different. The working instance runs Ubuntu Server 22.04. The broken/non-working instance is running on a Synology NAS (x86_64 GNU/Linux synology_apollolake_918+). I've already deleted the container and re-pulled the image, but the problem is still there. This DNS-over-QUIC upstream server was working on both instances on v0.107.40.
I enabled debug logging and found the following which could be related;
This issue seems related to: https://github.com/AdguardTeam/AdGuardHome/issues/6301 and https://github.com/AdguardTeam/AdGuardHome/issues/6335 which was resolved in v0.107.40