alpinelinux / docker-alpine

Official Alpine Linux Docker image. Win at minimalism!
MIT License

cURL IPv4 issue since alpine `3.19` #366

Open niconoe- opened 6 months ago

niconoe- commented 6 months ago

Hi, and thank you for your awesome work!

I'm experiencing an issue with alpine 3.19 when using curl: it seems that curl only tries IPv6 rather than switching to the IP version that can actually connect. The thing is, this doesn't seem to come from cURL itself, as it works like a charm on alpine 3.18.

How to reproduce

Alpine 3.19 (curl classic)

$ > docker run --rm -it --entrypoint=/bin/sh alpine:3.19
# In container:
/ > apk add curl
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
# (1/8) Installing ca-certificates (20230506-r0)
# (2/8) Installing brotli-libs (1.1.0-r1)
# (3/8) Installing c-ares (1.22.1-r0)
# (4/8) Installing libunistring (1.1-r2)
# (5/8) Installing libidn2 (2.3.4-r4)
# (6/8) Installing nghttp2-libs (1.58.0-r0)
# (7/8) Installing libcurl (8.5.0-r0)
# (8/8) Installing curl (8.5.0-r0)
# Executing busybox-1.36.1-r15.trigger
# Executing ca-certificates-20230506-r0.trigger
# OK: 12 MiB in 23 packages
/ > curl www.google.com
# curl: (7) Failed to connect to www.google.com port 80 after 2003 ms: Couldn't connect to server

Alpine 3.19 (curl with --ipv4 option)

$ > docker run --rm -it --entrypoint=/bin/sh alpine:3.19
# In container:
/ > apk add curl
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
# (1/8) Installing ca-certificates (20230506-r0)
# (2/8) Installing brotli-libs (1.1.0-r1)
# (3/8) Installing c-ares (1.22.1-r0)
# (4/8) Installing libunistring (1.1-r2)
# (5/8) Installing libidn2 (2.3.4-r4)
# (6/8) Installing nghttp2-libs (1.58.0-r0)
# (7/8) Installing libcurl (8.5.0-r0)
# (8/8) Installing curl (8.5.0-r0)
# Executing busybox-1.36.1-r15.trigger
# Executing ca-certificates-20230506-r0.trigger
# OK: 12 MiB in 23 packages
/ > curl --ipv4 www.google.com
# <html>…</html> # The Google home page

Alpine 3.18 (curl classic)

$ > docker run --rm -it --entrypoint=/bin/sh alpine:3.18
# In container:
/ > apk add curl
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
# (1/7) Installing ca-certificates (20230506-r0)
# (2/7) Installing brotli-libs (1.0.9-r14)
# (3/7) Installing libunistring (1.1-r1)
# (4/7) Installing libidn2 (2.3.4-r1)
# (5/7) Installing nghttp2-libs (1.57.0-r0)
# (6/7) Installing libcurl (8.5.0-r0)
# (7/7) Installing curl (8.5.0-r0)
# Executing busybox-1.36.1-r5.trigger
# Executing ca-certificates-20230506-r0.trigger
# OK: 12 MiB in 22 packages
/ > curl www.google.com
# <html>…</html> # The Google home page

As you can see, the curl version is exactly the same between all tests (8.5.0-r0) but still, there's a difference between Alpine 3.18 and Alpine 3.19.

I expect the curl command on Alpine 3.19 to work without needing to force IPv4.

If you need more info, feel free to ask. Thanks a lot!

EDIT: after a very short investigation, I can see that Alpine 3.19 now adds c-ares=1.22.1-r0 as a dependency of cURL, and I just discovered that https://github.com/c-ares/c-ares/issues/652 could be related. AFAIK, when cURL resolves a host name, it tries both IPv4 and IPv6 by default and takes the faster match. c-ares helps cURL do that in parallel, so that the IPv4 and IPv6 DNS lookups run concurrently, resulting in faster cURL calls. But with the issue linked above, it looks like when multiple DNS lookups are issued and one fails, the failure status is returned as final. Therefore, as the IPv6 lookup fails faster than the IPv4 one resolves, c-ares wrongly tells cURL that the host is unreachable. I'm not 100% sure about this, but I think it deserves a look. I wasn't able to remove c-ares to try without it.

bradh352 commented 6 months ago

The description isn't exactly accurate about the behavior. See https://github.com/c-ares/c-ares/pull/551 for a better description, but basically a change was made in c-ares 1.20.0 to not go through the entire timeout sequence if we had at least a partial reply, as it is very likely that it won't work. It still waits for the other address family to time out or hit some other issue on the current request. So if someone has tries=3, timeout=2s and 2 DNS servers, it could take a minimum of 3*2*2 = 12 seconds (it's actually more, as there's an additional penalty per retry to the same server), whereas if one address class returned in 100ms, it would take at most 2s to return the partial result, since it would terminate the other address family's additional attempts.

Now, issues have apparently been reported against glibc, which does something similar to this, as per https://man7.org/linux/man-pages/man5/resolv.conf.5.html:
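As a worked version of the arithmetic above, here is a minimal Python sketch. The function name and parameters are illustrative only (they are not c-ares API), and the sketch deliberately ignores the extra per-retry penalty mentioned above, so it only gives a lower bound.

```python
def full_retry_floor_seconds(tries: int, timeout_s: float, servers: int) -> float:
    """Lower bound on the wait when every attempt to every server times out.

    Assumption: each try costs exactly `timeout_s`; the additional
    per-retry penalty described in the comment above is not modelled.
    """
    return tries * timeout_s * servers


# The example from the comment: tries=3, timeout=2s, 2 DNS servers.
print(full_retry_floor_seconds(tries=3, timeout_s=2, servers=2))  # 12
```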

single-request (since glibc 2.10)
                     Sets RES_SNGLKUP in _res.options.  By default,
                     glibc performs IPv4 and IPv6 lookups in parallel
                     since glibc 2.9.  Some appliance DNS servers cannot
                     handle these queries properly and make the requests
                     time out.  This option disables the behavior and
                     makes glibc perform the IPv6 and IPv4 requests
                     sequentially (at the cost of some slowdown of the
                     resolving process).

So likely before c-ares 1.20.0, the retries allowed this to eventually succeed in such an environment. Currently c-ares doesn't honor the glibc single-request option.

It would probably be good to know if this is what is really happening in your environment, a tcpdump/pcap would be useful. You should probably open a ticket in https://github.com/c-ares/c-ares/issues with your findings.

bradh352 commented 6 months ago

I should also mention that we just added alpine linux automated (CI/CD) testing to c-ares to ensure there are no behavioral differences (e.g. due to musl c). All tests are passing, so I'm pretty sure whatever you are experiencing is outside of alpine's scope.

niconoe- commented 6 months ago

Thank you very much for your answers.

On my side, I'm not that advanced in networking, so I'm not 100% sure I can handle this. I'll give it a try by looking at tcpdump and pcap. I really do understand that your automated tests prevent you from releasing something buggy, and I'm glad that's how it works!

Nevertheless, I'm curious about the result you got when simply trying to reproduce my commands. Did it actually work for you? I mean, when running from Docker containers, I expect almost nothing to be fetched from my local environment, as I thought containers were mostly isolated. I'm aware that core libs from the native OS are used, of course, but I wouldn't expect any difference between my OS, a canonical alpine 3.18 on this OS and a canonical alpine 3.19 on this exact same OS, as what's imported into the containers to make them work seems highly generic and kernel-related to me.

Anyway, with help from the IT colleagues I'll ask and with your advice, I'll try to investigate as much as I can to identify why I'm experiencing this issue.

bradh352 commented 6 months ago

I haven't tried your exact scenario (using curl), just building and running the c-ares test suite on alpine linux.

Tithugues commented 6 months ago

I tried and indeed, I've the same issue:

$ docker run --rm -it --entrypoint=/bin/sh alpine:3.19 -c "apk add curl && curl --trace - --trace-time www.google.com"
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
(1/8) Installing ca-certificates (20230506-r0)
(2/8) Installing brotli-libs (1.1.0-r1)
(3/8) Installing c-ares (1.22.1-r0)
(4/8) Installing libunistring (1.1-r2)
(5/8) Installing libidn2 (2.3.4-r4)
(6/8) Installing nghttp2-libs (1.58.0-r0)
(7/8) Installing libcurl (8.5.0-r0)
(8/8) Installing curl (8.5.0-r0)
Executing busybox-1.36.1-r15.trigger
Executing ca-certificates-20230506-r0.trigger
OK: 12 MiB in 23 packages
14:14:49.612106 == Info: Host www.google.com:80 was resolved.
14:14:49.612159 == Info: IPv6: 2a00:1450:4001:80b::2004
14:14:49.612165 == Info: IPv4: (none)
14:14:49.612212 == Info:   Trying [2a00:1450:4001:80b::2004]:80...
14:14:49.612242 == Info: Immediate connect fail for 2a00:1450:4001:80b::2004: Address not available
14:14:49.612258 == Info: Failed to connect to www.google.com port 80 after 2002 ms: Couldn't connect to server
14:14:49.612268 == Info: Closing connection
curl: (7) Failed to connect to www.google.com port 80 after 2002 ms: Couldn't connect to server

bradh352 commented 6 months ago

Well, I am running the current c-ares main, not v1.22, which is a couple of releases behind (the current release is v1.24). Perhaps there is some issue in v1.22?

Anyhow, in our current c-ares CI system, this is the latest alpine build with tests: https://api.cirrus-ci.com/v1/task/4971237735137280/logs/main.log

If you search for ./ci/test.sh, this is where the tests start. The first test runs adig, which is similar to BIND's dig, and its output ends with ;; MSG SIZE. Immediately after that is the output of ahost www.google.com, and you can see it returns both IPv4 and IPv6 addresses:

www.google.com                      142.250.1.104
www.google.com                      142.250.1.99
www.google.com                      142.250.1.105
www.google.com                      142.250.1.106
www.google.com                      142.250.1.147
www.google.com                      142.250.1.103
www.google.com                      2607:f8b0:4001:c09::68
www.google.com                      2607:f8b0:4001:c09::69
www.google.com                      2607:f8b0:4001:c09::67
www.google.com                      2607:f8b0:4001:c09::63

In theory, that's exactly what curl should see as curl should internally be using the same function as ahost does (ares_getaddrinfo), and we can see both ipv4 and ipv6 addresses. That said, I don't know what alpine test environment you're using, as it could very well be environmental with what DNS servers you are using.

Everything after that point is just running the whole test suite.
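To compare what a resolver returns per address family, similar in spirit to the ahost output above, a rough Python sketch could look like this. Note the assumptions: it uses Python's `socket.getaddrinfo`, which goes through the system resolver rather than c-ares, and the `resolve` parameter is a hypothetical hook added only so the function can be exercised without network access.

```python
import socket


def addresses_by_family(host, port=80, resolve=socket.getaddrinfo):
    """Return the resolved addresses of `host`, grouped by address family.

    `resolve` defaults to the system resolver (not c-ares); a lookup that
    fails for one family leaves that family's list empty.
    """
    result = {"IPv4": [], "IPv6": []}
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        try:
            infos = resolve(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            continue  # nothing resolved for this family
        for info in infos:
            address = info[4][0]  # sockaddr tuple starts with the address
            if address not in result[label]:
                result[label].append(address)
    return result
```

Calling `addresses_by_family("www.google.com")` inside the affected container and on a working host would show whether the system resolver also returns only one family, which helps separate resolver-library behavior from DNS server behavior.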

niconoe- commented 6 months ago

On my side, I just gave it a try today with tcpdump, here are the results:

Preparation

$ > docker run --rm -it --entrypoint=/bin/sh alpine:3.19
# In container:
/ > apk add curl tcpdump
# Downloading…

Logging shell

# Display verbosely, with a hexadecimal content dump and numeric IPs and ports, on interface "eth0" (the default one in a Docker container), where source or destination is my current IP:
/ > tcpdump -vvXnni eth0 src $(hostname -i) or dst $(hostname -i)
# tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes

Calling cURL shell

/ > curl --trace - --trace-time www.google.com

Logging shell updating…

# 16:09:59.631641 IP (tos 0x0, ttl 64, id 18664, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.39924 > 172.18.21.134.53: [bad udp cksum 0x4be0 -> 0x4f44!] 4305+ [1au] A? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  4500 0047 48e8 4000 4011 a622 0a01 8002  E..GH.@.@.."....
#     0x0010:  ac12 1586 9bf4 0035 0033 4be0 10d1 0100  .......5.3K.....
#     0x0020:  0001 0000 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 0001 0001 0000 2905  gle.com.......).
#     0x0040:  0000 0000 0000 00                        .......
# 16:09:59.631696 IP (tos 0xc0, ttl 64, id 990, offset 0, flags [none], proto ICMP (1), length 99)
#     172.18.21.134 > 10.1.128.2: ICMP 172.18.21.134 udp port 53 unreachable, length 79
#     IP (tos 0x0, ttl 64, id 18664, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.39924 > 172.18.21.134.53: [bad udp cksum 0x4be0 -> 0x4f44!] 4305+ [1au] A? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  45c0 0063 03de 0000 4001 2a61 ac12 1586  E..c....@.*a....
#     0x0010:  0a01 8002 0303 4c41 0000 0000 4500 0047  ......LA....E..G
#     0x0020:  48e8 4000 4011 a622 0a01 8002 ac12 1586  H.@.@.."........
#     0x0030:  9bf4 0035 0033 4be0 10d1 0100 0001 0000  ...5.3K.........
#     0x0040:  0000 0001 0377 7777 0667 6f6f 676c 6503  .....www.google.
#     0x0050:  636f 6d00 0001 0001 0000 2905 0000 0000  com.......).....
#     0x0060:  0000 00                                  ...
# 16:09:59.631737 IP (tos 0x0, ttl 64, id 48319, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.50287 > 172.18.86.200.53: [bad udp cksum 0x8d22 -> 0xfbd0!] 64107+ [1au] AAAA? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  4500 0047 bcbf 4000 4011 f108 0a01 8002  E..G..@.@.......
#     0x0010:  ac12 56c8 c46f 0035 0033 8d22 fa6b 0100  ..V..o.5.3.".k..
#     0x0020:  0001 0000 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 001c 0001 0000 2905  gle.com.......).
#     0x0040:  0000 0000 0000 00                        .......
# 16:09:59.633326 IP (tos 0x0, ttl 124, id 58881, offset 0, flags [none], proto UDP (17), length 99)
#     172.18.86.200.53 > 10.1.128.2.50287: [udp sum ok] 64107 q: AAAA? www.google.com. 1/0/1 www.google.com. AAAA 2a00:1450:4001:806::2004 ar: . OPT UDPsize=4000 (71)
#     0x0000:  4500 0063 e601 0000 7c11 cbaa ac12 56c8  E..c....|.....V.
#     0x0010:  0a01 8002 0035 c46f 004f 7427 fa6b 8180  .....5.o.Ot'.k..
#     0x0020:  0001 0001 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 001c 0001 c00c 001c  gle.com.........
#     0x0040:  0001 0000 0050 0010 2a00 1450 4001 0806  .....P..*..P@...
#     0x0050:  0000 0000 0000 2004 0000 290f a000 0000  ..........).....
#     0x0060:  0000 00                                  ...

Calling cURL shell updating…

# 16:10:01.692831 == Info: Host www.google.com:80 was resolved.
# 16:10:01.692920 == Info: IPv6: 2a00:1450:4001:806::2004
# 16:10:01.692949 == Info: IPv4: (none)
# 16:10:01.693032 == Info:   Trying [2a00:1450:4001:806::2004]:80...
# 16:10:01.693118 == Info: Immediate connect fail for 2a00:1450:4001:806::2004: Address not available
# 16:10:01.693162 == Info: Failed to connect to www.google.com port 80 after 2003 ms: Couldn't connect to server
# 16:10:01.693196 == Info: Closing connection
# curl: (7) Failed to connect to www.google.com port 80 after 2003 ms: Couldn't connect to server

Logging shell updating…

# 16:10:04.765537 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.128.1 tell 10.1.128.2, length 28
#     0x0000:  0001 0800 0604 0001 0242 0a01 8002 0a01  .........B......
#     0x0010:  8002 0000 0000 0000 0a01 8001            ............
# 16:10:04.765607 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.128.2 tell 10.1.128.1, length 28
#     0x0000:  0001 0800 0604 0001 0242 0c35 ad1f 0a01  .........B.5....
#     0x0010:  8001 0000 0000 0000 0a01 8002            ............
# 16:10:04.765616 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.128.2 is-at 02:42:0a:01:80:02, length 28
#     0x0000:  0001 0800 0604 0002 0242 0a01 8002 0a01  .........B......
#     0x0010:  8002 0242 0c35 ad1f 0a01 8001            ...B.5......
# 16:10:04.765680 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.128.1 is-at 02:42:0c:35:ad:1f, length 28
#     0x0000:  0001 0800 0604 0002 0242 0c35 ad1f 0a01  .........B.5....
#     0x0010:  8001 0242 0a01 8002 0a01 8002            ...B........

The only thing I find suspicious is ICMP 172.18.21.134 udp port 53 unreachable, but I really don't know why it happens, nor why everything works fine when forcing IPv4 during the cURL call…

I'll check with my IT dept. people working on the DNS configuration, maybe :shrug:

bradh352 commented 6 months ago

172.18.21.134 received an ICMP unreachable reply, and 172.18.86.200 works. That said, it's not immediately clear to me why the A request went to 172.18.21.134 and the AAAA request went to 172.18.86.200. Can you share your /etc/resolv.conf ? I wonder if you have rotate enabled for the dns servers.
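To double-check which query went to which server, the two outgoing packets from the hex dump above can be decoded by hand. The following Python sketch is only an illustration: the function name is made up, it assumes an IPv4 packet carrying a single uncompressed DNS question over UDP, and the two byte strings are copied from the tcpdump output earlier in this thread.

```python
import struct

QTYPE_NAMES = {1: "A", 28: "AAAA"}


def decode_dns_query(packet: bytes) -> dict:
    """Decode destination, transaction ID, name and type of a DNS query.

    Assumes IPv4 + UDP and one question without name compression.
    """
    ip_header_len = (packet[0] & 0x0F) * 4      # IHL is counted in 32-bit words
    dst_ip = ".".join(str(b) for b in packet[16:20])
    dns = packet[ip_header_len + 8:]            # skip the 8-byte UDP header
    (txid,) = struct.unpack("!H", dns[:2])
    labels, pos = [], 12                        # question follows the 12-byte DNS header
    while dns[pos] != 0:
        length = dns[pos]
        labels.append(dns[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    (qtype,) = struct.unpack("!H", dns[pos + 1:pos + 3])
    return {
        "dst": dst_ip,
        "id": txid,
        "name": ".".join(labels),
        "qtype": QTYPE_NAMES.get(qtype, str(qtype)),
    }


# Bytes copied from the first capture (16:09:59.631641 and 16:09:59.631737).
FIRST_QUERY = bytes.fromhex("""
    4500 0047 48e8 4000 4011 a622 0a01 8002
    ac12 1586 9bf4 0035 0033 4be0 10d1 0100
    0001 0000 0000 0001 0377 7777 0667 6f6f
    676c 6503 636f 6d00 0001 0001 0000 2905
    0000 0000 0000 00
""")
SECOND_QUERY = bytes.fromhex("""
    4500 0047 bcbf 4000 4011 f108 0a01 8002
    ac12 56c8 c46f 0035 0033 8d22 fa6b 0100
    0001 0000 0000 0001 0377 7777 0667 6f6f
    676c 6503 636f 6d00 001c 0001 0000 2905
    0000 0000 0000 00
""")

print(decode_dns_query(FIRST_QUERY))   # A query, id 4305, sent to 172.18.21.134
print(decode_dns_query(SECOND_QUERY))  # AAAA query, id 64107, sent to 172.18.86.200
```

The decoded IDs and destinations match what tcpdump printed, which confirms that the A and AAAA questions really went to two different servers.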

bradh352 commented 6 months ago

Is that really the entirety of the tcp dump? Typically an event should be received on an ICMP unreachable which then recv() would be called and then detect the udp destination isn't valid, so we should have seen another "A" record request go out, especially considering the timings shown here.

niconoe- commented 6 months ago

172.18.21.134 received an ICMP unreachable reply, and 172.18.86.200 works. That said, it's not immediately clear to me why the A request went to 172.18.21.134 and the AAAA request went to 172.18.86.200. Can you share your /etc/resolv.conf ? I wonder if you have rotate enabled for the dns servers.

The /etc/resolv.conf file contains this:

nameserver 127.0.0.1
nameserver 172.18.86.200
nameserver 172.18.32.204
nameserver 172.18.86.207
options edns0 trust-ad
search ad.XXXXX.com # My company's AD
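For anyone wanting to check a similar setup, here is a small Python sketch that reads a resolv.conf like the one above and answers the two questions raised in this thread: is `rotate` (or `single-request`) set, and does the file mix a loopback resolver with remote servers? The helper names are hypothetical, and the parser only handles the `nameserver` and `options` keywords.

```python
import ipaddress


def parse_resolv_conf(text: str) -> dict:
    """Extract nameservers and options from resolv.conf-style text."""
    servers, options = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop annotations/comments
        if not line:
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            servers.append(values[0])
        elif keyword == "options":
            options.extend(values)
    return {"nameservers": servers, "options": options}


def mixes_loopback_and_remote(servers) -> bool:
    """True when a local resolver is listed alongside remote DNS servers."""
    flags = [ipaddress.ip_address(s).is_loopback for s in servers]
    return any(flags) and not all(flags)


SAMPLE = """\
nameserver 127.0.0.1
nameserver 172.18.86.200
nameserver 172.18.32.204
nameserver 172.18.86.207
options edns0 trust-ad
search ad.XXXXX.com # My company's AD
"""

config = parse_resolv_conf(SAMPLE)
print("rotate" in config["options"])                     # False
print(mixes_loopback_and_remote(config["nameservers"]))  # True
```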

Is that really the entirety of the tcp dump? Typically an event should be received on an ICMP unreachable which then recv() would be called and then detect the udp destination isn't valid, so we should have seen another "A" record request go out, especially considering the timings shown here.

That was the full tcpdump output I got when filtering on my IP address. Here is the result without the filter, keeping the noise as low as I could:

/ > tcpdump -vvXnni any
# tcpdump: data link type LINUX_SLL2
# tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
# 17:18:01.248975 eth0  Out IP (tos 0x0, ttl 64, id 10392, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.34956 > 172.18.21.134.53: [bad udp cksum 0x4be0 -> 0x4ecf!] 9390+ [1au] A? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  4500 0047 2898 4000 4011 c672 0a01 8002  E..G(.@.@..r....
#     0x0010:  ac12 1586 888c 0035 0033 4be0 24ae 0100  .......5.3K.$...
#     0x0020:  0001 0000 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 0001 0001 0000 2905  gle.com.......).
#     0x0040:  0000 0000 0000 00                        .......
# 17:18:01.249021 eth0  In  IP (tos 0xc0, ttl 64, id 25710, offset 0, flags [none], proto ICMP (1), length 99)
#     172.18.21.134 > 10.1.128.2: ICMP 172.18.21.134 udp port 53 unreachable, length 79
#     IP (tos 0x0, ttl 64, id 10392, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.34956 > 172.18.21.134.53: [bad udp cksum 0x4be0 -> 0x4ecf!] 9390+ [1au] A? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  45c0 0063 646e 0000 4001 c9d0 ac12 1586  E..cdn..@.......
#     0x0010:  0a01 8002 0303 4bcc 0000 0000 4500 0047  ......K.....E..G
#     0x0020:  2898 4000 4011 c672 0a01 8002 ac12 1586  (.@.@..r........
#     0x0030:  888c 0035 0033 4be0 24ae 0100 0001 0000  ...5.3K.$.......
#     0x0040:  0000 0001 0377 7777 0667 6f6f 676c 6503  .....www.google.
#     0x0050:  636f 6d00 0001 0001 0000 2905 0000 0000  com.......).....
#     0x0060:  0000 00                                  ...
# 17:18:01.249058 eth0  Out IP (tos 0x0, ttl 64, id 4793, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.57608 > 172.18.86.200.53: [bad udp cksum 0x8d22 -> 0x7146!] 26717+ [1au] AAAA? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  4500 0047 12b9 4000 4011 9b0f 0a01 8002  E..G..@.@.......
#     0x0010:  ac12 56c8 e108 0035 0033 8d22 685d 0100  ..V....5.3."h]..
#     0x0020:  0001 0000 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 001c 0001 0000 2905  gle.com.......).
#     0x0040:  0000 0000 0000 00                        .......
# 17:18:01.251345 eth0  In  IP (tos 0x0, ttl 124, id 887, offset 0, flags [none], proto UDP (17), length 99)
#     172.18.86.200.53 > 10.1.128.2.57608: [udp sum ok] 26717 q: AAAA? www.google.com. 1/0/1 www.google.com. AAAA 2a00:1450:4001:80b::2004 ar: . OPT UDPsize=4000 (71)
#     0x0000:  4500 0063 0377 0000 7c11 ae35 ac12 56c8  E..c.w..|..5..V.
#     0x0010:  0a01 8002 0035 e108 004f e9a0 685d 8180  .....5...O..h]..
#     0x0020:  0001 0001 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 001c 0001 c00c 001c  gle.com.........
#     0x0040:  0001 0000 0047 0010 2a00 1450 4001 080b  .....G..*..P@...
#     0x0050:  0000 0000 0000 2004 0000 290f a000 0000  ..........).....
#     0x0060:  0000 00                                  ...
# 17:18:06.429450 eth0  Out ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.128.1 tell 10.1.128.2, length 28
#     0x0000:  0001 0800 0604 0001 0242 0a01 8002 0a01  .........B......
#     0x0010:  8002 0000 0000 0000 0a01 8001            ............
# 17:18:06.429507 eth0  In  ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.128.2 tell 10.1.128.1, length 28
#     0x0000:  0001 0800 0604 0001 0242 0c35 ad1f 0a01  .........B.5....
#     0x0010:  8001 0000 0000 0000 0a01 8002            ............
# 17:18:06.429513 eth0  Out ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.128.2 is-at 02:42:0a:01:80:02, length 28
#     0x0000:  0001 0800 0604 0002 0242 0a01 8002 0a01  .........B......
#     0x0010:  8002 0242 0c35 ad1f 0a01 8001            ...B.5......
# 17:18:06.429547 eth0  In  ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.128.1 is-at 02:42:0c:35:ad:1f, length 28
#     0x0000:  0001 0800 0604 0002 0242 0c35 ad1f 0a01  .........B.5....
#     0x0010:  8001 0242 0a01 8002 0a01 8002            ...B........

In parallel, I also tried to call curl -I -6 www.google.com directly from my machine (not from any container), and I got an error saying the host is unreachable too. For some reason, it looks like I can't make any IPv6 calls, even though every command and every config file I know of says IPv6 is enabled. But still, even if IPv6 is broken on my machine, if I run a cURL command without specifying the IPv4 or IPv6 option, it should try both and ignore the failures if any attempt succeeds, right?

Plus, even if the IPv6 configuration on my machine is wrong, that doesn't explain why the curl command works on alpine 3.18 but not on alpine 3.19.

I'm a bit lost tbh, so thank you very much for your help about that!

bradh352 commented 6 months ago

Ok, well that's even more interesting. That means the 10.1.128.2.34956 > 172.18.21.134.53 wasn't generated by c-ares at all, but from your local resolver at 127.0.0.1, as 172.18.21.134 isn't listed in your /etc/resolv.conf at all so there's no way c-ares would try to use that. Can you tcpdump all interfaces on port 53 udp on the machine and try again? I'd expect "lo" listed as an interface with port 53 traffic.
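The reasoning above boils down to comparing the DNS destinations seen on the wire with the servers listed in /etc/resolv.conf. A tiny Python sketch of that comparison, using the addresses from this thread (the function name is purely illustrative):

```python
def unexpected_dns_destinations(observed, configured):
    """Return observed DNS destinations that are not configured nameservers."""
    known = set(configured)
    return [ip for ip in observed if ip not in known]


configured = ["127.0.0.1", "172.18.86.200", "172.18.32.204", "172.18.86.207"]
observed = ["172.18.21.134", "172.18.86.200"]  # destinations seen in the tcpdump

print(unexpected_dns_destinations(observed, configured))  # ['172.18.21.134']
```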

niconoe- commented 6 months ago

When I run sudo tcpdump -vvXnni any port 53 on my machine, the output is flooded by the other docker services currently running for my development workspace.

I'm cleaning up all of them and I'll try again.

After cleaning up, you're right, I can see lots of traces on "lo" interface with port 53 traffic. I can't display it as is because it contains sensitive information and domains from my company, but it looks like

19:08:12.115747 lo    In  IP (tos 0x0, ttl 64, id 13846, offset 0, flags [DF], proto UDP (17), length 82)
    127.0.0.1.53 > 127.0.0.1.42736: [bad udp cksum 0xfe51 -> 0xe9bd!] 53037 q: AAAA? xxxxxxxxx01.ad.xxxxxx.com. 0/0/1 ar: . OPT UDPsize=1280 (54)
    0x0000:  4500 0052 3616 4000 4011 0683 7f00 0001  E..R6.@.@.......
    0x0010:  7f00 0001 0035 a6f0 003e fe51 cf2d 8180  .....5...>.Q.-..
    0x0020:  0001 0000 0000 0001 0b78 7878 7878 7878  .........xxxxxxx
    0x0030:  7878 3031 0261 6406 7878 7878 7878 0363  xx01.ad.xxxxxx.c
    0x0040:  6f6d 0000 1c00 0100 0029 0500 0000 0000  om.......)......
    0x0050:  0000                                     ..

Could those calls interfere with the curl requests I'm trying to make from my containers?

niconoe- commented 6 months ago

I continued to investigate and I found something quite interesting IMO.

I think the issue comes from the fact that I can't use IPv6, neither on my machine nor in any container it hosts. But I also think that something could be improved in c-ares to cope with such a situation.

I managed to simplify my tests to highlight only the important things, so here are my runs:

# Inside a container from image alpine:3.19, on which I added `curl` via `apk add curl`.
/ > curl -Ivvv www.google.com
# * Host www.google.com:80 was resolved.
# * IPv6: 2a00:1450:4025:401::69, 2a00:1450:4025:401::6a, 2a00:1450:4025:401::67, 2a00:1450:4025:401::93
# * IPv4: (none)
# *   Trying [2a00:1450:4025:401::69]:80...
# * Immediate connect fail for 2a00:1450:4025:401::69: Address not available
# *   Trying [2a00:1450:4025:401::6a]:80...
# * Immediate connect fail for 2a00:1450:4025:401::6a: Address not available
# *   Trying [2a00:1450:4025:401::67]:80...
# * Immediate connect fail for 2a00:1450:4025:401::67: Address not available
# *   Trying [2a00:1450:4025:401::93]:80...
# * Immediate connect fail for 2a00:1450:4025:401::93: Address not available
# * Failed to connect to www.google.com port 80 after 2003 ms: Couldn't connect to server
# * Closing connection
# curl: (7) Failed to connect to www.google.com port 80 after 2003 ms: Couldn't connect to server

/ > curl -Ivvv4 www.google.com
# * Host www.google.com:80 was resolved.
# * IPv6: (none)
# * IPv4: 142.250.27.105, 142.250.27.106, 142.250.27.99, 142.250.27.103, 142.250.27.147, 142.250.27.104
# *   Trying 142.250.27.105:80...
# * Connected to www.google.com (142.250.27.105) port 80
# > HEAD / HTTP/1.1
# > Host: www.google.com
# > User-Agent: curl/8.5.0
# Blablabla… the response is OK

To me, that means that when I first ran curl without specifying the IP version to use, c-ares tried to resolve the domain name over both IPv6 and IPv4 and stopped as soon as one resolution had been found. That is the point of c-ares, faster DNS resolution, since there is no need to continue resolving once a name is already resolved, even for the other address family. In the first command I ran, you can see that IPv6 was resolved, but not IPv4 (IPv4: (none)). In the second command, I forced the use of IPv4 and, fortunately, c-ares understands that, only tries to resolve the domain name for IPv4, and manages to do it.

To me, that means 2 issues:

  1. on my local configuration, IPv6 should work, as it's enabled
  2. c-ares shouldn't be so confident that the DNS resolution is complete. The fact that a name is resolved (whether to IPv4 or IPv6) doesn't mean the host can reach that address. Or, if that's the main goal of c-ares, maybe curl shouldn't use c-ares blindly and should have a fallback.

I'll investigate more to make IPv6 work in my environment, and this should solve my issue, but I strongly suspect that other people with misconfigured setups will hit the same problem: curl fails because it trusts c-ares, which takes the lazy route and doesn't check whether the host can actually reach the resolved IP.
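To illustrate the second point, here is a deliberately simplified Python model of "try each resolved address until one connects". It is an assumption-laden sketch and not curl's actual connection logic, which is more involved; `try_connect` is a hypothetical callable standing in for a real connection attempt.

```python
def connect_first_reachable(addresses, try_connect):
    """Try each address in order and return the first successful connection.

    `try_connect(address)` must raise OSError on failure. If the resolver
    returned no usable address, there is nothing left to fall back to.
    """
    errors = []
    for address in addresses:
        try:
            return try_connect(address)
        except OSError as exc:
            errors.append((address, exc))
    raise ConnectionError(
        f"no reachable address among {len(addresses)} candidates: {errors}"
    )
```

With a candidate list containing only IPv6 addresses on a host without IPv6 connectivity, every attempt fails and the call raises, even though an IPv4 address would have worked. The fallback can only help if the resolution step hands over addresses from both families.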

PS: Looking at c-ares changelogs, I think this issue might be actually solved on v1.24, but as the alpine:3.19 sticks with c-ares 1.22, I can't test it further.

bradh352 commented 6 months ago

It's impossible to tell what is going on with your system with the information provided. The real issue is that you have a local DNS resolver running at 127.0.0.1 and other servers configured. We can't tell from what you've provided what c-ares is doing versus your local resolver.

I don't believe your conclusion is accurate based on the information at hand. Really you either need to remove your local resolver from /etc/resolv.conf and test ... or remove all other dns servers and leave only the local resolver.

aptalca commented 6 months ago

This should fix it: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/58154

If you want to test locally, you can install c-ares from the edge repo while on 3.19 and see if it works.

Tithugues commented 6 months ago

This should fix it: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/58154

If you want to test locally, you can install c-ares from the edge repo while on 3.19 and see if it works.

Hi!

Thanks for the information! Here is the result of my test:

/ # echo "@edge https://dl-cdn.alpinelinux.org/alpine/edge/main" >> /etc/apk/repositories
/ # apk add c-ares@edge curl
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
(1/8) Installing c-ares@edge (1.24.0-r0)
(2/8) Installing ca-certificates (20230506-r0)
(3/8) Installing brotli-libs (1.1.0-r1)
(4/8) Installing libunistring (1.1-r2)
(5/8) Installing libidn2 (2.3.4-r4)
(6/8) Installing nghttp2-libs (1.58.0-r0)
(7/8) Installing libcurl (8.5.0-r0)
(8/8) Installing curl (8.5.0-r0)
Executing busybox-1.36.1-r15.trigger
Executing ca-certificates-20230506-r0.trigger
OK: 12 MiB in 23 packages
/ # curl --version
curl 8.5.0 (x86_64-alpine-linux-musl) libcurl/8.5.0 OpenSSL/3.1.4 zlib/1.3 brotli/1.1.0 c-ares/1.24.0 libidn2/2.3.4 nghttp2/1.58.0
Release-Date: 2023-12-06
Protocols: dict file ftp ftps gopher gophers http https imap imaps mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTPS-proxy IDN IPv6 Largefile libz NTLM SSL threadsafe TLS-SRP UnixSockets
/ # curl www.google.com
curl: (7) Failed to connect to www.google.com port 80 after 2001 ms: Couldn't connect to server

At least in my environment, it seems to still fail even with the new version of c-ares.

If you see any issue with my test or would like me to test anything else, please let me know. :pray:

Thanks again.

niconoe- commented 6 months ago

FWIW, I get exactly the same result as @Tithugues above: I still can't connect, with the same issue.

That means my assumption that the issue was caused by c-ares v1.22 is wrong.

To me, the issue is still related to the fact that I can't connect to anything over IPv6, and the "software" responsible for DNS resolution is failing to do its job properly, a.k.a. falling back to IPv4. I thought it was c-ares, as it is a new dependency of curl in alpine 3.19 compared to 3.18, but maybe I was wrong, or maybe it actually is c-ares but version 1.24 doesn't fix my problem.

When running this on alpine 3.19

/ > curl -vvv www.google.com --trace-time
# 09:55:27.932169 * Host www.google.com:80 was resolved.
# 09:55:27.932296 * IPv6: 2a00:1450:4001:80b::2004
# 09:55:27.932366 * IPv4: (none)
# 09:55:27.932437 *   Trying [2a00:1450:4001:80b::2004]:80...
# 09:55:27.932526 * Immediate connect fail for 2a00:1450:4001:80b::2004: Network unreachable
# 09:55:27.932587 * Failed to connect to www.google.com port 80 after 2001 ms: Couldn't connect to server
# 09:55:27.932639 * Closing connection
# curl: (7) Failed to connect to www.google.com port 80 after 2001 ms: Couldn't connect to server

I can clearly see that DNS resolves the IPv6 address faster, but as I can't use IPv6, I just can't connect. IPv6 reachability should be tested before the IPv6 resolution starts, as there's no point in resolving an address that can't be used.
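One common heuristic for such a pre-check is to ask the kernel whether it has any route to an IPv6 destination. The Python sketch below is only an illustration of that idea: nothing here claims that curl or c-ares works this way, the function name is made up, and the probe address comes from the 2001:db8::/32 documentation prefix, so no real host is contacted.

```python
import socket


def has_ipv6_route(probe_address: str = "2001:db8::1", port: int = 53) -> bool:
    """Best-effort check for a usable IPv6 route on this host.

    connect() on a UDP socket only performs a route lookup and sends no
    packet. Any OSError (no IPv6 support, network unreachable, address
    not available) is treated as "no IPv6".
    """
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
            sock.connect((probe_address, port))
        return True
    except OSError:
        return False
```

If this returns False inside the container, AAAA lookups can only produce addresses that will fail at connect time, as in the trace above.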

If I run the exact same command under alpine 3.18, we can see the DNS resolution is done on both IPv6 and IPv4:

/ > curl -vvv www.google.com --trace-time
# 10:00:08.889585 * Host www.google.com:80 was resolved.
# 10:00:08.889688 * IPv6: 2a00:1450:4001:80b::2004
# 10:00:08.889758 * IPv4: 142.250.186.164
# 10:00:08.889855 *   Trying 142.250.186.164:80...
# 10:00:08.892884 * Connected to www.google.com (142.250.186.164) port 80
# 10:00:08.893074 > GET / HTTP/1.1
# 10:00:08.893074 > Host: www.google.com
# 10:00:08.893074 > User-Agent: curl/8.5.0
# 10:00:08.893074 > Accept: */*
# 10:00:08.893074 > 
# 10:00:08.938750 < HTTP/1.1 200 OK
# 10:00:08.938830 < Date: Thu, 04 Jan 2024 10:00:08 GMT
# 10:00:08.938886 < Expires: -1
# 10:00:08.938950 < Cache-Control: private, max-age=0
# 10:00:08.939035 < Content-Type: text/html; charset=ISO-8859-1
# 10:00:08.939107 < Content-Security-Policy-Report-Only: object-src 'none';base-uri 'self';script-src 'nonce-btQ4x8jy7FUdjfmcMd8zxQ' 'strict-dynamic' 'report-sample' 'unsafe-eval' 'unsafe-inline' https: http:;report-uri https://csp.withgoogle.com/csp/gws/other-hp
# 10:00:08.939171 < Server: gws
# 10:00:08.939236 < X-XSS-Protection: 0
# 10:00:08.939308 < X-Frame-Options: SAMEORIGIN
# 10:00:08.939367 < Set-Cookie: AEC=Ackid1T8i9FSUMjTgdj_cyfnoIvnHWy4Kp6QBB4EJ6ShA1xNuiHoehcWOw; expires=Tue, 02-Jul-2024 10:00:08 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
# 10:00:08.939431 < Accept-Ranges: none
# 10:00:08.939520 < Vary: Accept-Encoding
# 10:00:08.939577 < Transfer-Encoding: chunked
# 10:00:08.939639 < 
# <!doctype html>…</html> # Google's homepage.

So that works thanks to the IPv4 connection.

Whatever "software", responsible for stopping any DNS resolution as soon as either IPv6 or IPv4 is resolved, must be improved to either:

With my hands tied on this currently, I don't know how to go further here, unfortunately :cry:

bradh352 commented 6 months ago

As stated before, you have both a local dns server running at 127.0.0.1 and configurations of other servers which greatly complicates the ability to debug what is going on. I'd need access to a system that's not working in order to have any chance of determining what is really going on.

Likely https://github.com/c-ares/c-ares/pull/551 plays a role in the issue, but it doesn't seem wise to revert that as it will greatly extend DNS resolution times.

niconoe- commented 6 months ago

As stated before, you have both a local dns server running at 127.0.0.1 and configurations of other servers which greatly complicates the ability to debug what is going on.

When in my container, if I open the /etc/resolv.conf file, I can see there's a line with nameserver <my_local_machine_ip>. As soon as I remove this line, curl is able to resolve the addresses in both IPv6 and IPv4, and the request succeeds.

So, indeed, something is related to the configuration of my local DNS on my local machine. Thank you very much for pointing this out :love_you_gesture:

This leads me to 2 questions: 1) Why is there no issue with my local DNS configuration in alpine 3.18 (or maybe there is, but it's not blocking there while it is blocking in alpine:3.19)? 2) How can I understand what's wrong in my local DNS configuration so I can fix it?

Question 2 is probably for IT dept. of my company :laughing: .
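
One way to start on question 2 is to ask each configured nameserver separately for the A and AAAA records, so that a server which fails or answers slowly for one record type stands out. The following is only a sketch: list_nameservers and probe_nameservers are hypothetical helper names, and it assumes an nslookup that accepts -type= and a server argument (the BusyBox one shipped in Alpine should).

```shell
# Print the address of every "nameserver" line in a resolv.conf-style file.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Ask each nameserver separately for the A and AAAA records of a host.
probe_nameservers() {
    host="$1"
    conf="${2:-/etc/resolv.conf}"
    for ns in $(list_nameservers "$conf"); do
        for type in A AAAA; do
            echo "== $ns / $type =="
            nslookup -type="$type" "$host" "$ns" || echo "(lookup failed)"
        done
    done
}
```

Inside the container, this would be run as probe_nameservers www.google.com.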

bradh352 commented 6 months ago

What is the behavior in the opposite direction, if you leave only that local DNS server in place? Does it still get an ipv6 address (when running curl with -Ivvv)?

niconoe- commented 6 months ago

What is the behavior in the opposite direction, if you leave only that local DNS server in place? Does it still get an ipv6 address (when running curl with -Ivvv)?

Nope, it makes "www.google.com" unresolvable:

/ > curl -I -vvv www.google.com
# * Could not resolve host: www.google.com
# * Closing connection
# curl: (6) Could not resolve host: www.google.com

So maybe there's a real, big issue with my local DNS that used to be masked by the other nameservers I have. However, I think I still need this nameserver pointing to my local machine in order for my containers to communicate with each other.

bradh352 commented 6 months ago

So is your local nameserver meant to only resolve some subset of domains, specific to your internal network? If so, I believe it's supposed to have a # suffix to indicate the base domain it is authoritative for (that said, c-ares doesn't currently support that; we have a ticket on that: https://github.com/c-ares/c-ares/issues/642 )

bradh352 commented 6 months ago

By the way, my theory is your local DNS server is configured to be recursive, but since IPv6 is not working on the host the ipv6 fails fast, so c-ares sends the ipv6 query to the next configured server. But the ipv4 query tries to recurse within your local DNS server, and eventually fails and returns that failure to c-ares ... however, by the time it fails, c-ares already received a legitimate reply for ipv6 from the next server so any retries for ipv4 are halted and you get only an ipv6 address back.

If that is really what is happening, this falls within an "undefined behavior" grey zone. Since your local DNS server can't recurse, recursion should be disabled in its configuration, which in theory should fix the issue.

niconoe- commented 6 months ago

I believe it's supposed to have a # suffix to indicate the base domain it is authoritative for

Thanks for sharing this, I wasn't aware of that. I'll do that soon.

By the way, my theory is your local DNS server is configured to be recursive, but since IPv6 is not working on the host the ipv6 fails fast, so c-ares sends the ipv6 query to the next configured server.

I really do think so, or something close to it. My local DNS is dnsmasq and, if I understand what I found online about it, it's not recursive, but it follows everything, so the behavior is probably similar to what you described. The problem is that there is no way to stop it from following unless I add the no-resolv option to dnsmasq. But if I do that, it will no longer resolve anything, so the docker service called acme-my-service that I can reach today from another container via curl -I acme-my-service.my-company.local will no longer be accessible, as my local DNS will not resolve that domain name.

Or maybe I'm misunderstanding something?
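
On the no-resolv question: according to the dnsmasq man page, no-resolv only stops dnsmasq from reading upstream servers from /etc/resolv.conf, while names it defines itself (through address= lines or /etc/hosts, for example) are still answered. Here is a minimal sketch, assuming the .local names are served from an address= line, which this thread does not confirm. The domain and IP passed in are placeholders, and render_dnsmasq_conf is a hypothetical helper.

```shell
# Print a dnsmasq configuration that has no upstream resolvers.
# The domain and address arguments are placeholders, not values from this thread.
render_dnsmasq_conf() {
    domain="$1"
    address="$2"
    cat <<EOF
# Do not read upstream servers from /etc/resolv.conf.
no-resolv
# Answer every name under this domain locally with a fixed address.
address=/$domain/$address
EOF
}
```

For example, render_dnsmasq_conf my-company.local 192.0.2.10 prints a file that could be reviewed before being placed in the dnsmasq configuration directory. Whether dnsmasq then rejects other names quickly enough for c-ares to move on to the next server is something to verify on the affected machine.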

bradh352 commented 6 months ago

Are your domains you're trying to resolve really ending in ".local"? If so, ".local" is reserved for multicast DNS (mDNS). That would also mean you're not maintaining any form of internal dns records within your local resolver.

Perhaps this is a workaround for the fact that the alpine linux musl libc resolver doesn't implement multicast dns, but dnsmasq does, which would explain why you might have your configuration this way.

In fact, c-ares doesn't yet support multicast dns either, but it is something we are aware of and is on my task list ( https://github.com/c-ares/c-ares/issues/171 ).

The obvious solution here would be to make it so your dnsmasq can fully perform recursive DNS operations properly, and make it the only dns server in your /etc/resolv.conf. That is honestly the only configuration that would make your setup not rely on some undefined behavior (that just so happens to work some or most of the time).

beroset commented 6 months ago

This should fix it: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/58154

If you want to test locally, you can install c-ares from the edge repo while on 3.19 and see if it works.
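
A possible way to do that, as a sketch only: alpine_repo_url and upgrade_c_ares_from_edge are hypothetical helper names, and the URL layout is copied from the apk fetch lines earlier in this issue. Mixing edge packages into a 3.19 system may pull in newer dependencies, so this is for testing rather than production images.

```shell
# Build the URL of an Alpine package repository (branch: edge, v3.19, ...).
alpine_repo_url() {
    echo "https://dl-cdn.alpinelinux.org/alpine/$1/${2:-main}"
}

# Upgrade only c-ares from edge, leaving the rest of the 3.19 system alone.
upgrade_c_ares_from_edge() {
    apk add --upgrade c-ares --repository "$(alpine_repo_url edge main)"
}
```

Running upgrade_c_ares_from_edge inside an alpine:3.19 container and then repeating the curl -I -vvv test should show whether the newer c-ares changes the result.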

FYI, I came across this via a different route. I was using alpine:latest to build some software and discovered that git reported a domain lookup failure:

/tmp # git clone https://github.com/AsteroidOS/asteroidos.org.git
Cloning into 'asteroidos.org'...
fatal: unable to access 'https://github.com/AsteroidOS/asteroidos.org.git/': Could not resolve host: github.com

I can verify that using alpine:edge (202312119) instead of alpine:latest (3.19.0) fixes this problem and that alpine:3.18 (3.18.5) also works.

niconoe- commented 6 months ago

Are your domains you're trying to resolve really ending in ".local"? If so, ".local" is reserved for multicast DNS (mDNS). That would also mean you're not maintaining any form of internal dns records within your local resolver.

Perhaps this is a workaround to the fact that the alpine linux musl libc resolver doesn't implement multicast dns, but dnsmasq does, which makes a lot of sense why you might have your configuration this way.

Yes, my servers are reachable via .local in my local environment. They used to be reachable via .dev, but I had to change when Google decided to reserve the .dev TLD :laughing:. I guess I just picked a bad TLD twice :laughing:.

The obvious solution here would be to make it so your dnsmasq can fully perform recursive DNS operations properly, and make it the only dns server in your /etc/resolv.conf. That is honestly the only configuration that would make your setup not rely on some undefined behavior (that just so happens to work some or most of the time).

Unfortunately, due to constraints imposed by my company, I can't remove the other dns servers, otherwise I won't have access to the internal servers my company is hosting. I'm trying to put my local dns at the end of the list, hoping that the other DNS servers are better configured than my own, and letting c-ares go through the list until it reaches mine when appropriate.

niconoe- commented 6 months ago

This should fix it: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/58154 If you want to test locally, you can install c-ares from the edge repo while on 3.19 and see if it works.

FYI, I came across this via a different route. I was using alpine:latest to build some software and discovered that git reported a domain lookup failure:

/tmp # git clone https://github.com/AsteroidOS/asteroidos.org.git
Cloning into 'asteroidos.org'...
fatal: unable to access 'https://github.com/AsteroidOS/asteroidos.org.git/': Could not resolve host: github.com

I can verify that using alpine:edge (202312119) instead of alpine:latest (3.19.0) fixes this problem and that alpine:3.18 (3.18.5) also works.

I'll give it a try with alpine:edge. Thanks for the info :heart:

EDIT : aaaaaaand, that's a failure :laughing:

> docker run --rm -it --entrypoint=/bin/sh alpine:edge
# Unable to find image 'alpine:edge' locally
# edge: Pulling from library/alpine
# dcccee43ad5d: Pull complete 
# Digest: sha256:9f867dc20de5aa9690c5ef6c2c81ce35a918c0007f6eac27df90d3166eaa5cc0
# Status: Downloaded newer image for alpine:edge
/ > apk add curl
# fetch https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
# fetch https://dl-cdn.alpinelinux.org/alpine/edge/community/x86_64/APKINDEX.tar.gz
# (1/8) Installing ca-certificates (20230506-r0)
# (2/8) Installing brotli-libs (1.1.0-r1)
# (3/8) Installing c-ares (1.24.0-r0)
# (4/8) Installing libunistring (1.1-r2)
# (5/8) Installing libidn2 (2.3.4-r4)
# (6/8) Installing nghttp2-libs (1.58.0-r0)
# (7/8) Installing libcurl (8.5.0-r0)
# (8/8) Installing curl (8.5.0-r0)
# Executing busybox-1.36.1-r17.trigger
# Executing ca-certificates-20230506-r0.trigger
# OK: 12 MiB in 23 packages
/ > curl -I -vvv www.google.com
# * Host www.google.com:80 was resolved.
# * IPv6: 2a00:1450:4025:401::69, 2a00:1450:4025:401::93, 2a00:1450:4025:401::67, 2a00:1450:4025:401::6a
# * IPv4: (none)
# *   Trying [2a00:1450:4025:401::69]:80...
# * Immediate connect fail for 2a00:1450:4025:401::69: Address not available
# *   Trying [2a00:1450:4025:401::93]:80...
# * Immediate connect fail for 2a00:1450:4025:401::93: Address not available
# *   Trying [2a00:1450:4025:401::67]:80...
# * Immediate connect fail for 2a00:1450:4025:401::67: Address not available
# *   Trying [2a00:1450:4025:401::6a]:80...
# * Immediate connect fail for 2a00:1450:4025:401::6a: Address not available
# * Failed to connect to www.google.com port 80 after 2002 ms: Couldn't connect to server
# * Closing connection
# curl: (7) Failed to connect to www.google.com port 80 after 2002 ms: Couldn't connect to server

niconoe- commented 6 months ago

The content of /etc/resolv.conf in my containers is actually defined by the configuration in my local /etc/docker/daemon.json file.

EDIT: (by the way, adding a # at the end of my local IP as indicated here: https://github.com/alpinelinux/docker-alpine/issues/366#issuecomment-1877116062 doesn't work in /etc/docker/daemon.json, as Docker detects that the content is not an IP address and refuses to accept this configuration. I'll just ignore this, unfortunately).

If I put my local DNS (nameserver <my_ip>) at the end of the list, I manage to solve my cURL issue, as I let my company's DNS resolve the domains instead of mine, and the call works. Somehow, it also works when I ask for local domains, probably because my company's DNS servers are better configured than my dnsmasq, so I think I'll just go with it: treating my local DNS as the last resort for resolving domains, so that calls to external hosts are resolved correctly. As a downside, it will slow down my internal calls between my servers a bit (some microseconds to milliseconds), but as this is for my local environment only, I guess that's acceptable.
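
The ordering described above could be sketched like this. render_dns_config is a hypothetical helper and every address passed to it is a placeholder. The "dns" key is the daemon.json setting that Docker uses to fill /etc/resolv.conf in containers.

```shell
# Print a daemon.json "dns" list with the company resolvers first and the
# local dnsmasq last. All addresses passed in are placeholders.
render_dns_config() {
    local_dns="$1"; shift
    printf '{\n  "dns": ['
    for server in "$@"; do
        printf '"%s", ' "$server"
    done
    printf '"%s"]\n}\n' "$local_dns"
}
```

For example, render_dns_config 192.168.1.10 10.0.0.53 10.0.0.54 prints a "dns" list ending with 192.168.1.10. The key would need to be merged into the existing /etc/docker/daemon.json rather than overwriting it, and after restarting Docker the resulting order can be checked with docker run --rm alpine:3.19 cat /etc/resolv.conf.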