Open JustArchi opened 6 years ago
cc: @stephentoub
Thanks @JustArchi for your report! Can you please clarify repro status on Windows? Did it repro or not? Or did you not try at all? (that's ok, just clarifying your comment above)
@wfurt can you please take a look? cc @geoffkizer
@karelz I checked .NET Core 2.1 rc1 on Windows and couldn't reproduce the issue, got OK
from my repro.
I did quick try on Ubuntu 16.04 and I also get OK. What base OS do you use @JustArchi ?
It's Debian Testing (currently 10/Butcher
) with all available updates, kernel 4.16.0-1-amd64
(Debian 4.16.5-1
).
I also thought the OS could have something to do with it, but then curl handler wouldn't work either, so it's possible that it's some OS-layer incompatibility.
It's also good to note that I couldn't reproduce it with other https
servers, only Google's one gave me issues, so it's not entirely broken either. I'm sure that my SSL certs are OK, otherwise curl wouldn't work either.
I checked my other dev machine running on Debian sid and I also couldn't reproduce my issue there either, I'll set up other one on testing in the meantime to ensure it's not on Debian's end, but then again, if it was some IP-related block or likewise, curl shouldn't work either.
I tested another machine on Debian Testing and couldn't reproduce the issue either, so it has to be something with one of my machines. I wonder how I can help narrow down this issue and what is the root cause of it in the first place. Do you have any idea what I could provide do to help?
In the meantime I'll keep looking, maybe I find some factor that could possibly be causing this on OS side.
You can always try packet capture. You can also check if the name resolves to same IP address. Also is the hardware on both machines similar? number of cores would for example impact size of thread pool.
@JustArchi so it seems it is one machine problem (with a specific URL) on Debian development branch. Is that correct? If that's the case, I think the best thing is to debug it through all layers and identify which one behaves incorrectly.
Yes, one machine for now, and I found out the first clue.
My dev machines that I tried to reproduce this issue on (and had no luck) don't have IPv6, they used IPv4 address of the translate.google.com
and that worked. On the machine with the bug reproduced, I forced usage of IPv4 by manually hardcoding 216.58.215.46 translate.google.com
to /etc/hosts
, and that one did the trick, I couldn't run into the bug anymore. So this issue is IPv6 related, now the question is whether this is still my machine-specific IPv6 related issue, or something more global.
Since I don't have more IPv6 machines around, could you try to reproduce this bug for me on IPv6-enabled Linux machine? I don't believe Debian has anything to do with it, so it could be any Linux machine.
I wonder if curl handler also uses IPv6 by default, this could be the reason why I see behaviour change between those two, gonna keep digging and let you know if I found out something more. It could also be some IPv6-related bug in socket handler, although in this case I'm pretty sure somebody else would find it out much sooner than me, maybe it's some specific combination, I'll try to find out.
can you do packet capture and trace the attempt? I'm wondering if IPv6 is really used or if it falls back to IPv4 after some time.
I did the test and I indeed confirmed my initial thought, this is not a bug in socket http handler itself, but rather different behaviour compared to curl handler. Could be also called a regression if we're comparing default behaviour between .NET Core 2.0 and 2.1.
In general, my machine can't access translate.google.com
through IPv6, why exactly - I'm going to find out myself, this is not important for now.
What is important is the fact that it seems that curl handler tries to fallback to IPv4 when IPv6 fails, this is why curl handler works.
If I force curl
to use IPv6 exclusively, this also fails:
$ curl -I -m 5 -6 "https://translate.google.com"
curl: (28) Connection timed out after 5001 milliseconds
While -4
works:
$ curl -I -m 5 -4 "https://translate.google.com"
HTTP/2 403
(We're getting 403 but it's irrelevant, the connection being established matters)
So what is curl doing by default? Let's find out:
$ curl -v -I -m 5 "https://translate.google.com"
* Rebuilt URL to: https://translate.google.com/
* Trying 2a00:1450:400c:c0a::8a...
* TCP_NODELAY set
* Trying 216.58.215.46...
* TCP_NODELAY set
* Connected to translate.google.com (216.58.215.46) port 443 (#0)
Like you can see, it tries IPv6 first, fails, then tries IPv4 next. SocketHttpHandler
doesn't do that, once it gets IPv6 from DNS resolve, it'll try to connect through that IP address only, even if it doesn't work while IPv4 address does.
Question is, if this is intended behaviour (then we can close the issue), or if perhaps we could do something to improve it (e.g. make it work like curl handler), since I strongly believe that it'd greatly benefit HttpClient
if it did exactly the same what curl does - attempt to try the next (different) IP address for the same domain, if possible.
I'm pretty sure that you can reproduce this reliably with any IPv6-enabled machine requesting any domain with AAAA
and A
DNS entries where IPv6 address simply times out, while IPv4 works just fine. It happened to be google server in my case, why exactly - probably a coincidence, temporary issue, or some machine-specific problem I'm going to find out myself, since it's not important for this issue.
Like you can see, it tries IPv6 first, fails, then tries IPv4 next. SocketHttpHandler doesn't do that, once it gets IPv6 from DNS resolve, it'll try to connect through that IP address only, even if it doesn't work while IPv4 address does.
SocketsHttpHandler should be trying all the IP addresses (IPv6 and IPv4) that are returned from the DNS resolver API calls. On Windows, there is an API to connect to the DNS name which will automatically try IPv6 and IPv4 in parallel to speed up connection result so that it won't have to waste time failing on the IPv6 address before it gets connected on the IPv4.
I wonder how I can confirm this then or debug the issue further, since this issue doesn't happen if I manually hardcode translate.google.com
to 216.58.215.46
via /etc/hosts
. DNS query is then omitted entirely.
I took a quick look if perhaps data returned by DNS resolve could something to do with it, but it looks fine:
$ host translate.google.com
translate.google.com is an alias for www3.l.google.com.
www3.l.google.com has address 216.58.215.46
www3.l.google.com has IPv6 address 2a00:1450:4007:808::200e
$ ping4 translate.google.com
PING www3.l.google.com (216.58.215.46) 56(84) bytes of data.
$ ping6 translate.google.com
PING translate.google.com(par21s17-in-x0e.1e100.net (2a00:1450:4007:808::200e)) 56 data bytes
It might have something to do with the fact that IPv4 query is not even being tried since we time out with IPv6, and we don't have enough of time to try IPv4 next. I mean, it's not intended to send 2 requests through 2 different IP addresses right away, so there has to be some kind of timeout to move forward.
If CURL got "stuck" on that IPv6, then it'd time out as well, eventually. But it's somehow smart and detects in a fraction of second that IPv6 connection fails, then moves out to IPv4 immediately, making it in time regarding supplied timeout. Socket handler probably times out on IPv6, and doesn't even have enough of time to try out IPv4 next. This is just my theory though, I really don't know socket handler internals, you're the expert here.
There is definitely a bit different behaviour regarding handling this though, since switching to curl handler solves the initial issue for me, as well as hardcoding IPv4 address manually in /etc/hosts
. On top of the fact that if DNS query was faulty, then curl couldn't possibly work either.
just for the record, I verified that it can work via IPv6 - if IPv6 works. I think this has been general problem as global site may advertise both IPv4/6 but local network or ISP may not support IPv6. Some systems allows to set preferences: http://sf-alpha.bjgang.org/wordpress/2012/08/linux-prefer-ipv4-over-ipv6-in-dual-stack-environment-and-prevent-problems-when-only-ipv4-exists/ https://wiki.vpsget.com/index.php/Prefer_IPv4_over_ipv6_._How_to_set_ipv4_precedence That may be short term solution.
Yeah the fact that I can't connect through IPv6 is definitely my machine issue and I'll solve it in one way or another, this is a technical difficulty that is on me.
I'm just wondering now if there is anything to improve here in this case or we should keep it like that, since technically it's not broken, but it could be improved in a way to work like curl handler did. If people have such "broken" setups (or rather lack of IPv6 without even knowing about it) then they might see regressions on linux with .NET Core 2.1 where curl handler worked just fine with it on .NET Core 2.0, while new socket handler no longer does and times out.
It would be nice to have SocketsHttpHandler more resilient, however, I don't think we should treat it as compatibility problem between curl handler and SocketsHttpHandler - rather as enhancement. (unless we find out lots of people hitting it)
I would strongly recommend to not use the curl handler as workaround, rather use the IPv4 address. Otherwise you will be stuck in the past. Our plan is to eventually get rid of the curl handler entirely (hopefully in next major release).
It's alright, thank you a lot for your help, I'll leave this issue open as an enhancement, but feel free to close it if you decide that it's not worth it to improve socket handler in this regard.
In the meantime I'll see how I can fix IPv6 on my machine 🙂
Have a nice day!
EDIT: In case somebody would have similar issue on linux, I added precedence ::ffff:0:0/96 100
to /etc/gai.conf
- this tells getaddrinfo()
to prefer IPv4 addresses over IPv6 when both are available.
From offline discussion: There is a concern that SocketsHttpHandler does not try all entries returned by DNS. We should check it. That would qualify as something we might need to fix (potentially in servicing).
@rmkerr can you please take a look at SocketsHttpHandler and how it deals with multiple DNS entries?
Yep, I'll take a look!
Based purely on code inspection this looks correct on the SocketsHttpHandler
side. We're using DnsEndpoint
with Socket.ConnectAsync
:
https://github.com/dotnet/corefx/blob/d5f6e6a10b9b1a9cb42066ebceea21e929adb487/src/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/ConnectHelper.cs#L56-L61
The documentation for DnsEndpoint
makes it clear that ConnectAsync
will attempt to connect to each address until a connection succeeds, regardless of the address family:
The AddressFamily property of any Socket that is created by calls to the ConnectAsync method
will be the address family of the first address to which a connection can be successfully
established (not necessarily the first address to be resolved).
There could still be a bug in Socket
or in DnsEndpoint
(or the docs), so this warrants more investigation. I'm going to work on reproducing this issue locally, and will update this thread when I have results.
I wonder if we have some logging in Sockets/SocketsHttpHandler that could help us confirm what is happening on @JustArchi's machine. Maybe in combo with Wireshark ...
How long does it take to try new address @rmkerr ? There was overall 60s timeout on HTTP.
Also from diagnostic @karelz
System.OperationCanceledException: The operation was canceled. at System.Net.Http.HttpClient.HandleFinishSendAsyncError(Exception e,CancellationTokenSource cts) at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
It would be really nice if this clearly states we were not able to connect. According to @JustArchi IPv6 never worked on his machine so TCP would never been established. Cryptic message like the one above is not that helpful for troubleshooting.
@karelz it might be useful to have logs if I can't reproduce the issue. I have not been able to repro the issue on Windows, but I will try on Linux before taking that approach.
@wfurt I'm not sure of the exact time, but it is far less than 60 seconds. When running the app in a console on windows there is no visible delay. I think this is likely a Linux specific issue though, so I will try it there next.
@karelz Since my ISP has fixed the issue and my IPv6 route is operative again, sadly I'm unable to assist further due to not being able to reproduce this issue anymore myself, at least physically.
However, I can definitely provide some further details about what exactly was happening:
Therefore reproducing this issue might be tricky, because normally when there is no IPv6 support or route, then this address is not even being used, or route is immediately announced as being unreachable. You'd need to cause a networking issue on the other side, such as manually dropping all IPv6 packets from given machine to outside world e.g. on router, without the host machine knowing that you're intentionally cutting IPv6 traffic despite of route being up. This is different because it leads to valid connection timing out vs router announcing that the network is unreachable.
To me it looked like HttpClient
did the thing right by trying to connect through IPv6 firstly, but due to excessive timeout, IPv4 try never happened, as we reached HttpClient
timeout before in fact even trying other DNS endpoint. curl
handler could return from that faulty IPv6 connection immediately, and this is why it worked, because IPv4 was tried next. This is only my theory though of how it looked like, the issue might be entirely different, so it should be taken with a grain of salt, but I confirmed that curl
indeed does that by verbosely seeing what was happening while my issue was up, with output above, so very likely curl handler right in HttpClient
did the very same thing.
Perhaps it could be possible to reproduce this issue on OS level artificially, without even fiddling with the router, using ip6tables
:
ip6tables -P INPUT DROP
ip6tables -P FORWARD DROP
ip6tables -P OUTPUT DROP
This could potentially reproduce my issue, since the kernel won't announce packets being dropped and just silently not respond at all. I didn't check if this in fact reproduces the issue, but I think it should. I can try to run into this issue if you'd like to, but only artificially, as physically everything is fine with my machine again.
Hope it helps.
If you get repro again, could you run this @JustArchi strace -t -f -e connect ~/latest/dotnet run 2>&1 | grep AF_INET
That would show sequence and timing for attempts to connect. I did more digging and in my setup IPv6 fails immediately with ENETUNREACH and that may be different than yours.
Hooray! My ip6tables
trick in fact did work and reproduced this issue artificially, so I can still be useful after all 🙂
Here are the results (I have local DNS server for name resolve):
root@archi:/tmp/test# strace -t -f -e connect dotnet run 2>&1 | grep AF_INET
[pid 20589] 20:32:02 connect(39, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
[pid 20589] 20:32:02 connect(39, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("216.58.206.238")}, 16) = 0
[pid 20589] 20:32:02 connect(39, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2a00:1450:4007:815::200e", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
[pid 20589] 20:32:02 connect(37, {sa_family=AF_INET6, sin6_port=htons(443), inet_pton(AF_INET6, "2a00:1450:4007:815::200e", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
# My repro is stuck for around 60 seconds here, exits silently without printing exception due to grepping
root@archi:/tmp/test# DOTNET_SYSTEM_NET_HTTP_USESOCKETSHTTPHANDLER=0 strace -t -f -e connect dotnet run 2>&1 | grep AF_INET
[pid 20882] 20:34:04 connect(36, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
[pid 20882] 20:34:04 connect(36, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("216.58.206.238")}, 16) = 0
[pid 20882] 20:34:04 connect(36, {sa_family=AF_INET6, sin6_port=htons(443), inet_pton(AF_INET6, "2a00:1450:4007:815::200e", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
[pid 20881] 20:34:04 connect(36, {sa_family=AF_INET6, sin6_port=htons(443), inet_pton(AF_INET6, "2a00:1450:4007:815::200e", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
[pid 20881] 20:34:05 connect(37, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("216.58.206.238")}, 16) = -1 EINPROGRESS (Operation now in progress)
# Program finishes here almost immediately without being stuck, repro "OK" not printed due to grepping
I believe you should be able to reproduce it as well by just following my ip6tables
trick above. Of course, you still require a machine that has IPv6 enabled and operative.
Let me know if I can help further.
Thanks for all the info @JustArchi! I'm able to reproduce this issue on Linux and can almost reproduce it on Windows (more on that later). Rather than taking the iptables approach, I edited my hosts file to add a test hostname with two entries:
0100:f8b0:400a:809::200e local.test.addr
127.0.0.1 local.test.addr
The IPv6 address does not exist, and the IPv4 address can point to any server that will return a response.
On both Windows and Linux we attempt to connect to the IPv6 address first. We receive no response, and attempt several SYN re-transmissions. There appears to be a platform difference here -- we are attempting more re-transmissions on Linux than we are on Windows.
The end result on Windows is that we have a delay of around 30 seconds before connecting to the second address. This is the same behavior as on WinHttpHandler, so at a minimum we do not have a regression on Windows here. (cc: @karelz)
On Linux, it looks like we are either waiting longer between retries, or we are retrying more times. This additional time pushes us over the HTTP timeout imposed in .NET before attempting to connect to the second address. That timeout causes a System.OperationCanceledException
to be thrown.
I'm going to do some more work to verify that this is actually what is happening, and then I will look into reducing the number or frequency of TCP re-transmissions on Linux. That should make this behavior consistent cross platform.
This additional time pushes us over the HTTP timeout imposed in .NET before attempting to connect to the second address. That timeout causes a System.OperationCanceledException to be thrown.
@rmkerr, so if you set the timeout to be Infinite, it eventually succeeds?
Side note, it could also be beneficial to check how exactly curl
handler achieves its instant switch and see if it makes sense to implement this, since it's definitely a superior solution compared to 30 seconds delay. I'm shocked myself how well it handled the problem, the switch was almost immediate, like if network was unreachable entirely, but it definitely did try to do something as IPv6 was tried first (confirmed on my curl
test above as well as strace
output). In fact, this is why I thought at the very beginning of this issue that it was SocketHttpHandler
bug and not just a technical issue, since I didn't expect networking problem when just switching the handler did the trick.
This is just a side note though, what matters to me is that you were able to reproduce this, so everything is in good hands now, thank you a lot for taking a look into this 🙂
@stephentoub Yes! After several minutes I get a 200 response.
it could also be beneficial to check how exactly curl handler achieves its instant switch and see if it makes sense to implement this
I believe libcurl uses Happy Eyeballs: https://en.wikipedia.org/wiki/Happy_Eyeballs and defaults to 200ms for the switch: https://curl.haxx.se/libcurl/c/CURLOPT_HAPPY_EYEBALLS_TIMEOUT_MS.html
We should consider doing that within Socket.Connect{Async} on Unix.
Thanks for the info! I'll try to put together an implementation tomorrow.
Hey @stephentoub and @davidsh, I think it could be beneficial to bring Happy Eyeballs over to Windows too. There is still a significant (~20s) delay we could eliminate when connecting. Before I work on the implementation, can you think of anything that might make that infeasible?
I think it could be beneficial to bring Happy Eyeballs over to Windows too.
Windows winsock connect APIs are supposed to do this already. They are supposed to do parallel IPv4 and IPv6 connect attempts when connecting via DNS name. This feature might only be available in certain newer versions of Windows OS.
I could be wrong, but I don't think that we are taking advantage of those APIs in the case demonstrated here. That's based off my reading of the code in MultipleConnectAsync.cs, which contains the classes we'll probably have to update to make this change.
but I don't think that we are taking advantage of those APIs in the case demonstrated here.
It would probably be easier to just use the built-in support for this instead of trying to do it ourselves and have to manage multiple, simultaneous, socket connect operations.
It would probably be easier to just use the built-in support for this instead of trying to do it ourselves and have to manage multiple, simultaneous, socket connect operations.
We need to do it for Unix, anyway.
The other option is we do it at the SocketsHttpHandler level. That frees us from mucking with this old APM code in the sockets layer, and it implicitly makes it apply across platforms. But it of course then only applies to SocketsHttpHandler and not to other consumers of Socket.ConnectAsyc… then again, it may not be desirable for all consumers of Socket.ConnectAsync, so maybe we should just start at the SocketsHttpHandler level.
I think it would be reasonable to implement at the SocketsHttpHandler level. Doing it that way would also preserve the current behavior of WinHttpHandler, which would help reduce the risk involved. Looking at the code I don't think that it will be too difficult to make the change there, though I will almost certainly have to ask some questions while I work on it.
I've put together a functional implementation of Happy Eyeballs in SocketsHttpHandler that you can check out here. I'm going to spend some time tomorrow improving the performance and working on tests. Depending on how that goes I will hopefully have a PR out tomorrow or Monday.
Interesting issue. The root cause is misconfiguration of some sort. This makes a machine use an IPv6 address on a network that doesn't support it. Then even tough the DNS returned multiple addresses, the TCP connection timeout of the OS for 1 address exceeds the HttpClient timeout.
I wonder if we (corefx) are responsible to deal with such misconfiguration issues? And if we do, at what level (TCP/HTTP)? Perhaps we also need to do something about the mismatch between connection timeout and HttpClient timeout?
I noticed one of app using .netcore on my Linux NAS stops getting connection to some URLs which DNS returns IPv6 address - an earlier version of the app worked fine, so something seems to be changed in .netcore recently. Any workaround to this problem?
Does not seem critical for 3.0.
This 10 years old article describes a currently ongoing IPv6 peering dispute between Cogent and Hurricane Electric; https://www.datacenterknowledge.com/archives/2009/10/22/peering-disputes-migrate-to-ipv6
So the issue isn't so uncommon to stumble over in the wild, unfortunately.
Any progress on the implementation of happy eyeballs or similar mechanisms? Or would it be best to try to deal with this on the application side for now?
Triage: While doable via dialers (#35404), we believe it is worth to do it on its own.
It should be done in Sockets.Connect
- maybe as an option, where SocketsHttpHandler
opts into it all the time.
Note that Windows has the Happy Eyeballs feature (when resolving as name), we just don't use it. We have to decide if we want to do it ourselves x-plat, or just on Linux.
Dialers are checked in #1793 ... however, we need a few DNS APIs to make it easy. Workaround: PInvoke into the DNS APIs. Moving to 6.0
Repro: HttpClientBug.zip
I reproduced this one on Linux and I didn't have much luck on Windows.
Run repro with
dotnet run
. After a default timeout of around 60 seconds, you'll get:Doing the same by forcing older curl handler:
Please note that this issue is specific and not reproducible with just any
https
server, as majority of them work just fine. I encountered this issue when accessinghttps://translate.google.com
, which is what I used in my repro above.I reproduced this bug on latest
master
as well as .NET Core 2.1 rc1.This bug could be some sort of regression because previously my app running
master
SDK worked just fine with this URL, includingSocketHttpHandler
that I used for a longer while. It could also be regression caused by Google's servers configuration change that triggered bug existing in the code since quite some time, which is more likely. Of course this one is not reproducible on .NET Core 2.0, since there is noSocketHttpHandler
there.Thank you in advance for looking into this.
[EDIT] Inline C# source code by @karelz