Consider using Happy Eyeballs or similar in SocketsHttpHandler

JustArchi commented 6 years ago

using System;
using System.Net.Http;
using System.Threading.Tasks;

internal static class Program
{
    private static async Task Main()
    {
        AppDomain.CurrentDomain.UnhandledException += OnUnhandledException;
        TaskScheduler.UnobservedTaskException += OnUnobservedTaskException;
        using (var httpClient = new HttpClient())
        {
            try
            {
                await httpClient.GetAsync("https://translate.google.com").ConfigureAwait(false);
                Console.WriteLine("OK");
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
            }
        }
    }

    private static void OnUnobservedTaskException(object sender, UnobservedTaskExceptionEventArgs e)
    {
        Console.WriteLine(e.Exception);
    }

    private static void OnUnhandledException(object sender, UnhandledExceptionEventArgs e)
    {
        Console.WriteLine(e.ExceptionObject);
    }
}

I reproduced this one on Linux and I didn't have much luck on Windows.

Run repro with dotnet run. After a default timeout of around 60 seconds, you'll get:

$ dotnet run
System.OperationCanceledException: The operation was canceled.
   at System.Net.Http.HttpClient.HandleFinishSendAsyncError(Exception e, CancellationTokenSource cts)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at HttpClientBug.Program.Main() in /tmp/httpclientbug/Program.cs:line 17

Doing the same by forcing older curl handler:

$ DOTNET_SYSTEM_NET_HTTP_USESOCKETSHTTPHANDLER=0 dotnet run
OK

Please note that this issue is specific and not reproducible with just any https server, as majority of them work just fine. I encountered this issue when accessing https://translate.google.com, which is what I used in my repro above.

I reproduced this bug on latest master as well as .NET Core 2.1 rc1.

This bug could be some sort of regression because previously my app running master SDK worked just fine with this URL, including SocketHttpHandler that I used for a longer while. It could also be regression caused by Google's servers configuration change that triggered bug existing in the code since quite some time, which is more likely. Of course this one is not reproducible on .NET Core 2.0, since there is no SocketHttpHandler there.

Thank you in advance for looking into this.

.NET Core SDK (reflecting any global.json):
 Version:   2.2.100-preview1-008636
 Commit:    6c9942bae6

Runtime Environment:
 OS Name:     debian
 OS Version:
 OS Platform: Linux
 RID:         debian-x64
 Base Path:   /opt/dotnet/sdk/2.2.100-preview1-008636/

Host (useful for support):
  Version: 2.1.0-preview3-26411-06
  Commit:  8faa8fcfcf

.NET Core SDKs installed:
  2.2.100-preview1-008636 [/opt/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.0-preview2-30475 [/opt/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.0-preview2-30475 [/opt/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.0-preview3-26411-06 [/opt/dotnet/shared/Microsoft.NETCore.App]

[EDIT] Inline C# source code by @karelz

davidsh commented 6 years ago

cc: @stephentoub

karelz commented 6 years ago

Thanks @JustArchi for your report! Can you please clarify repro status on Windows? Did it repro or not? Or did you not try at all? (that's ok, just clarifying your comment above)

karelz commented 6 years ago

@wfurt can you please take a look? cc @geoffkizer

JustArchi commented 6 years ago

@karelz I checked .NET Core 2.1 rc1 on Windows and couldn't reproduce the issue, got OK from my repro.

wfurt commented 6 years ago

I did quick try on Ubuntu 16.04 and I also get OK. What base OS do you use @JustArchi ?

JustArchi commented 6 years ago

It's Debian Testing (currently 10/Butcher) with all available updates, kernel 4.16.0-1-amd64 (Debian 4.16.5-1).

I also thought the OS could have something to do with it, but then curl handler wouldn't work either, so it's possible that it's some OS-layer incompatibility.

JustArchi commented 6 years ago

It's also good to note that I couldn't reproduce it with other https servers, only Google's one gave me issues, so it's not entirely broken either. I'm sure that my SSL certs are OK, otherwise curl wouldn't work either.

I checked my other dev machine running on Debian sid and I also couldn't reproduce my issue there either, I'll set up other one on testing in the meantime to ensure it's not on Debian's end, but then again, if it was some IP-related block or likewise, curl shouldn't work either.

JustArchi commented 6 years ago

I tested another machine on Debian Testing and couldn't reproduce the issue either, so it has to be something with one of my machines. I wonder how I can help narrow down this issue and what is the root cause of it in the first place. Do you have any idea what I could provide do to help?

In the meantime I'll keep looking, maybe I find some factor that could possibly be causing this on OS side.

wfurt commented 6 years ago

You can always try packet capture. You can also check if the name resolves to same IP address. Also is the hardware on both machines similar? number of cores would for example impact size of thread pool.

karelz commented 6 years ago

@JustArchi so it seems it is one machine problem (with a specific URL) on Debian development branch. Is that correct? If that's the case, I think the best thing is to debug it through all layers and identify which one behaves incorrectly.

JustArchi commented 6 years ago

Yes, one machine for now, and I found out the first clue.

My dev machines that I tried to reproduce this issue on (and had no luck) don't have IPv6, they used IPv4 address of the translate.google.com and that worked. On the machine with the bug reproduced, I forced usage of IPv4 by manually hardcoding 216.58.215.46 translate.google.com to /etc/hosts, and that one did the trick, I couldn't run into the bug anymore. So this issue is IPv6 related, now the question is whether this is still my machine-specific IPv6 related issue, or something more global.

Since I don't have more IPv6 machines around, could you try to reproduce this bug for me on IPv6-enabled Linux machine? I don't believe Debian has anything to do with it, so it could be any Linux machine.

I wonder if curl handler also uses IPv6 by default, this could be the reason why I see behaviour change between those two, gonna keep digging and let you know if I found out something more. It could also be some IPv6-related bug in socket handler, although in this case I'm pretty sure somebody else would find it out much sooner than me, maybe it's some specific combination, I'll try to find out.

wfurt commented 6 years ago

can you do packet capture and trace the attempt? I'm wondering if IPv6 is really used or if it falls back to IPv4 after some time.

JustArchi commented 6 years ago

I did the test and I indeed confirmed my initial thought, this is not a bug in socket http handler itself, but rather different behaviour compared to curl handler. Could be also called a regression if we're comparing default behaviour between .NET Core 2.0 and 2.1.

In general, my machine can't access translate.google.com through IPv6, why exactly - I'm going to find out myself, this is not important for now.

What is important is the fact that it seems that curl handler tries to fallback to IPv4 when IPv6 fails, this is why curl handler works.

If I force curl to use IPv6 exclusively, this also fails:

$ curl -I -m 5 -6 "https://translate.google.com"
curl: (28) Connection timed out after 5001 milliseconds

While -4 works:

$ curl -I -m 5 -4 "https://translate.google.com"
HTTP/2 403

(We're getting 403 but it's irrelevant, the connection being established matters)

So what is curl doing by default? Let's find out:

$ curl -v -I -m 5 "https://translate.google.com"
* Rebuilt URL to: https://translate.google.com/
*   Trying 2a00:1450:400c:c0a::8a...
* TCP_NODELAY set
*   Trying 216.58.215.46...
* TCP_NODELAY set
* Connected to translate.google.com (216.58.215.46) port 443 (#0)

Like you can see, it tries IPv6 first, fails, then tries IPv4 next. SocketHttpHandler doesn't do that, once it gets IPv6 from DNS resolve, it'll try to connect through that IP address only, even if it doesn't work while IPv4 address does.

Question is, if this is intended behaviour (then we can close the issue), or if perhaps we could do something to improve it (e.g. make it work like curl handler), since I strongly believe that it'd greatly benefit HttpClient if it did exactly the same what curl does - attempt to try the next (different) IP address for the same domain, if possible.

I'm pretty sure that you can reproduce this reliably with any IPv6-enabled machine requesting any domain with AAAA and A DNS entries where IPv6 address simply times out, while IPv4 works just fine. It happened to be google server in my case, why exactly - probably a coincidence, temporary issue, or some machine-specific problem I'm going to find out myself, since it's not important for this issue.

davidsh commented 6 years ago

Like you can see, it tries IPv6 first, fails, then tries IPv4 next. SocketHttpHandler doesn't do that, once it gets IPv6 from DNS resolve, it'll try to connect through that IP address only, even if it doesn't work while IPv4 address does.

SocketsHttpHandler should be trying all the IP addresses (IPv6 and IPv4) that are returned from the DNS resolver API calls. On Windows, there is an API to connect to the DNS name which will automatically try IPv6 and IPv4 in parallel to speed up connection result so that it won't have to waste time failing on the IPv6 address before it gets connected on the IPv4.

JustArchi commented 6 years ago

I wonder how I can confirm this then or debug the issue further, since this issue doesn't happen if I manually hardcode translate.google.com to 216.58.215.46 via /etc/hosts. DNS query is then omitted entirely.

I took a quick look if perhaps data returned by DNS resolve could something to do with it, but it looks fine:

$ host translate.google.com
translate.google.com is an alias for www3.l.google.com.
www3.l.google.com has address 216.58.215.46
www3.l.google.com has IPv6 address 2a00:1450:4007:808::200e

$ ping4 translate.google.com
PING www3.l.google.com (216.58.215.46) 56(84) bytes of data.

$ ping6 translate.google.com
PING translate.google.com(par21s17-in-x0e.1e100.net (2a00:1450:4007:808::200e)) 56 data bytes

JustArchi commented 6 years ago

It might have something to do with the fact that IPv4 query is not even being tried since we time out with IPv6, and we don't have enough of time to try IPv4 next. I mean, it's not intended to send 2 requests through 2 different IP addresses right away, so there has to be some kind of timeout to move forward.

If CURL got "stuck" on that IPv6, then it'd time out as well, eventually. But it's somehow smart and detects in a fraction of second that IPv6 connection fails, then moves out to IPv4 immediately, making it in time regarding supplied timeout. Socket handler probably times out on IPv6, and doesn't even have enough of time to try out IPv4 next. This is just my theory though, I really don't know socket handler internals, you're the expert here.

There is definitely a bit different behaviour regarding handling this though, since switching to curl handler solves the initial issue for me, as well as hardcoding IPv4 address manually in /etc/hosts. On top of the fact that if DNS query was faulty, then curl couldn't possibly work either.

wfurt commented 6 years ago

just for the record, I verified that it can work via IPv6 - if IPv6 works. I think this has been general problem as global site may advertise both IPv4/6 but local network or ISP may not support IPv6. Some systems allows to set preferences: http://sf-alpha.bjgang.org/wordpress/2012/08/linux-prefer-ipv4-over-ipv6-in-dual-stack-environment-and-prevent-problems-when-only-ipv4-exists/ https://wiki.vpsget.com/index.php/Prefer_IPv4_over_ipv6_._How_to_set_ipv4_precedence That may be short term solution.

JustArchi commented 6 years ago

Yeah the fact that I can't connect through IPv6 is definitely my machine issue and I'll solve it in one way or another, this is a technical difficulty that is on me.

I'm just wondering now if there is anything to improve here in this case or we should keep it like that, since technically it's not broken, but it could be improved in a way to work like curl handler did. If people have such "broken" setups (or rather lack of IPv6 without even knowing about it) then they might see regressions on linux with .NET Core 2.1 where curl handler worked just fine with it on .NET Core 2.0, while new socket handler no longer does and times out.

karelz commented 6 years ago

It would be nice to have SocketsHttpHandler more resilient, however, I don't think we should treat it as compatibility problem between curl handler and SocketsHttpHandler - rather as enhancement. (unless we find out lots of people hitting it)

I would strongly recommend to not use the curl handler as workaround, rather use the IPv4 address. Otherwise you will be stuck in the past. Our plan is to eventually get rid of the curl handler entirely (hopefully in next major release).

JustArchi commented 6 years ago

It's alright, thank you a lot for your help, I'll leave this issue open as an enhancement, but feel free to close it if you decide that it's not worth it to improve socket handler in this regard.

In the meantime I'll see how I can fix IPv6 on my machine 🙂

Have a nice day!

EDIT: In case somebody would have similar issue on linux, I added precedence ::ffff:0:0/96 100 to /etc/gai.conf - this tells getaddrinfo() to prefer IPv4 addresses over IPv6 when both are available.

karelz commented 6 years ago

From offline discussion: There is a concern that SocketsHttpHandler does not try all entries returned by DNS. We should check it. That would qualify as something we might need to fix (potentially in servicing).

karelz commented 6 years ago

@rmkerr can you please take a look at SocketsHttpHandler and how it deals with multiple DNS entries?

rmkerr commented 6 years ago

Yep, I'll take a look!

rmkerr commented 6 years ago

Based purely on code inspection this looks correct on the SocketsHttpHandler side. We're using DnsEndpoint with Socket.ConnectAsync: https://github.com/dotnet/corefx/blob/d5f6e6a10b9b1a9cb42066ebceea21e929adb487/src/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/ConnectHelper.cs#L56-L61 The documentation for DnsEndpoint makes it clear that ConnectAsync will attempt to connect to each address until a connection succeeds, regardless of the address family:

The AddressFamily property of any Socket that is created by calls to the ConnectAsync method
will be the address family of the first address to which a connection can be successfully
established (not necessarily the first address to be resolved).

There could still be a bug in Socket or in DnsEndpoint (or the docs), so this warrants more investigation. I'm going to work on reproducing this issue locally, and will update this thread when I have results.

karelz commented 6 years ago

I wonder if we have some logging in Sockets/SocketsHttpHandler that could help us confirm what is happening on @JustArchi's machine. Maybe in combo with Wireshark ...

wfurt commented 6 years ago

How long does it take to try new address @rmkerr ? There was overall 60s timeout on HTTP.

Also from diagnostic @karelz

System.OperationCanceledException: The operation was canceled. at System.Net.Http.HttpClient.HandleFinishSendAsyncError(Exception e,CancellationTokenSource cts) at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)

It would be really nice if this clearly states we were not able to connect. According to @JustArchi IPv6 never worked on his machine so TCP would never been established. Cryptic message like the one above is not that helpful for troubleshooting.

rmkerr commented 6 years ago

@karelz it might be useful to have logs if I can't reproduce the issue. I have not been able to repro the issue on Windows, but I will try on Linux before taking that approach.

@wfurt I'm not sure of the exact time, but it is far less than 60 seconds. When running the app in a console on windows there is no visible delay. I think this is likely a Linux specific issue though, so I will try it there next.

JustArchi commented 6 years ago

@karelz Since my ISP has fixed the issue and my IPv6 route is operative again, sadly I'm unable to assist further due to not being able to reproduce this issue anymore myself, at least physically.

However, I can definitely provide some further details about what exactly was happening:

My linux machine had IPv6 interface up and running, with defined (proper) routes and nothing on the OS level would indicate that the route is non-operative, before in fact using it, such as trying to ping other machine through IPv6.
When tried to ping, requests were timing out, but not showing as unreachable, which confirms that the route was valid and up, but impossible to use. This is different and can be compared to lack of internet connection with router up and running (ISP networking problems) vs unreachable network (your router being powered down / not responding).

Therefore reproducing this issue might be tricky, because normally when there is no IPv6 support or route, then this address is not even being used, or route is immediately announced as being unreachable. You'd need to cause a networking issue on the other side, such as manually dropping all IPv6 packets from given machine to outside world e.g. on router, without the host machine knowing that you're intentionally cutting IPv6 traffic despite of route being up. This is different because it leads to valid connection timing out vs router announcing that the network is unreachable.

To me it looked like HttpClient did the thing right by trying to connect through IPv6 firstly, but due to excessive timeout, IPv4 try never happened, as we reached HttpClient timeout before in fact even trying other DNS endpoint. curl handler could return from that faulty IPv6 connection immediately, and this is why it worked, because IPv4 was tried next. This is only my theory though of how it looked like, the issue might be entirely different, so it should be taken with a grain of salt, but I confirmed that curl indeed does that by verbosely seeing what was happening while my issue was up, with output above, so very likely curl handler right in HttpClient did the very same thing.

Perhaps it could be possible to reproduce this issue on OS level artificially, without even fiddling with the router, using ip6tables:

ip6tables -P INPUT DROP
ip6tables -P FORWARD DROP
ip6tables -P OUTPUT DROP

This could potentially reproduce my issue, since the kernel won't announce packets being dropped and just silently not respond at all. I didn't check if this in fact reproduces the issue, but I think it should. I can try to run into this issue if you'd like to, but only artificially, as physically everything is fine with my machine again.

Hope it helps.

wfurt commented 6 years ago

If you get repro again, could you run this @JustArchi strace -t -f -e connect ~/latest/dotnet run 2>&1 | grep AF_INET

That would show sequence and timing for attempts to connect. I did more digging and in my setup IPv6 fails immediately with ENETUNREACH and that may be different than yours.

JustArchi commented 6 years ago

Hooray! My ip6tables trick in fact did work and reproduced this issue artificially, so I can still be useful after all 🙂

Here are the results (I have local DNS server for name resolve):

root@archi:/tmp/test# strace -t -f -e connect dotnet run 2>&1 | grep AF_INET
[pid 20589] 20:32:02 connect(39, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
[pid 20589] 20:32:02 connect(39, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("216.58.206.238")}, 16) = 0
[pid 20589] 20:32:02 connect(39, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2a00:1450:4007:815::200e", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
[pid 20589] 20:32:02 connect(37, {sa_family=AF_INET6, sin6_port=htons(443), inet_pton(AF_INET6, "2a00:1450:4007:815::200e", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
# My repro is stuck for around 60 seconds here, exits silently without printing exception due to grepping

root@archi:/tmp/test# DOTNET_SYSTEM_NET_HTTP_USESOCKETSHTTPHANDLER=0 strace -t -f -e connect dotnet run 2>&1 | grep AF_INET
[pid 20882] 20:34:04 connect(36, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
[pid 20882] 20:34:04 connect(36, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("216.58.206.238")}, 16) = 0
[pid 20882] 20:34:04 connect(36, {sa_family=AF_INET6, sin6_port=htons(443), inet_pton(AF_INET6, "2a00:1450:4007:815::200e", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
[pid 20881] 20:34:04 connect(36, {sa_family=AF_INET6, sin6_port=htons(443), inet_pton(AF_INET6, "2a00:1450:4007:815::200e", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
[pid 20881] 20:34:05 connect(37, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("216.58.206.238")}, 16) = -1 EINPROGRESS (Operation now in progress)
# Program finishes here almost immediately without being stuck, repro "OK" not printed due to grepping

I believe you should be able to reproduce it as well by just following my ip6tables trick above. Of course, you still require a machine that has IPv6 enabled and operative.

Let me know if I can help further.

rmkerr commented 6 years ago

Thanks for all the info @JustArchi! I'm able to reproduce this issue on Linux and can almost reproduce it on Windows (more on that later). Rather than taking the iptables approach, I edited my hosts file to add a test hostname with two entries:

0100:f8b0:400a:809::200e local.test.addr
127.0.0.1 local.test.addr

The IPv6 address does not exist, and the IPv4 address can point to any server that will return a response.

On both Windows and Linux we attempt to connect to the IPv6 address first. We receive no response, and attempt several SYN re-transmissions. There appears to be a platform difference here -- we are attempting more re-transmissions on Linux than we are on Windows.

The end result on Windows is that we have a delay of around 30 seconds before connecting to the second address. This is the same behavior as on WinHttpHandler, so at a minimum we do not have a regression on Windows here. (cc: @karelz)

On Linux, it looks like we are either waiting longer between retries, or we are retrying more times. This additional time pushes us over the HTTP timeout imposed in .NET before attempting to connect to the second address. That timeout causes a System.OperationCanceledException to be thrown.

I'm going to do some more work to verify that this is actually what is happening, and then I will look into reducing the number or frequency of TCP re-transmissions on Linux. That should make this behavior consistent cross platform.

stephentoub commented 6 years ago

This additional time pushes us over the HTTP timeout imposed in .NET before attempting to connect to the second address. That timeout causes a System.OperationCanceledException to be thrown.

@rmkerr, so if you set the timeout to be Infinite, it eventually succeeds?

JustArchi commented 6 years ago

Side note, it could also be beneficial to check how exactly curl handler achieves its instant switch and see if it makes sense to implement this, since it's definitely a superior solution compared to 30 seconds delay. I'm shocked myself how well it handled the problem, the switch was almost immediate, like if network was unreachable entirely, but it definitely did try to do something as IPv6 was tried first (confirmed on my curl test above as well as strace output). In fact, this is why I thought at the very beginning of this issue that it was SocketHttpHandler bug and not just a technical issue, since I didn't expect networking problem when just switching the handler did the trick.

This is just a side note though, what matters to me is that you were able to reproduce this, so everything is in good hands now, thank you a lot for taking a look into this 🙂

rmkerr commented 6 years ago

@stephentoub Yes! After several minutes I get a 200 response.

stephentoub commented 6 years ago

it could also be beneficial to check how exactly curl handler achieves its instant switch and see if it makes sense to implement this

I believe libcurl uses Happy Eyeballs: https://en.wikipedia.org/wiki/Happy_Eyeballs and defaults to 200ms for the switch: https://curl.haxx.se/libcurl/c/CURLOPT_HAPPY_EYEBALLS_TIMEOUT_MS.html

We should consider doing that within Socket.Connect{Async} on Unix.

https://tools.ietf.org/html/rfc8305

rmkerr commented 6 years ago

Thanks for the info! I'll try to put together an implementation tomorrow.

rmkerr commented 6 years ago

Hey @stephentoub and @davidsh, I think it could be beneficial to bring Happy Eyeballs over to Windows too. There is still a significant (~20s) delay we could eliminate when connecting. Before I work on the implementation, can you think of anything that might make that infeasible?

davidsh commented 6 years ago

I think it could be beneficial to bring Happy Eyeballs over to Windows too.

Windows winsock connect APIs are supposed to do this already. They are supposed to do parallel IPv4 and IPv6 connect attempts when connecting via DNS name. This feature might only be available in certain newer versions of Windows OS.

rmkerr commented 6 years ago

I could be wrong, but I don't think that we are taking advantage of those APIs in the case demonstrated here. That's based off my reading of the code in MultipleConnectAsync.cs, which contains the classes we'll probably have to update to make this change.

davidsh commented 6 years ago

but I don't think that we are taking advantage of those APIs in the case demonstrated here.

It would probably be easier to just use the built-in support for this instead of trying to do it ourselves and have to manage multiple, simultaneous, socket connect operations.

stephentoub commented 6 years ago

It would probably be easier to just use the built-in support for this instead of trying to do it ourselves and have to manage multiple, simultaneous, socket connect operations.

We need to do it for Unix, anyway.

stephentoub commented 6 years ago

The other option is we do it at the SocketsHttpHandler level. That frees us from mucking with this old APM code in the sockets layer, and it implicitly makes it apply across platforms. But it of course then only applies to SocketsHttpHandler and not to other consumers of Socket.ConnectAsyc… then again, it may not be desirable for all consumers of Socket.ConnectAsync, so maybe we should just start at the SocketsHttpHandler level.

rmkerr commented 6 years ago

I think it would be reasonable to implement at the SocketsHttpHandler level. Doing it that way would also preserve the current behavior of WinHttpHandler, which would help reduce the risk involved. Looking at the code I don't think that it will be too difficult to make the change there, though I will almost certainly have to ask some questions while I work on it.

rmkerr commented 6 years ago

I've put together a functional implementation of Happy Eyeballs in SocketsHttpHandler that you can check out here. I'm going to spend some time tomorrow improving the performance and working on tests. Depending on how that goes I will hopefully have a PR out tomorrow or Monday.

tmds commented 6 years ago

Interesting issue. The root cause is misconfiguration of some sort. This makes a machine use an IPv6 address on a network that doesn't support it. Then even tough the DNS returned multiple addresses, the TCP connection timeout of the OS for 1 address exceeds the HttpClient timeout.

I wonder if we (corefx) are responsible to deal with such misconfiguration issues? And if we do, at what level (TCP/HTTP)? Perhaps we also need to do something about the mismatch between connection timeout and HttpClient timeout?

maisun commented 6 years ago

I noticed one of app using .netcore on my Linux NAS stops getting connection to some URLs which DNS returns IPv6 address - an earlier version of the app worked fine, so something seems to be changed in .netcore recently. Any workaround to this problem?

karelz commented 5 years ago

Does not seem critical for 3.0.

oddstr13 commented 5 years ago

This 10 years old article describes a currently ongoing IPv6 peering dispute between Cogent and Hurricane Electric; https://www.datacenterknowledge.com/archives/2009/10/22/peering-disputes-migrate-to-ipv6

So the issue isn't so uncommon to stumble over in the wild, unfortunately.

Any progress on the implementation of happy eyeballs or similar mechanisms? Or would it be best to try to deal with this on the application side for now?

karelz commented 5 years ago

Triage: While doable via dialers (#35404), we believe it is worth to do it on its own. It should be done in Sockets.Connect - maybe as an option, where SocketsHttpHandler opts into it all the time.

Note that Windows has the Happy Eyeballs feature (when resolving as name), we just don't use it. We have to decide if we want to do it ourselves x-plat, or just on Linux.

karelz commented 4 years ago

Dialers are checked in #1793 ... however, we need a few DNS APIs to make it easy. Workaround: PInvoke into the DNS APIs. Moving to 6.0

dotnet / runtime

Consider using Happy Eyeballs or similar in SocketsHttpHandler #26177