dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License
10.25k stars 1.58k forks source link

Slow HttpClient operation caused by DNS resolution failure on iOS #41451

Closed rxwen closed 3 years ago

rxwen commented 4 years ago

The issue has been discussed in flutter, move it here since it's more relevant.

Our environment is iOS version: 13.3.1. iPhone 6S. (And some customers reported same problems on different devices, iPhone 7, iPhone X, latest iOS system as of now).

As a reference, I created a flutter sample project, with

flutter create --androidx -t app sample

Then, add some code in main.dart to perform a simple http get request, whenever user clicks the + button and display the time spent. (https://gist.github.com/rxwen/009c7ae4328ee799f71013f06e206b13#file-main-dart-L55) This simplest example, exhibit the same behavior.

Took about 5 second on wifi (On the same device, using safari can open the same page instantaneously under the same wifi connection)
Took about 500 ms on 4G

I run sample on flutter v1.12.13+hotfix.8, v1.9.1 in debug and release mode. All got the similar results.

Further testing showed that it may be related to dns. I tried change the url to its ip address instead of domain name, the speed isn't slow at all. And if I set the iPhone's dns to manual mode, and remove the ipv6 dns server (which is configured automatically), the problem is gone too.

So, here is the question, how come the ipv6 dns server will affect the flutter app's dns resolution? While other apps are not affected. It's not a feasible solution if we have to ask our app user to configure the dns manually when they encounter this problem. There still should be something different in dart code base that handle the dns in a different way.

rxwen commented 4 years ago

The problem is with dns resolution, I changed the sample to perform dns resolution as: `
var sw = new Stopwatch()..start();

final ia = await InternetAddress.lookup("baidu.com");

sw.stop();

_counter = sw.elapsed.inMilliseconds;

` It costs like 5 seconds to run.

The automatic dns configuration is:

image

If I removed the IPv6 dns server manually, or add a custom dns server (8.8.8.8), the resolution speed is ok.

The dig command test showed the 192.168.1.1 server is ok, and the ipv6 server doesn't.

rxwen commented 4 years ago

The router is a openwrt, it doesn't have upstream ipv6 connection.

Looks like when there are both ipv4 and ipv6 LAN connections on the iOS, the flutter performs dns resolution only with ipv6 (which should fail because there is no ipv6 upstream connection), but ignores the ipv4 dns server. Other applications still respect the ipv4 dns server setting, so they will not fail.

rxwen commented 4 years ago

Capture network traffic on iPhone showed that whenever I asked the app to lookup ip address, it always started a AAAA type lookup and failed. The A type lookup only happened once with a success result, and the result should have been cached.

The InternetAddress.lookup method waited until the type AAAA resolution failed eventually after 5 seconds and then returned. Thought it should be able to return the cached result quicker.

And I'm not sure if the ipv6 ip address is the root cause. As a user of the app reported the slow behavior, while his iPhone didn't obtained a ipv6 address at all.

If https://github.com/dart-lang/sdk/blob/master/runtime/bin/socket_base_linux.cc#L202 is the code for iOS platform, looks like it's implemented on top of getaddrinfo function.

a-siva commented 4 years ago

/cc @ZichangG

zichangg commented 4 years ago

From your description, it is likely to be relative to our POSIX socket operations. In dart, we handle IOS and MacOS the same way with read,write and other POSIX sockets methods, which is discouraged by IOS man page.

@rxwen Can you verify that this happens only on IOS but not MacOS?

Some discussion: https://github.com/dart-lang/sdk/issues/41376

rxwen commented 4 years ago

I'm not that familiar with macos app development. So a side question is, do I need to follow this guide (https://flutter.dev/desktop) to try a macos app?

We haven't received any report on android platform with this issue so far.

zichangg commented 4 years ago

You'll need to run the same code on MacOS. Not so familiar with flutter setup either but your link looks good to me.

If you have dart(which should be included in Flutter), you can just run dart path/to/your_program.dart on Mac.

Android's implementation is quite similar but not enough to draw the conclusion. They use different eventhandlers and details are slightly different, which might contain a bug.

rxwen commented 4 years ago

Sure. Will try when I'm in the reproducible environment and let you know.

My initial suspect is could it be possible that it's blocked here till timeout? If it's caused by the posix socket, does it mean that the flutter code is waiting for io_service's response on a socket without being able to return earlier?

zichangg commented 4 years ago

My initial suspect is could it be possible that it's blocked here till timeout? If it's caused by the posix socket, does it mean that the flutter code is waiting for io_service's response on a socket without being able to return earlier?

It is possible. For lookup, it is blocking. Connection will be issued after addresses are resolved. Checkout the implementation here: https://github.com/dart-lang/sdk/blob/5bf1223ea93138c55146840dc06277cf8068662f/sdk/lib/_internal/vm/bin/socket_patch.dart#L456

rxwen commented 4 years ago

Is the original design of the lookup to be blocking in the case that the underlying dns has already got a ip for the domain name? Could the lookup be async and the slow response could simply update the dns cache when it eventually arrives?

zichangg commented 4 years ago

Is the original design of the lookup to be blocking in the case that the underlying dns has already got a ip for the domain name?

As you pointed out, we wrapped info and passed into getaddrinfo. The behavior is controlled by system call. You should check getaddrinfo man page.

Could the lookup be async and the slow response could simply update the dns cache when it eventually arrives?

Apologies for using blocking in previous comment. We do put it into a Future so that it won't blocked. What I meant is following connections are concatenated to this Future and will be executed afterwards. Once lookup hangs, the following part will be delayed as well.

sortie commented 4 years ago

@rxwen Why is there an invalid IPv6 DNS server in your configuration? That seems like the problem. You're saying it's automatically configured? Maybe there's something wrong with your network and the addresses it's broadcasting? In this case, Dart is using the system getaddrinfo function which is outside of our control and is supposed to do something reasonable. In that sense, this is not a problem with Dart, and we can't do anything.

Although, as @ZichangG mentions, there is another iOS networking API that we could use, and it might resolve addresses differently. It's still a problem that there's a broken server in the DNS configuration, but this other API may smooth over the problem with concurrent DNS lookup.

rxwen commented 4 years ago

@sortie The reason I'm having a IPv6 DNS server is because my router supports ipv6, but my Internet port doesn't. So IPv6 works in LAN. I doubted its caused by my network configuration as well. But what I don't understand is, only the flutter apps (our pratical app or the simple demo I mentioned before) are affected, all other apps were working fine. And this is the only environment that I have access to that can reproduce the issue. As you may notice in my previous mentioned, some users of our app reported the same behavior (about 5 seconds delay for each network access), and the information I gathered from them are:

  1. Other apps are not affected.
  2. Some of them helped check DNS configuration, there is no IPv6 dns server configured.

Since I can't do further investigation in users' environment. I have to use whatever I can gather so far to find out the cause. At least find out a explanation to these questions above. And is there possibility that the flutter app doesn't work well with some kind of iOS network configuration?

Also, I tried in a different network environment that is similar in that it also has a IPv6 Lan, but no IPv6 external ip address. I can confirm there is a IPv6 dns server configured in the same iPhone device. But in this environment, the flutter apps are not slow.

mraleph commented 4 years ago

FWIW, Apple networking guidelines have the following section Avoid Resolving DNS Names Before Connecting to a Host, which rather aptly describes the problem we are hitting here.

Our HttpClient should probably perform two DNS resolutions in parallel (instead of doing a single getaddrinfo hoping to resolve both IPv6 and IPv4) and then use whatever comes first (maybe with a some small timeout to accommodate for small delays).

rxwen commented 4 years ago

@mraleph I thought the HttpClient might use InternessAddressType.Any internally. The type is passed to getaddrinfo, which was instructed to perform do two DNS resolutions in parallel.

mraleph commented 4 years ago

@rxwen yeah, but I suspect that things hang because getaddrinfo waits to get both IPv6 and IPv4 lookup results back. Could you try changing your sample test to do:

await InternetAddress.lookup("baidu.com", type: InternetAddressType.IPv4)

and see if it hangs just the same or returns result fast?

rxwen commented 4 years ago

I already tried this before. It returned immediately.

rxwen commented 4 years ago

@ZichangG I run the code below on mac:

import 'dart:io';
import 'dart:async';
import 'dart:core';
void main() async{
    print("hello world!");
    var ip = await InternetAddress.lookup("baidu.com");
    print("$ip");
}

It returned immediately. But on a iPhone in the same wifi. It's still very slow. Does it give any hint to you?

rxwen commented 4 years ago

I captured network package in the other network environment which also has a IPv6 LAN address. It looks like the AAAA dns query got a response soon, so the httpclient isn't blocked. It means the httpclient is indeed blocked due to the AAAA resolution timeout.

I'll try to get more information from our user in their environment and let you know.

zichangg commented 4 years ago

Looks like getaddrinfo is stuck.

There is probably no quick fixes except a shorter timeout.

Rewriting our IOS socket implementation with CFSocket and CFHost mechanism should probably work for this case. But as I mentioned in another case #41376, it also has some limitations on enabling VPN.

mraleph commented 4 years ago

@ZichangG

There is probably no quick fixes except a shorter timeout.

There are two ways to look at this problem:

I suggest we look at the second interpretation. For that there is rather straightforward fix - we should slightly rework how _NativeSocket.startConnect is implemented. Instead of issuing a single InternetAddress.lookup with type: any it should issue two separate lookups and use results of whatever comes first to start connecting. There is already logic inside that tries addresses one by one (if host has multiple addresses). The logic with multiple DNS lookups fits rather well into that.

rxwen commented 4 years ago

From a user's perspective, the second suggested way sounds logic to me. As a user of HttpClient, we don't have to worry about the details. The interface of it is straight forward to us. I guess, if the change is made in underlying dart:io, other network applications like a plain tcp connection will be relieved too.

sortie commented 4 years ago

Would multiple DNS lookups actually solve the problem? We can ask for an IPv4/IPv6 address separately, yes, but that doesn't necessarily mean we use an a IPv4/IPv6 server unless iOS getaddrinfo specifically does that. For instance, I imagine asking for an IPv4 could make iOS getaddrinfo ask its IPv6 server first, and once it times out, then ask the IPv4 server and get a response.

Additionally if we did two concurrent lookups, which response do we prefer? The one that came first? The IPv6 one? The one the OS believes to have the best route? We'll be bypassing / reimplementing the system policy here.

In my opinion, this is a libc deficiency where getaddrinfo doesn't cope with a broken IPv6 DNS server efffieciently. We can work around it, potentially, but we're not quite on the right level for this kind of policy, and it would be an imperfect workaround. We do want to have some sort of solution since this results in real user problems.

Since other iOS apps don't have this problem, it sounds like we're using the wrong API to do DNS lookups. If there's a higher level API we can use that takes care of abstracting these policies away, that would the best option for a clear solution.

mraleph commented 4 years ago

Would multiple DNS lookups actually solve the problem?

Yes. I asked what happens if you specifically ask IPv4 address and I got the answer that it resolves quick.

Since other iOS apps don't have this problem, it sounds like we're using the wrong API to do DNS lookups.

My guess other apps are following the Apple guidelines to not resolve hostnames before connecting and letting library handle that. See link from this comment.

We can debate things back and forth - but I don't think it is a good spend of our time, when there is a rather small fix to apply now, and this fix would unbreak people.

After unbreaking people we can spend time contemplating perfect solution (e.g. large scale refactoring of dart:io to use non-POSIX API on iOS and Mac OS X).

sortie commented 4 years ago

Yes. I asked what happens if you specifically ask IPv4 address and I got the answer that it resolves quick.

Super. I missed that - just wanted to confirm that workaround worked. I'm on board with concurrent lookup then, and ideally switching to a higher level API.

mraleph commented 4 years ago

@ZichangG do you think you could take a stub at changing _NativeSocket.startConnect as described above?

zichangg commented 4 years ago

Sure. But I don't know how to test on IOS.

zichangg commented 4 years ago

While I'm working on the fix, I just realized InternetAddress has a public lookup(). If I do the hack for startConnect, it will still hang if someone calls lookup. I think this concurrent lookup just makes the problem surface less if most users use HttpClient() or some other higher level apis. I'm not sure whether we should proceed with this solution. Any idea? @sortie @mraleph

sortie commented 4 years ago

The workaround should be done as close as reasonably possible to the getaddrinfo call as it doesn't have anything to do with the socket connect code. I suggest making InternetAddress.lookup() on iOS do the lookup concurrently with two futures of the backend.

mraleph commented 4 years ago

As I said before we have two ways to look at the problem (see https://github.com/dart-lang/sdk/issues/41451#issuecomment-613872570). I feel like just fixing Socket connections is going to cover a lot of issues. Most people don't really use InternetAddress.lookup directly, instead they just use Socket.connect or even more high level APIs like HttpClient.

Futhermore, fixing InternetAddress.lookup requires diverging from just using getaddrinfo directly and making decisions around how long the timeout should be etc.

Compare this with fixing _NativeSocket.startConnect - it already have logic inside which tries addresses one by one - adding logic there which would resolve hostnames concurrently seems totally natural.

Based on these two thoughts I think it is okay to fix _NativeSocket.startConnect directly, rather than attempt to fix lookup itself.

ivanryndyuk commented 4 years ago

I have exactly the same issue... Post requests through http were taking from 7s to 20s+ That took a long way to debug and find out what's wrong, from glitchy circular progress indicator to this dns resolution thing... I hope you guys will be able to fix this soon!

In the meantime I've found kinda workaround that helped me to increase the http response speed. May be that would help someone who faced the same problem.

This workaround requires writing platform channel method and implementing http requests through native iOS URLSessionDataTask. So now it takes 1-2s and that's the result I can afford! I can share the code if needed, it's pretty simple though.

rxwen commented 4 years ago

@ivanryndyuk It's great if you can share the code.

I believe it should work around the issue, since the situation happens only within flutter app.

ivanryndyuk commented 4 years ago

@rxwen OK, that's what I've got so far... just to get the idea should be enough.

In Flutter app Dart portion:

static const iosHttp = const MethodChannel('org.example/http');

Future<void> post(
    String url,
    Map<String, dynamic> body,
    Map<String, String> headers,
) async {
    if (Platform.isIOS) {
        final result = await iosHttp.invokeMapMethod(
            'post', 
            {
                'url': url,
                'body': body,
            'headers': headers,
        }
        );
        //todo: Do whatever you want with the result, it's a json response.
    } else {
        //todo: Send post request normally with `http` package.
    }
}

and the iOS part in AppDelegate.swift...

// Put this inside `application:didFinishLaunchingWithOptions:`
if let controller = window.rootViewController as? FlutterViewController {
    let httpChannel = FlutterMethodChannel(
        name: "org.example/http", 
        binaryMessenger: controller.binaryMessenger
    )
    httpChannel.setMethodCallHandler { [weak self] (call, result) in
        if call.method == "post" {
            self?.sendPostRequest(arguments: call.arguments, result: result)
        } else {
            result(FlutterMethodNotImplemented)
        }
    }
}

// Add this private method to AppDelegate
private func sendPostRequest(arguments: Any?, result: @escaping FlutterResult) {
    do {
        guard let params = arguments as? [String: Any] else {
            throw FlutterError()
        }
        guard let urlString = params["url"] as? String else {
            throw FlutterError()
        }
        guard let body = params["body"] else {
            throw FlutterError()
        }
        guard let headers = params["headers"] as? [String: String] else {
            throw FlutterError()
        }
        guard let url = URL(string: urlString) else {
            throw FlutterError()
        }
        var request = URLRequest(url: url)
        request.httpMethod = "POST"
        request.allHTTPHeaderFields = headers
        request.httpBody = try JSONSerialization.data(
            withJSONObject: body,
            options: .prettyPrinted
        )
        let task = URLSession.shared.dataTask(with: request) { data, _, err in
            guard let data = data, err == nil else {
                DispatchQueue.main.async {
                    result(err ?? FlutterError())
                }
                return
            }

            guard let json = try? JSONSerialization.jsonObject(
                with: data,
                options: .init(rawValue: 0)
            ) else {
                DispatchQueue.main.async {
                    result(err ?? FlutterError())
                }
                return
            }

            DispatchQueue.main.async {
                result(json)
            }
        }
        task.resume()
    } catch {
        result(error)
    }
}

I've omitted initialization parameters for FlutterError() to make code more readable in the example. Also it's required to make FlutterError conforming to Error protocol in order to be able to throw it or you can use custom Error. Also example requires better exception handling but don't judge too harsh. This approach helped me, I hope will help someone else too.

aviadavi commented 4 years ago

Guys, we are talking about DNS resolution. shouldn't it be done once per URL? After doing one call that takes 10s , shouldn't the application takes the value from the DNS cache? Is that what happening? or for every call to same URL it takes 10s+ ?

rxwen commented 4 years ago

@aviadavi This is what I suppose the DNS to work. Unfortunately, every call takes 10s+.

I guess is because the IPv6 resolution failed, thus subsequent call still needs to try resolving.

zichangg commented 4 years ago

@rxwen Could you please try my fix patch(https://dart-review.googlesource.com/c/sdk/+/143883)? I don't have IOS device to test it. If this solves your problem, I can land it.

rxwen commented 4 years ago

@zichangg I'd like to try it. But I'm not familiar with how to apply the patch on flutter environment, to build a ios app based on it.

I know how to apply patch for my flutter environment. It looks like flutter command will download the dart engine and sdk binaries automatically. So, I'm not sure how to instruct the flutter command to use a patch-ed sdk. Do you have any guide on this?

zichangg commented 4 years ago

@rxwen Thanks for the help. ^_^ As far as I know, there is no easy way. You either have to have flutter engine or Dart sdk.

If you do have flutter engine repo locally, dart sdk lives in ${flutter_engine_repo}/third_party/dart. You can apply the patch there(A simple copy-paste should be enough). This page talks on how to build engine on IOS. You can substitute the step 1 & 2 with applying the patch.

@zanderso Hi Zach, do you know any easier way of rebuild dart from flutter?

rxwen commented 4 years ago

I'll look into it and let you know if it works. @zichangg

rxwen commented 4 years ago

@zichangg I cloned the engine repo. But there isn't a third_party/dart directory. Am I getting the correct repo?

zanderso commented 4 years ago

@rxwen You'll have to follow https://github.com/flutter/flutter/wiki/Setting-up-the-Engine-development-environment to get everything in the right place to apply the patch.

zichangg commented 4 years ago

@rxwen I think you should follow what Zach has posted. He is an expert!

100cm commented 4 years ago

any progress on this?

zichangg commented 4 years ago

@100cm I have a fixing cl ready but I'm not able to test whether that fixes the problem. Thanks to @rxwen , he is trying to test my patch.

rxwen commented 4 years ago

@zichangg There are so many tools new to me. It took time to download the whole repo and setup the build environment for it.

Looks like I'm stucked in the flutter/tools/gn --unoptomized step. It complains about python3 compatibility. Then I modify the gn tool source code to force it use python2, it failed with following error:

$ ./flutter/tools/gn --unoptimized
Generating GN files in: out/host_debug_unopt
ERROR at //build/config/ios/ios_sdk.gni:38:9: Script returned non-zero exit code.
        exec_script("ios_sdk.py", [ "iphonesimulator" ], "list lines")
        ^----------
Current dir: /Users/lisa/projects/engine/src/out/host_debug_unopt/
Command: python /Users/lisa/projects/engine/src/build/config/ios/ios_sdk.py iphonesimulator
Returned 1.
stderr:

  File "/Users/lisa/projects/engine/src/build/config/ios/ios_sdk.py", line 12
    print "Takes one arg (SDK to find)"
                                      ^
SyntaxError: invalid syntax

See //build/toolchain/mac/BUILD.gn:9:1: whence it was imported.
import("//build/config/ios/ios_sdk.gni")
^--------------------------------------
See //BUILD.gn:6:1: which caused the file to be included.
group("default") {
^-----------------

Any idea what is the cause?

zichangg commented 4 years ago

@rxwen Thanks for the effort. I'm not quite familiar with it either. Did you pass --ios as well. (See the instructions in the IOS section).

Looks like sys.argv of running python /Users/lisa/projects/engine/src/build/config/ios/ios_sdk.py iphonesimulator is not 2. Does this get changed by forcing python 2?

if len(sys.argv) != 2:
  print "Takes two args (SDK to find and the version)"
mraleph commented 4 years ago

@zichangg @zanderso maybe somebody with appropriate Mac hardware could produce a local engine build with a patch for @rxwen and package it? I have done this before with android builds - you just need to package a subset of output files to make --local-engine work?

zanderso commented 4 years ago

@cbracken mentioned to me that @gaaclarke can help.

@gaaclarke Could you do a local engine build with @zichangg's patch on the Dart SDK, and get it to @rxwen to try out? Thanks!

edit: @jmagman has done this already, sorry for the confusion!

jmagman commented 4 years ago

I have it set up, I was waiting on Friday for @zichangg for details on which networking scenarios needed testing. (see internal Flutter & Dart chat)

jmagman commented 4 years ago

I tried to reproduce this without the patch first. My first thought was to use the Network Link Conditioner to increase the IPv6 DNS delay to simulate a timeout. However, I don't see enough knobs in there to do that while also allowing through IPv4 without a delay, so with and without the patch every lookup took the full delay amount.

My macOS corp machine doesn't allow network sharing to set up anything more sophisticated there to then share with my iOS device. And if I misconfigure my network during the quarantine I don't think anyone in my house will be pleased.

@rxwen Any hints for how I can reproduce this? Alternatively, I can share the engine binaries I built in Google Drive, if you have a gmail address. If you don't want to share it here, you can give me a 👋 on Discord in the #hackers-dart channel. Or you can try to build the engine again with python3 (you didn't specify what error you were hitting).