Reduce the impact of blocking DNS calls on Unix

davidfowl commented 3 years ago

After helping a customer look at a thread pool starvation case on Linux on .NET Core 3.1, I ended up here. After doing some research and with some discussion on Twitter, it turns out that getaddrinfo_a uses an internal thread pool and blocks on getaddrinfo and isn't doing any async I/O. This change is an improvement over what we had before because our thread pool doesn't grow, but I'm not sure this change is a net positive in the long run. The thread pool limits are controlled by compile-time constants in glibc (essentially, another library is doing async over sync for us on a less controllable thread pool...).

I wonder if we're better off controlling this blocking code ourselves, and maybe it should be possible to turn this off with a configuration switch.

The other improvement I was thinking about was allowing only one pending request per host name at a time. That would help in situations where DNS is slow and new blocking calls keep being issued for the same host name on thread pool threads (which is the case the customer ran into).
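To make the "one pending request per host name" idea concrete, here is a minimal sketch in Python rather than the runtime's actual code. It reads the idea as coalescing: concurrent callers for the same host share the result of a single blocking lookup. The class name `CoalescingResolver` and its `lookup` method are hypothetical and do not come from this thread; only `socket.getaddrinfo` is a real API.

```python
import socket
import threading
from concurrent.futures import Future


class CoalescingResolver:
    """Hypothetical helper: at most one in-flight lookup per host name."""

    def __init__(self, resolve=socket.getaddrinfo):
        self._resolve = resolve  # blocking function: (host, port) -> result
        self._lock = threading.Lock()
        self._pending = {}  # host -> Future shared by concurrent callers

    def lookup(self, host):
        with self._lock:
            future = self._pending.get(host)
            leader = future is None
            if leader:
                # First caller for this host: it will do the real lookup.
                future = Future()
                self._pending[host] = future

        if not leader:
            # A lookup for this host is already in flight; reuse its answer
            # instead of issuing another blocking call.
            return future.result()

        try:
            future.set_result(self._resolve(host, None))
        except Exception as exc:
            # Followers see the same failure as the leader.
            future.set_exception(exc)
        finally:
            with self._lock:
                del self._pending[host]
        return future.result()
```

In this sketch the followers still block while they wait for the shared result. In an async implementation they would await the shared task instead, so only the one thread doing the real lookup is tied up while DNS is slow.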

cc @geoffkizer @stephentoub @scalablecory

benaadams commented 3 years ago

This would work fine in the specific example here (async DNS) because we know we're about to invoke a (potentially) blocking call.

But I've seen lots of code that ends up blocking somewhere deep in a call stack, and callers aren't even aware this is happening. In this case, the code is executing on the regular threadpool and it's not obvious how it would be moved to a separate threadpool with different execution semantics.

That's why I think the solution here has to address arbitrary blocking code on the regular thread pool.

Which is what https://github.com/dotnet/runtime/pull/47366 was about: detecting the blocking, then applying queuing mitigations for that call path.

VSadov commented 3 years ago

Yes, there was a prototype where:

- Workers were decoupled from task queues so that adding a worker would not change the cost of dispatching/stealing tasks.
- Workers performing potentially blocking calls (Sleep and WaitForSingleObject for starters) were not counted as active while in the call. If the pool needs a worker during that, it could activate another worker within the same quota.
- Threads that block natively in unknown system calls could still drain the pool, but it would be detected by the Gate thread, with much shorter latency, since adding a thread is less of a deal.
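The second point, not counting blocked workers against the active quota, can be illustrated with a rough sketch. This is not the prototype's code: `ActiveQuota` and `run_demo` are hypothetical names, and the sketch uses one thread per task for brevity instead of a real worker pool.

```python
import threading
from contextlib import contextmanager


class ActiveQuota:
    """Caps how many workers may be active; blocked workers do not count."""

    def __init__(self, limit):
        self._permits = threading.Semaphore(limit)

    @contextmanager
    def active(self):
        # A worker must hold a permit while it runs a task.
        self._permits.acquire()
        try:
            yield
        finally:
            self._permits.release()

    @contextmanager
    def blocking(self):
        # Hand the permit back for the duration of a potentially blocking
        # call so another worker can become active within the same quota.
        self._permits.release()
        try:
            yield
        finally:
            self._permits.acquire()


def run_demo(tasks=8, limit=2):
    quota = ActiveQuota(limit)
    # Every task waits for all the others inside its "blocking call".
    # That can only succeed if blocked workers give up their permit.
    rendezvous = threading.Barrier(tasks)
    finished = []

    def task(i):
        with quota.active():
            with quota.blocking():
                rendezvous.wait(timeout=5)
            finished.append(i)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(finished)
```

With a quota of 2 and 8 tasks, the barrier is only reached because each task releases its permit before blocking. If blocked workers were counted as active, at most two tasks could ever reach the barrier and the demo would time out.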

I think Go handles blocking calls in a similar way, except in our case threads are always 1:1 with OS threads.

An occasional need for an extra 10-100 threads was easily tolerated. I had tests that did random Sleep(100) calls in tasks and still completed without minute-long hiccups.

As I see it: if you have a call that often blocks, let's say Sleep(100) for simplicity, and you must call it 100 times with no way around that, then you can do it concurrently or you can do it sequentially. In the concurrent case you need more threads, which you can create, within reason. In the sequential case you need more wall-clock time, and you can't create that.
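The trade-off is easy to check with a small experiment. The sketch below is in Python rather than .NET, and the function names are made up for illustration. With 100 calls of 100 ms each, the sequential run needs roughly 10 seconds of wall-clock time, while the concurrent run needs on the order of 100 ms plus thread start-up cost, at the price of 100 threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_call(seconds):
    # Stand-in for a call that blocks its thread, like Sleep(100).
    time.sleep(seconds)


def run_sequential(calls, seconds):
    """Run the blocking calls one after another on a single thread."""
    start = time.perf_counter()
    for _ in range(calls):
        blocking_call(seconds)
    return time.perf_counter() - start


def run_concurrent(calls, seconds):
    """Run every blocking call on its own thread."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=calls) as pool:
        for _ in range(calls):
            pool.submit(blocking_call, seconds)
        # Leaving the with-block waits for all submitted calls to finish.
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sequential: {run_sequential(100, 0.1):.2f}s")
    print(f"concurrent: {run_concurrent(100, 0.1):.2f}s")
```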

There are obviously other costs to adding a thread and hogging apps will eventually see them. Starvation tolerance is a plan-B feature, to be used after plan-A, which is "use async".

tmds commented 3 years ago

> After doing some research and with some discussion on Twitter, it turns out that getaddrinfo_a uses an internal thread pool and blocks on getaddrinfo and isn't doing any async IO.

Several distros use systemd-resolved instead of the default glibc DNS implementation. It would be interesting to know whether this limitation applies to systemd-resolved as well.

Are there open bugs for the issues you're running into with the system DNS?

If we implemented our own resolver we would probably need to do the same, which isn't ideal.

Yes, if there is a managed implementation, it should be opt-in. The system DNS is aware of configuration stuff the managed implementation would not know about. For example, systemd-resolved knows what domain names are on my VPN.

davidfowl commented 3 years ago

Seems like it's asynchronous, based on this text: https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html

dotnet / runtime

Reduce the impact of blocking DNS calls on Unix #48566