dotnet / runtime


Reduce the impact of blocking DNS calls on Unix #48566

Closed davidfowl closed 3 years ago

davidfowl commented 3 years ago

After helping a customer look at a thread pool starvation case on Linux on .NET Core 3.1 I ended up here. After doing some research and with some discussion on Twitter, it turns out that getaddrinfo_a uses an internal thread pool and blocks on getaddrinfo, and isn't doing any async IO. This change is an improvement over what we had before because our threadpool doesn't grow, but I'm not sure this change is a net positive in the long run. The thread pool limits are controlled by compile-time constants in glibc (essentially, another library is doing async over sync for us on a less controllable threadpool...).

I wonder if we're better off controlling this blocking code and maybe it should be possible to turn this off with a configuration switch.

The other improvement I was thinking about was only allowing one pending request to a specific host name concurrently. That would improve situations where DNS is slow and new blocking calls are issued for the same host name (which is the case the customer ran into) on thread pool threads.

cc @geoffkizer @stephentoub @scalablecory

ghost commented 3 years ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

davidfowl commented 3 years ago

Also, most other platforms have a similar issue here but usually offer the "built-in OS version" and another version of the API. See Node's excellent documentation here:

https://nodejs.org/docs/latest/api/dns.html#dns_implementation_considerations

stephentoub commented 3 years ago

After helping a customer look at a thread pool starvation case on linux on .NET Core 3.1 I ended up here.

Just to be clear, that implementation hasn't shipped yet in any supported .NET release; it was only merged in January for .NET 6.

The other improvement I was thinking about was only allowing one pending request to a specific host name concurrently. That would improve situations where DNS is slow and new blocking calls are issued for the same host name (which is the case the customer ran into) on thread pool threads.

This is largely orthogonal. Regardless of how it's implemented, a "coalesce all pending calls for a given address into one" can be added, layered on top of the underlying operation via a transient address->task cache (that's possible for pretty much any kind of async operation). It is more impactful, though, the more resources are consumed by a pending operation, and if that includes a blocked thread, that's relatively large.
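
For illustration, a minimal sketch of the kind of transient address->task cache described above, written as a wrapper over the public Dns API (the type and member names here are hypothetical, not the runtime's internal implementation):

using System;
using System.Collections.Concurrent;
using System.Net;
using System.Threading.Tasks;

// Hypothetical wrapper, not the actual Dns internals: coalesces concurrent
// lookups for the same host name into a single in-flight operation.
static class CoalescingDns
{
    private static readonly ConcurrentDictionary<string, Task<IPAddress[]>> s_pending =
        new(StringComparer.OrdinalIgnoreCase);

    public static Task<IPAddress[]> GetHostAddressesAsync(string host)
    {
        if (s_pending.TryGetValue(host, out var inFlight))
            return inFlight; // piggyback on the pending lookup

        var tcs = new TaskCompletionSource<IPAddress[]>(TaskCreationOptions.RunContinuationsAsynchronously);
        Task<IPAddress[]> winner = s_pending.GetOrAdd(host, tcs.Task);
        if (winner != tcs.Task)
            return winner; // another caller registered an entry first

        _ = Task.Run(async () =>
        {
            try
            {
                tcs.SetResult(await Dns.GetHostAddressesAsync(host));
            }
            catch (Exception e)
            {
                tcs.SetException(e);
            }
            finally
            {
                // The cache entry only lives for the duration of the lookup.
                s_pending.TryRemove(host, out _);
            }
        });

        return tcs.Task;
    }
}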

This change is an improvement over what we had before because our threadpool doesn't grow but I'm not sure this change is a net positive in the long run.

I agree, though if the implementation is exactly as you describe, I'm not even sure it's a worthwhile improvement over what we had before. Such an implementation has two main benefits over using the same pool:

  1. If the caller synchronously blocks on the async operation, this change eliminates the dependency within the same resource pool, reducing the chances of stalls.
  2. It effectively changes prioritization by giving this set of operations a dedicated set of processors rather than being prioritized against all other work items in the pool.

But it also has cons over using the same pool, namely that those other-pool threads end up competing for resources with the main thread pool, which can't use or manage or scale them. It also further bifurcates our implementations, adding some non-trivial interop code that only works on Linux; on macOS, for example, we still just do our own async-over-sync. Interestingly, there's a follow-up PR in https://github.com/dotnet/runtime/pull/47036 to add cancellation to that Linux code, but that PR calls out that all it supports is removing the pending entry from a queue, and once it's in flight, it can't be canceled: that fits with the implementation you describe. And if that's all it's doing, enabling that same form of cancellation with our own async-over-sync should be close to a one-line change (passing the cancellation token as an argument to the task: https://github.com/dotnet/runtime/blob/e5c7b036a094a420040c706b4fd275c2cf153a09/src/libraries/System.Net.NameResolution/src/System/Net/Dns.cs#L635).
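
As a rough sketch of what that "one-line change" amounts to, assuming a hypothetical blocking helper standing in for the existing getaddrinfo-based path (the real code is at the Dns.cs link above):

using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

static class QueueCancellationSketch
{
    // Hypothetical stand-in for the existing blocking getaddrinfo-based resolution.
    private static IPAddress[] ResolveSynchronously(string host) => Dns.GetHostAddresses(host);

    public static Task<IPAddress[]> ResolveAsync(string host, CancellationToken cancellationToken)
    {
        // Passing the caller's token (instead of CancellationToken.None) lets the
        // thread pool drop the work item if cancellation is requested before it
        // starts running; once the blocking call is in flight it still cannot be
        // stopped, matching the queue-removal-only cancellation described above.
        return Task.Factory.StartNew(
            static s => ResolveSynchronously((string)s!),
            host,
            cancellationToken,
            TaskCreationOptions.DenyChildAttach,
            TaskScheduler.Default);
    }
}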

If we rule out caring about callers doing sync-over-async with the resulting task (I'm not saying we can discard that, just that doing so is problematic code, anyway), then I'd expect just reverting back to what we had previously would be a better choice; in typical usage, it'll effectively become as if the call was just made sync, with the pool thread plucking the item it just queued out of its local queue. (And then consider the aforementioned coalescing, which could help in some circumstances but add overhead in others.)

cc: @gfoidl

gfoidl commented 3 years ago

Oh, that's interesting and makes #34633 less useful, as pointed out. (It seems to be a bad glibc API with room for improvement, also in its docs...)

A different way to mitigate this is to have a look at https://c-ares.haxx.se (pointed to from the twitter-discussion mentioned above). I just had a quick look at license (MIT-based) and docs, and it looks like c-ares provides all the functionality we need (incl. cancellation). Additionally it's not glibc-based, so async DNS resolution could be used on more platforms.

(There are other libs listed but none of them seems fine to use -- either license or implementation).

If this is an option, I can try to make a prototype to see how things go. So the steps would be

  1. revert #34633 and close the PR that would add cancellation
  2. make the prototype that uses c-ares
  3. productize the c-ares wrapper for libraries
davidfowl commented 3 years ago

@stephentoub

Just to be clear, that implementation hasn't shipped yet in any supported .NET release; it was only merged in January for .NET 6.

Yep, I'm aware.

This is largely orthogonal. Regardless of how it's implemented, a "coalesce all pending calls for a given address into one" can be added, layered on top of the underlying operation via a transient address->task cache (that's possible for pretty much any kind of async operation). It is more impactful, though, the more resources are consumed by a pending operation, and if that includes a blocked thread, that's relatively large.

Right. It is orthogonal, but I think it would reduce the chance of the starvation being caused.

I agree, though if the implementation is exactly as you describe, I'm not even sure it's a worthwhile improvement over what we had before.

šŸ‘šŸ¾ I think we should revert.

davidfowl commented 3 years ago

@gfoidl

A different way to mitigate this is to have a look at https://c-ares.haxx.se (pointed to from the twitter-discussion mentioned above). I just had a quick look at license (MIT-based) and docs, and it looks like c-ares provides all the functionality we need (incl. cancellation). Additionally it's not glibc-based, so async DNS resolution could be used on more platforms.

Yes, I think this might be worth looking at (it also has wide platform support). This is what nodejs uses under the covers for dns.resolve, but dns.lookup uses getaddrinfo. I dug up the history of problems with the async-over-sync model in nodejs: it caused lots of problems, which led them to implement dns.resolve (though there are some subtle differences).

The other option is a fully managed version of DNS resolution. It would also help us improve our UDP stack. This is the approach that golang and mono took (cc @migueldeicaza).

I think no matter what we do, we'd need a way to "use OS defaults" in case people configure custom DNS servers and want their .NET apps to automagically use them.

stephentoub commented 3 years ago

This is the approach that golang and mono took

https://github.com/dotnet/runtime/issues/25793

davidfowl commented 3 years ago

2 birds with one stone then! https://github.com/dotnet/runtime/issues/19443

geoffkizer commented 3 years ago

I think that if we want to switch implementations of our DNS client, then a fully managed implementation is probably the best option -- as opposed to relying on a 3rd party native library. We could use the mono code as the basis for this (though I haven't looked at it and have no idea how complete/performant it is).

There are some other DNS-related features we could enable if we did that. One is retrieving TTL info. Another is discovering HTTPSVC records, see https://github.com/dotnet/runtime/issues/43643.

wfurt commented 3 years ago

I think the biggest difficulty would be matching OS behavior, e.g. I feel it would be strange if OS resolution gives different answers than .NET. DNS itself is not that hard IMHO. But on Unix, name resolution can be configured by a cascade of NSS modules. There are additional directives in /etc/resolv.conf besides just name servers. On some systems, systemd can synthesize records. On macOS, resolution can come from mDNS aka multicast. Perhaps a project for runtime labs?

geoffkizer commented 3 years ago

I think the biggest difficulty would be matching OS behavior, e.g. I feel it would be strange if OS resolution gives different answers than .NET.

Yeah, I agree. Golang lets you choose -- you can use OS resolution or you can use their custom name resolution. If we implemented our own resolver we would probably need to do the same, which isn't ideal.

Perhaps project for runtime labs?

Yeah, this would be a great project for runtime labs.

davidfowl commented 3 years ago

OK so summary:

gfoidl commented 3 years ago

Revert the current async change

I'll revert and close the related PRs.

ManickaP commented 3 years ago

Triage: Scoped down to the third bullet:

davidfowl commented 3 years ago

Why is this future? Do we think it's too risky?

wfurt commented 3 years ago

Future does not exclude now/6.0. But we would not hold the release for it @davidfowl

davidfowl commented 3 years ago

That seems fine but I would leave it in 6.0 until the end of the milestone, then it could be moved. Otherwise there's no way to tell the difference between "we will look at it in 6.0 if we have time" and "we're not going to look at it in 6.0".

davidfowl commented 3 years ago

https://github.com/uber/denial-by-dns has a great test suite

stephentoub commented 3 years ago

Here's an example of what this could look like: https://github.com/stephentoub/runtime/commit/282f62ef18c7144812eddaf3dfa32bde7565bfa4 and the positive and negative impact it can have (note that my commit relies on https://github.com/dotnet/runtime/pull/48842):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Linq;
using System.Net;
using System.Threading.Tasks;

[MemoryDiagnoser]
public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    [Benchmark]
    public async Task DnsSerial()
    {
        for (int i = 0; i < 1_000; i++)
            await Dns.GetHostAddressesAsync("www.microsoft.com");
    }

    [Benchmark]
    public async Task DnsConcurrent()
    {
        for (int i = 0; i < 100; i++)
            await Task.WhenAll(Enumerable.Range(0, 10).Select(i => Dns.GetHostAddressesAsync("www.microsoft.com")));
    }

    [Benchmark]
    public async Task DnsUniqueParallel()
    {
        await Task.WhenAll(Enumerable.Range(0, 4).Select(i => Task.Run(async () =>
        {
            string address = i switch
            {
                0 => "www.microsoft.com",
                1 => "www.bing.com",
                2 => "dotnet.microsoft.com",
                _ => "azure.microsoft.com",
            };

            // Use a different loop variable name to avoid conflicting with the outer lambda parameter 'i'.
            for (int j = 0; j < 1_000; j++)
                await Dns.GetHostAddressesAsync(address);
        })));
    }
}
| Method            | Toolchain       | Mean      | Ratio | Allocated |
|-------------------|-----------------|-----------|-------|-----------|
| DnsSerial         | /master/corerun | 187.44 ms | 1.00  | 141 KB    |
| DnsSerial         | /pr/corerun     | 190.32 ms | 1.02  | 391 KB    |
| DnsConcurrent     | /master/corerun | 42.83 ms  | 1.00  | 210 KB    |
| DnsConcurrent     | /pr/corerun     | 18.59 ms  | 0.43  | 132 KB    |
| DnsUniqueParallel | /master/corerun | 212.65 ms | 1.00  | 658 KB    |
| DnsUniqueParallel | /pr/corerun     | 214.92 ms | 1.02  | 1,658 KB  |

On the pros side, concurrent requests for the same address are able to coalesce to the same task, which means no additional allocation for that request, no additional threads blocked for that request, and for some of the requests significantly reduced time to execution because they may come in at the tail end of the actual DNS request finishing and be able to use it. If you were in a bad situation where you were blocking tons of thread pool threads all for the same address, this would help a lot.

On the cons side, the implementation adds relatively large allocation overheads: when a task isn't reused, it's still paying for the ability for it to be, which means the allocation inside of the ConcurrentDictionary for the relevant node, plus the additional ContinueWith allocations to be able to remove the node from the dictionary, plus an extra continuation in the task, which basically guarantees the task will need to upgrade from being able to store the caller's await continuation in its continuation field to instead needing to allocate a list to store the multiple continuations. It also means some contention for mutations on the dictionary. And cancelable Dns requests that do have cancellation requested, that are stuck in the thread pool's queue, and that aren't reused will no longer be removed from the thread pool's queue.

There are obviously possible variations on my implementation, but they all have similar tradeoffs. For example, we could enable ref counting on the tracked requests in order to be able to cancel them if all associated requests had cancellation requested, but that would incur another relatively large allocation increase.

davidfowl commented 3 years ago

Some possibilities:

I think we have to pick one of these tradeoffs; I don't think "do nothing" is an answer here.

davidfowl commented 3 years ago

Oh, another option is to queue a limited number of these to the thread pool, then start spawning new threads. That might be simpler.

stephentoub commented 3 years ago

I don't think "do nothing" is an answer here.

I mean, we've "done nothing" here for 20 years. :wink: It wasn't until 2018 in .NET Core 2.1 that the Windows implementation switched to using async I/O down to the OS; prior to that (and even since then on non-Windows), it just queued a work item to the thread pool. What's changed that makes "do nothing" not an answer? I'm actually asking.

With regards to possible solutions, what specific problem / worst case are we trying to solve? Again, actual question, because that impacts what's done. For example, you linked to https://github.com/uber/denial-by-dns above, and it calls out a test where you queue N > threads-in-pool DNS requests with an unreachable DNS server and then issue a DNS request against localhost. Just throttling all DNS requests is not going to make that test pass any faster than it does today... it'll actually make the test take longer than it does today. Coalescing does help that test.

I'm having trouble coming up with a problem for which just throttling DNS requests is the right answer. I could see it being added in combination with coalescing.

Oh another option, is to queue a limited number of these to the thread pool, then start spawning new threads. That might be simpler.

And quite expensive for the case where there's nothing wrong.

davidfowl commented 3 years ago

What's changed that makes "do nothing" not an answer? I'm actually asking

davidfowl commented 3 years ago

I don't have the answer, but DNS resolution is special because it's one of the places where we explicitly do async over sync, and as a customer you wouldn't know until you ran into a problem. This isn't even documented, it's just something that happens. What's worse is that it usually happens via HttpClient....

stephentoub commented 3 years ago

I don't have the answer

I'm still trying to understand the problem we're trying to solve :) For example, is it that DNS becomes unreachable and every DNS request regardless of hostname blocks indefinitely? Is it that DNS is for some reason stalling on a particular hostname but not others? Is it that DNS is just slow but the 99% case is that all requests are for the same hostname? Etc. What was the case with the example you're citing from the past week, for example... what were the conditions that led to starvation there? And, the OS isn't caching DNS results, or this is only happening in situations where the OS isn't able to use its cache?

davidfowl commented 3 years ago

I want thread pool starvation not to occur because DNS calls are taking too long to resolve. I want it to only block future DNS resolutions; nothing else in the system should die.

This is what the customer sent:

The root cause is: one of our dns pods lost its network, but k8s still consider it is alive.

This was in Kubernetes. One of the DNS server pods stopped working and DNS queries from specific pods were hanging.

The code in the app that started to fail was indirect HttpClient calls via the Azure SDK.

Which led me to this research project to figure out the DNS landscape in the industry and what our competition does.

That outage resulted in thread pool starvation. Other platforms run into different problems but they try their hardest to make it only affect this subsystem.

The reason the cache idea came to mind is because in this outage, they kept making calls to the same hostname. I also assume that the number of unique host names in typical apps like this is small (< 20). That seemed like a simple idea to throttle new work.

geoffkizer commented 3 years ago

Other languages/runtimes (specifically Golang and Node) provide their own DNS client implementation that does real async. I think this is something we should consider doing too, but (a) it's a fair amount of work and (b) it has its own tradeoffs (which is why Golang allows you to choose whether you use OS DNS resolution or their own DNS client).

So in the short term, I think the relevant question is: can we improve things here while still relying on OS DNS resolution, which is synchronous on Linux?

I want thread pool starvation not to occur because dns calls are taking long to resolve.

+1. My first hope here is that some of the general thread pool improvements we have talked about for 6 would help here. That seems much preferable to a DNS-specific solution, if it works. In fact I'd think that this scenario should be an explicit target for those improvements.

If that can't happen, maybe we should discuss stuff like optimizing per hostname... but I really, really hope that our thread pool improvements here would make this better.

davidfowl commented 3 years ago

Other languages/runtimes (specifically Golang and Node) provide their own DNS client implementation that does real async. I think this is something we should consider doing too, but (a) it's a fair amount of work and (b) it has it's own tradeoffs (which is why Golang allows you to choose whether you use OS DNS resolution or their own DNS client).

I have lots of information here about what other platforms do and what knobs they have (erlang being the most novel so far šŸ˜„ ). I think we all agreed on building a managed DNS client (which makes me very happy).

So in the short term, I think the relevant question is: can we improve things here while still relying on OS DNS resolution, which is synchronous on Linux?

+1. My first hope here is that some of the general thread pool improvements we have talked about for 6 would help here. That seems much preferable to a DNS-specific solution, if it works. In fact I'd think that this scenario should be an explicit target for those improvements.

I'm 100% behind this and we can even use DNS as a test case for any improvement we make to thread injection.

If that can't happen, maybe we should discuss stuff like optimizing per hostname... but I really, really hope that our thread pool improvements here would make this better.

I agree with you, but what I like about what I've learned about our competitors is that they did try to "do the best they can" to make sure the OS blocking code paths aren't terrible. That's how I want to believe we feel as well, and why I care that we do something in this space.

geoffkizer commented 3 years ago

erlang being the most novel so far

You can't just post that and then not give details :)

davidfowl commented 3 years ago

First, Erlang (or the BEAM VM) has two implementations: one built in Erlang that scales, and one that uses the OS APIs. For the OS-specific implementation, Erlang spawns another process and communicates with it via named pipes on Windows and domain sockets on *nix. This process is responsible for doing the DNS query and sending the result back to the Erlang process. This out-of-process command has a default timeout to stop it from running too long; Erlang will kill the process and return a failure to the caller after some time. It does this to make sure that blocking threads don't make the Erlang process unhealthy.

stephentoub commented 3 years ago

My first hope here is that some of the general thread pool improvements we have talked about for 6 would help here.

I'm a little skeptical it would have the desired impact, though it does depend heavily on the details of the scenario. Let's say we made the pool aware of this blocking and increased how aggressively it injected threads. In the extreme, I issue 10K concurrent Dns calls for the same host. The fact that those operations are related isn't known to the pool or its heuristic, nor does it understand a relationship between work items when they're queued. The most knowledge about what's being done here is in the Dns client itself: it knows it's doing something naughty, and it understands how the work items it's created relate, so it is best positioned to compensate. Just making the pool more aggressive could actually do the very wrong thing here (unbounded thread growth that topples the system while not actually unblocking other unrelated work). And new APIs that provided information to the pool would still require Dns to use them, at which point I suspect we'd see little difference over centralized support vs just limiting Dns' allowed concurrency via a semaphore or concurrency-limited scheduler. I'm obviously speculating and could be wrong, so I'm not saying we shouldn't explore it, just that if we believe we need to address Dns specifically (because it's a known entity scheduling async-over-sync IO work used on mainstream code paths and without better recommended alternatives) and care about large bursts of activity, we shouldn't count on it.
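
As an illustration of the "limiting Dns' allowed concurrency via a semaphore" option, a minimal sketch as a wrapper over the public API (the wrapper type and the limit of 16 are arbitrary assumptions, not a proposed API or a recommended value):

using System.Net;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical throttle: cap how many async-over-sync DNS lookups can occupy
// thread pool threads at once; excess callers wait asynchronously.
static class ThrottledDns
{
    private static readonly SemaphoreSlim s_throttle = new(initialCount: 16, maxCount: 16);

    public static async Task<IPAddress[]> GetHostAddressesAsync(
        string host, CancellationToken cancellationToken = default)
    {
        // Waiting here does not block a thread; only the lookup itself does.
        await s_throttle.WaitAsync(cancellationToken);
        try
        {
            return await Dns.GetHostAddressesAsync(host);
        }
        finally
        {
            s_throttle.Release();
        }
    }
}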

The code in the app that started to fail was indirect HttpClient calls via the azure sdk.

This gets to the heart of my "what problem are we trying to solve" question. Do we expect the 99% case to mitigate is trying to resolve just one / a handful of hosts, as would be the case in that example (Azure's host name(s))? If yes, then serializing/coalescing to a single blocked thread per hostname is going to give good results (on top of that, if the thread pool could detect or be told about non-managed blocking/IO, it could compensate for the one-ish additional threads). But if the solution needs to account for many distinct addresses being resolved concurrently, this isn't going to help. At that point you then also need to throttle overall thread usage by the Dns client. So, which scenario are we trying to fix? The first, the second, both? Both solutions are simple to implement, and if the main thing we care about is just a few concurrent addresses (presumably that's the majority case for microservices in general as well?), we can just add the serialization-per-host and be done with it for now. If we need to throttle overall, we'll need to figure out the right limit, as the wrong one could negatively impact an otherwise well-functioning app.

P.S. Geoff pointed out that we shouldn't coalesce requests as that would disrupt the OS round-robin of results. But we can instead serialize requests asynchronously per hostname. This does measurably regress issuing many requests in parallel (as you're going from N overlapping operations to 1 at a time), but hopefully at a scale that doesn't matter in practice.
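
A rough sketch of that per-hostname serialization, again as a hypothetical wrapper over the public API (a real implementation would live inside Dns itself and would also prune completed tail entries):

using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;

// Hypothetical per-hostname serialization: each lookup for a given name waits
// for the previous lookup of that name to finish, but every caller still makes
// its own getaddrinfo call, so OS round-robin of results is preserved.
static class SerializedDns
{
    private static readonly object s_gate = new();
    private static readonly Dictionary<string, Task> s_tails =
        new(StringComparer.OrdinalIgnoreCase);

    public static Task<IPAddress[]> GetHostAddressesAsync(string host)
    {
        Task<IPAddress[]> next;
        lock (s_gate)
        {
            s_tails.TryGetValue(host, out var tail);
            next = LookupAfterAsync(tail ?? Task.CompletedTask, host);
            s_tails[host] = next; // this request becomes the new tail
        }
        return next;
    }

    private static async Task<IPAddress[]> LookupAfterAsync(Task previous, string host)
    {
        // Wait for the prior lookup of this host, ignoring whether it failed.
        await previous.ContinueWith(static _ => { }, TaskScheduler.Default);
        return await Dns.GetHostAddressesAsync(host);
    }
}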

davidfowl commented 3 years ago

My hypothesis is that you will solve most starvation cases by handling the first case. I would posit that most applications need to make < 10 concurrent dns requests to unique host names so throttling per host name makes the most sense to mitigate the most common cases.

Doing only the latter wouldn't make sense to me, but I could see it in addition to the first. There would be a setting on the Dns class that limited the number of unique host name requests that could be in flight at the same time (I'm getting ServicePointManager vibes).

stephentoub commented 3 years ago

I would posit that most applications need to make < 10 concurrent dns requests to unique host names so throttling per host name makes the most sense to mitigate the most common cases.

That is my opinion as well. How do we validate that hypothesis?

There would be a setting on the Dns class that limited the number of unique host name requests that could be in flight at the same time (I'm getting ServicePointManager vibes)

I don't want to add API for that. Any such mitigation would be temporary, only intended for platforms where we don't yet have an async I/O implementation, but we'd end up having to apply it to all implementations and support it forevermore. If we really needed our default to be overridable in case something goes awry, we could respect a temporary env var that gets removed in the future. But, unless concrete evidence dictated it was really critical, I'd rather not add such global throttling at all.

davidfowl commented 3 years ago

That is my opinion as well. How do we validate that hypothesis?

Do a survey? Ask a couple customers if they have this in their telemetry over the course of a single app instance run?

Seems like doing this would only change the second point about limiting the number of overall concurrent requests, not the per name queuing.

If we really needed our default to be overridable in case something goes awry, we could respect a temporary env var that gets removed in the future.

That sounds even better.

davidfowl commented 3 years ago

Actually, thinking about it more, I don't think anyone would know how many they need concurrently based on telemetry. I think they would need to know based on what the app does and how it issues requests. Bear in mind we're only talking about concurrent DNS requests, not even the HTTP requests themselves.

We should also add an event source event that tracks the number of pending unique dns requests and concurrent dns requests overall.

stephentoub commented 3 years ago

This is what serialization per host/type might look like: https://github.com/stephentoub/runtime/commit/107d09ed178197d88da75f004c72df04d6b1d2dd

Same benchmarks as earlier:

| Method            | Toolchain           | Mean      | Ratio | Allocated |
|-------------------|---------------------|-----------|-------|-----------|
| DnsSerial         | \master\corerun.exe | 186.77 ms | 1.00  | 141 KB    |
| DnsSerial         | \pr\corerun.exe     | 189.54 ms | 1.02  | 289 KB    |
| DnsConcurrent     | \master\corerun.exe | 43.38 ms  | 1.00  | 210 KB    |
| DnsConcurrent     | \pr\corerun.exe     | 179.33 ms | 4.13  | 436 KB    |
| DnsUniqueParallel | \master\corerun.exe | 201.55 ms | 1.00  | 658 KB    |
| DnsUniqueParallel | \pr\corerun.exe     | 209.52 ms | 1.04  | 1,252 KB  |

We should also add an event source event that tracks the number of pending unique dns requests and concurrent dns requests overall.

You mean counters, right? Seems reasonable. We don't have any counters yet for System.Net.NameResolution, do we? I think we'd also just want overall DNS request count. I assume "concurrent dns requests overall" just means how many requests are currently in flight at that moment, regardless of hostname. Probably worth a separate issue for such telemetry. https://github.com/dotnet/runtime/issues/48885
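
For illustration, a sketch of what such a counter could look like; the event source and counter names below are made up and are not the telemetry System.Net.NameResolution actually exposes:

using System.Diagnostics.Tracing;
using System.Threading;

// Hypothetical counters, not the actual System.Net.NameResolution telemetry:
// tracks how many DNS lookups are currently in flight.
[EventSource(Name = "Demo-Dns-Counters")]
sealed class DnsCountersEventSource : EventSource
{
    public static readonly DnsCountersEventSource Log = new();

    private long _currentLookups;
    private PollingCounter _currentLookupsCounter;

    protected override void OnEventCommand(EventCommandEventArgs command)
    {
        if (command.Command == EventCommand.Enable)
        {
            // Created lazily when a listener (e.g. dotnet-counters) enables the source.
            _currentLookupsCounter ??= new PollingCounter(
                "current-dns-lookups", this,
                () => (double)Volatile.Read(ref _currentLookups))
            {
                DisplayName = "Current DNS Lookups"
            };
        }
    }

    public void LookupStarted() => Interlocked.Increment(ref _currentLookups);
    public void LookupCompleted() => Interlocked.Decrement(ref _currentLookups);
}

The resolution code would then call Log.LookupStarted() and Log.LookupCompleted() around each lookup.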

geoffkizer commented 3 years ago

The most knowledge about what's being done here is in the Dns client itself: it knows it's doing something naughty, and it understands how work items it's created relate, so it is best positioned to compensate.

Yeah, you convinced me that this is a reasonable thing to do. In particular, in this case we can know we are issuing requests that are strongly related in their blocking characteristics -- if the first blocks, then every subsequent one will block too until the underlying DNS query is completed.

That said, this doesn't do anything to solve the general problem. DNS requests to different hostnames can still block, and this can lead to starvation. And I doubt that it would make sense to serialize all DNS requests, since requests to different hostnames are basically unrelated.

Just making the pool more aggressive could actually do the very wrong thing here (unbounded thread growth that topples the system while not actually unblocking other unrelated work).

We can tune the heuristics however we want to avoid unbounded thread growth.

To me, the interesting question here is: could the thread pool do better at avoiding starvation if it knew that some threads were blocked, by being somewhat more aggressive about spinning up more threads? I think the answer here is quite likely yes.

Having info about thread blocking could have other benefits, too. Like, if it were comprehensive enough, it might allow us to reduce the number of threads executing when it is pretty sure there is no blocking happening, and thus improve overall performance.

And new APIs that provided information to the pool would still require Dns to use them, at which point I suspect we'd see little difference over centralized support vs just limiting Dns' allowed concurrency via a semaphore or concurrency-limited scheduler

The difference is that other components can use that same API. A DNS specific solution only helps if DNS is the sole problem. I suspect that's not at all the case; often it's a different component entirely, or it's user code, or it's a mix of a lot of things that add up.

davidfowl commented 3 years ago

To me, the interesting question here is: could the thread pool do better at avoiding starvation if it knew that some threads were blocked, by being somewhat more aggressive about spinning up more threads? I think the answer here is quite likely yes.

I think yes, for async over sync, we should run these on a different set of threads.

geoffkizer commented 3 years ago

we should run these on a different set of threads.

Why different threads? Threads are threads. What I think matters is whether any given thread is blocking at the moment, or not.

davidfowl commented 3 years ago

Why different threads? Threads are threads. What I think matters is whether any given thread is blocking at the moment, or not.

Treating all work as homogeneous is why the thread pool behaves like it does today. Work segregation is a very common way to isolate workloads of different types.

While it doesn't need to be a different physical OS thread, it needs to be treated differently and there's a non trivial cost to adding a thread pool thread (work stealing...).

geoffkizer commented 3 years ago

Treating all work as homogeneous is why the thread pool behaves like it does today.

+1

there's a non trivial cost to adding a thread pool thread (work stealing...)

Yeah, good point; I'm not sure how work stealing fits in here.

I have heard that there is some consideration being given to make work stealing queues per-CPU instead of per-thread. @jkotas @kouvel is that correct?

stephentoub commented 3 years ago

We've experimented with it in the past. @VSadov had a prototype.

stephentoub commented 3 years ago

and there's a non trivial cost to adding a thread pool thread

There's also a cost to competing pools.

geoffkizer commented 3 years ago

Work segregation is a very common way to isolate workloads of different types.

Work segregation requires code to make an up-front decision about whether to queue a new Task to threadpool X (e.g. the regular threadpool) vs threadpool Y (e.g. the threadpool for stuff that may block) vs threadpool Z (whatever).

This would work fine in the specific example here (async DNS) because we know we're about to invoke a (potentially) blocking call.

But I've seen lots of code that ends up blocking somewhere deep in a call stack, and callers aren't even aware this is happening. In this case, the code is executing on the regular threadpool and it's not obvious how it would be moved to a separate threadpool with different execution semantics.

That's why I think the solution here has to address arbitrary blocking code on the regular thread pool.

davidfowl commented 3 years ago

The first part is absolutely right. This case is async over sync, where we want to throttle the work so it doesn't burn the rest of the thread pool. The other scenario, sync over async, usually wants to inject more threads in the hope that things resolve themselves.

Though, obviously there are cases where injecting more threads can be worse.

davidfowl commented 3 years ago

I think that's why, if you look at our competition, they usually do something similar (to the throttle) when calling getaddrinfo. They don't inject more threads to handle the load.

geoffkizer commented 3 years ago

I think that's why, if you look at our competition, they usually do something similar (to the throttle) when calling getaddrinfo. They don't inject more threads to handle the load.

Do they do this for sync p/invokes in general?

geoffkizer commented 3 years ago

if you look at our competition

BTW, I assume you mean Golang and Node.js here; if you mean something else let me know.

The thing that these frameworks do that's very different from us is that they have very hard guarantees that work on the "regular threadpool" will never block.

davidfowl commented 3 years ago

Do they do this for sync p/invokes in general?

They handle this case differently to the general case of blocking.

BTW, I assume you mean Golang and Node.js here; if you mean something else let me know.

The thing that these frameworks do that's very different from us is that they have very hard guarantees that work on the "regular threadpool" will never block.

Yes, they have different principles, that's right, but they still need to handle the general case of blocking. They just really go out of their way to avoid it happening (for good reason), especially when they know there's no way around it (like calling getaddrinfo).

geoffkizer commented 3 years ago

They handle this case differently to the general case of blocking.

As I said above, I'm happy to throttle/synchronize/whatever specific kinds of blocking calls when it makes sense because we have knowledge of what the calls are doing and how they are likely to block, etc. The per-hostname serialization here is a good example of it.

But I don't think this addresses the general problem.