dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Performance bottleneck while opening many http connections on GetHttpConnectionAsync #27153

Closed aviviadi closed 4 years ago

aviviadi commented 6 years ago

at https://github.com/dotnet/corefx/blob/master/src/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs#L229

When opening many connections (e.g. to an ASP.NET Core server), the above lock degrades total request-processing throughput, as each request has to wait for the previous one to finish GetHttpConnectionAsync.

Repro: a simple ASP.NET Core Kestrel server, wrk -c <many>; dotTrace will show heavy locking time in System.Net.Http.HttpConnectionPool.GetConnectionAsync(HttpRequestMessage, CancellationToken), see https://1drv.ms/u/s!Am06xdRFNnQEiH1smgyjT2Qmir8R

davidsh commented 6 years ago

cc: @geoffkizer

geoffkizer commented 6 years ago

A couple questions:

What sort of hardware are you running this on? What's the client app here that's calling into HttpClient?

It's definitely possible that the connection pool lock in GetConnectionAsync could be causing contention. We haven't really seen this in the perf testing we've done, but it's always been something we've been keeping an eye on.

aviviadi commented 6 years ago
  1. Processor Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz, 4201 Mhz, 4 Core(s), 8 Logical Processor(s)
  2. The client is a webHost (using WebHostBuilder) as follows:
var config = new ConfigurationBuilder()
    .AddJsonFile("hosting.json", optional: true)
    .AddEnvironmentVariables(prefix: "ASPNETCORE_")
    .AddCommandLine(args)
    .Build();

var webHostBuilder = new WebHostBuilder()
    .UseContentRoot(Directory.GetCurrentDirectory())
    .UseConfiguration(config)
    .UseStartup<Startup>()
    .ConfigureServices(services => services
        .AddSingleton(new ConsoleArgs(args))
        .AddSingleton<IScenariosConfiguration, ConsoleHostScenariosConfiguration>()
        .AddSingleton<Scenarios>()
    )
    .UseDefaultServiceProvider(
        (context, options) => options.ValidateScopes = context.HostingEnvironment.IsDevelopment())
    .UseKestrel()
    .UseUrls("http://0.0.0.0:5000");

var threadCount = GetThreadCount(config);

webHostBuilder.UseSockets(x =>
{
    if (threadCount > 0)
    {
        x.IOQueueCount = threadCount;
    }

    Console.WriteLine($"Using Sockets with {x.IOQueueCount} threads");
});

var webHost = webHostBuilder.Build();

Console.WriteLine($"Server GC is currently {(GCSettings.IsServerGC ? "ENABLED" : "DISABLED")}");

webHost.Run();

With my middleware of course.

ayende commented 6 years ago

The scenario we are testing is a benchmark of the client application hosted in Kestrel talking to another server. The "problem" is that we are very efficient on the server side, so request durations are short (the server we hit can handle upward of 1 million req/sec). On the other hand, the HttpClient usage (we use a single HttpClient instance) means that we spend over 7% of our time just waiting for the lock, because we have concurrent requests trying to get a connection all the time.

aviviadi commented 6 years ago

[profiler screenshot]

geoffkizer commented 6 years ago

Thanks, that's super helpful.

@aviviadi Could you expand GetConnectionAsync in your trace? I'm curious what shows up underneath it.

ayende commented 6 years ago

Pretty much nothing; most of the time is spent inside the method itself. It looks like it spends all of its time in the lock.

geoffkizer commented 6 years ago

Do you have a link to the actual source code (i.e. code invoking HttpClient)?

geoffkizer commented 6 years ago

I'm curious whether you also see a lot of time in HttpConnectionPool.ReturnConnection. Assuming that lock contention is the root issue here, I'd expect this routine to suffer from contention as well.

ayende commented 6 years ago

I was able to create a small reproduction for this.

I'm running this command: ./wrk -c 1024 -t 32 http://localhost:5000/hop

And here is the code.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

namespace WebApplication1
{
    public class Program
    {
        public static void Main(string[] args)
        {
            new WebHostBuilder()
                .UseKestrel(options =>
                {
                    options.ListenAnyIP(5000);
                })
                .UseStartup<Startup>()
                .Build()
                .Run();
        }
    }

    public class Startup
    {
        // This method gets called by the runtime. Use this method to add services to the container.
        // For more information on how to configure your application, visit https://go.microsoft.com/fwlink/?LinkID=398940
        public void ConfigureServices(IServiceCollection services)
        {
        }
        HttpClient httpClient = new HttpClient();

        // This method gets called by the runtime. Use this method to configure the HTTP request pipeline.
        public void Configure(IApplicationBuilder app, IHostingEnvironment env)
        {
            app.Run(ProcessRequest);
        }

        private async Task ProcessRequest(HttpContext context)
        {
            var path = context.Request.Path;
            switch (path)
            {
                case "/hop":
                    using (var result = await httpClient.GetAsync("http://localhost:5000/direct"))
                    using (var stream = await result.Content.ReadAsStreamAsync())
                    {
                        await stream.CopyToAsync(context.Response.Body);
                        await context.Response.WriteAsync(" from hop");
                    }
                    break;
                case "/direct":
                    await context.Response.WriteAsync("Hello World!");
                    break;
                default:
                    await context.Response.WriteAsync("Hello World! " + path);
                    break;
            }
        }
    }
}
ayende commented 6 years ago

Here are the profiler results: [profiler screenshot]

As you can see, in this demo, over 75% of the time is spent in System.Net.Http.HttpConnectionPool.GetConnectionAsync(HttpRequestMessage, CancellationToken)

I'm not seeing ReturnConnection in the profiler output at all.

geoffkizer commented 6 years ago

Thanks for the repro, that's super helpful. That should allow me to reproduce something similar.

geoffkizer commented 6 years ago

I put together a simple command line app that does something similar to the code above. It spawns 1000 tasks to do HttpClient.GetAsync in a loop, and it talks to a simple socket-based server over loopback.
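The harness itself isn't pasted inline here (the actual code is linked at the end of the thread), but a minimal self-contained sketch of what it describes might look like the following. The endpoint, task count, and iteration count are assumptions, and a tiny in-process HttpListener stands in for the raw socket-based server so the sketch runs on its own:

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class LoadGen
{
    // One shared HttpClient: all requests funnel through a single connection pool.
    static readonly HttpClient Client = new HttpClient();

    static async Task Main()
    {
        // Simplified loopback server (the real test used a raw socket server).
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:5001/");
        listener.Start();
        var serverTask = Task.Run(async () =>
        {
            while (listener.IsListening)
            {
                var ctx = await listener.GetContextAsync();
                var payload = System.Text.Encoding.UTF8.GetBytes("Hello World!");
                ctx.Response.ContentLength64 = payload.Length;
                await ctx.Response.OutputStream.WriteAsync(payload, 0, payload.Length);
                ctx.Response.Close();
            }
        });

        // Spawn tasks that each call GetAsync in a loop; every iteration
        // contends on the pool lock in GetConnectionAsync. (The real test
        // used 1000 tasks; scaled down here to keep the sketch quick.)
        var tasks = Enumerable.Range(0, 100).Select(async _ =>
        {
            for (int i = 0; i < 10; i++)
            {
                using (var resp = await Client.GetAsync("http://localhost:5001/"))
                    resp.EnsureSuccessStatusCode();
            }
        });
        await Task.WhenAll(tasks);
        listener.Stop();
        Console.WriteLine("done");
    }
}
```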

I see 14.5% of CPU time in GetConnectionAsync. There is definitely some contention happening in this scenario; 3.1% of time is spent in kernelbase!WaitForMultipleObjectsEx.

I also see 7.8% of CPU in ReturnConnection.

The main cost in GetConnectionAsync (aside from locking overhead) is EnsureReadAheadAndPollRead, at 6.9%. This is doing a kernel call to issue the read. We should avoid doing this underneath the lock.

Looking at the GetConnectionAsync code, we will also initiate a new connection under the lock, which will also do a kernel call. This is much less common, since it only happens when we don't have any idle connections in the pool. Regardless, we should probably avoid doing this under the lock as well.

The ReturnConnection CPU time is entirely in acquiring the lock. When a connection limit is set, we will also do EnsureReadAheadAndPollRead in ReturnConnection, which we should avoid if possible. But since there's no connection limit here, this doesn't happen.
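To make the fix direction concrete, here is a hedged sketch (hypothetical types, not the actual HttpConnectionPool code) of the general shape: hold the lock only long enough to take a candidate connection, and do the liveness check — the stand-in for EnsureReadAheadAndPollRead's kernel call — outside the lock, retrying if the candidate turns out to be dead:

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a pooled connection that may have gone stale.
public sealed class FakeConnection
{
    public bool IsAlive { get; set; } = true;
}

public sealed class SketchPool
{
    private readonly object _lock = new object();
    private readonly Stack<FakeConnection> _idle = new Stack<FakeConnection>();

    public void Return(FakeConnection c) { lock (_lock) _idle.Push(c); }

    public FakeConnection Get()
    {
        while (true)
        {
            FakeConnection conn = null;
            lock (_lock)            // hold the lock only to take a candidate
            {
                if (_idle.Count > 0) conn = _idle.Pop();
            }
            if (conn == null)
                return new FakeConnection(); // no idle connection: open a new one, outside the lock
            if (conn.IsAlive)
                return conn;        // the liveness "poll" happens here, outside the lock
            // Dead connection: drop it and loop to try another.
        }
    }
}
```

The point is that the kernel call no longer serializes all callers; only the cheap dequeue is under the lock.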

geoffkizer commented 6 years ago

cc @stephentoub FYI

geoffkizer commented 6 years ago

@ayende My profiler results are quite different from yours. Based on the fact that you seem to have call counts, I'm assuming you're using some sort of tracing profiler rather than a sampling profiler. That also probably means the times reported are elapsed time rather than CPU time, which would explain why you see 75% of time in GetConnectionAsync while I see 14.5%.

Generally speaking, sampling profilers are much more useful for understanding throughput issues than tracing profilers.

ayende commented 6 years ago

@geoffkizer Yes, I used tracing profiling for this. I typically care about algorithmic complexity, so that is my default setting.

However, I tested this with sample profiling and got similar results, see:

[profiler screenshots]

The command used was ./wrk -c 1024 -t 8 http://localhost:5000/hop with the same code as before.

To recreate it, you can create a .NET Core console app, add the Kestrel reference and run it.

ayende commented 6 years ago

My machine has 20 cores, which might be a factor?

karelz commented 6 years ago

@ayende a side question: What is the impact you see in a real-world scenario (not just a micro benchmark targeting the problem)? It will help us assess the priority of this perf problem for general .NET Core users. Thanks!

ayende commented 6 years ago

We run into this when doing load test of our client API. In RavenDB client API, we use HttpClient to make REST calls to RavenDB's server.

The RavenDB server can do upward of a million requests per second, but when we load tested an application using RavenDB through the client API, we were barely able to hit 10,000 requests a second.

This has real-world impact in the sense that it drastically limits the scalability that we can provide our users.

Given that HttpClient is also the proper way to handle any HTTP interaction in .NET Core, this has wide-ranging implications. It means that a microservice architecture is inherently limited, for example.

This is a hard performance blocker for us.

stephentoub commented 6 years ago

This is a hard performance blocker for us.

What does this mean? You're saying it's a regression that's blocking you from upgrading?

As you can see, in this demo, over 75% of the time is spent in ... means that we spend over 7% of our time just waiting for the lock... using RavenDB through the client API we were barely able to hit 10,000 request a second...

I'm not sure how to interpret this. Your comments seem to suggest that fixing this issue would let you go from 10,000 rps to 1,000,000 rps, but that's certainly not going to be anywhere near true if fixing this would eliminate "only" 75% of the time (and obviously it's unlikely that would drop entirely to 0) or 7% of the time (I'm not sure how to rationalize your 75% and 7% comments).

We should avoid doing this underneath the lock.

At one point I had a branch that completely eliminated the lock on the fast path when getting a connection that already existed, no max connection count, etc. (the lock was still needed when returning). I never submitted a PR because there wasn't a demonstrated need, and it looks like I deleted the branch. I can try to recreate it when I'm back.
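The idea described above might look roughly like the following: a simplified sketch with a hypothetical PooledConnection type, not the actual SocketsHttpHandler code. Idle connections sit in a ConcurrentQueue so the common "grab an existing connection" case needs no lock; the real fast path would also have to handle MaxConnectionsPerServer and connection health, which is where locking creeps back in:

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for a pooled HTTP/1.1 connection.
public sealed class PooledConnection
{
    public int Id { get; }
    public PooledConnection(int id) => Id = id;
}

public sealed class LockFreePool
{
    // TryDequeue/Enqueue need no user-visible lock, so renting an
    // existing connection doesn't serialize concurrent callers.
    private readonly ConcurrentQueue<PooledConnection> _idle =
        new ConcurrentQueue<PooledConnection>();
    private int _nextId;

    public PooledConnection Rent()
    {
        // Fast path: an idle connection is available.
        if (_idle.TryDequeue(out var conn))
            return conn;

        // Slow path: open a new connection. (The real pool would enforce
        // a max connection count here, which requires coordination.)
        return new PooledConnection(Interlocked.Increment(ref _nextId));
    }

    public void Return(PooledConnection conn) => _idle.Enqueue(conn);
}
```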

ayende commented 6 years ago

@stephentoub It is a blocker because we want to get the client to fully utilize the server. Right now, we can't get anywhere close. To give more context, we are trying to get TechEmpower benchmark results with the hope of getting to the top ranks. This is a stumbling block along that path.

The 75%–85% vs. 7% difference is because of very different scenarios. The 75%–85% figure is from the small repro that I posted above.

The 7% figure is from testing the full client, where we have other inefficiencies (in our code) that account for the rest; we are working to fix those.

karelz commented 6 years ago

@ayende what is the % impact on your real-world scenario? That is the key information we need. Clearly it is NOT the entire 10K vs. 1M RPS.

ayende commented 6 years ago

I expect it to be about 20% of the total time once the current set of optimizations is done. I will be able to report exact metrics next week.

karelz commented 6 years ago

Great, thanks! That will be helpful for further porting discussions (once we have a fix)!

geoffkizer commented 6 years ago

@ayende

My machine have 20 cores, which might be a factor?

Yes, that's a factor. I'm running on an 8 proc machine.

Still, I don't understand why you're not seeing any time in ReturnConnection. If the lock is contended (and it appears that it is), then you should see comparable contention in both places.

Also, if there's an option in your profiler to show native frames as well, that would be super useful. That would allow us to see the exact amount of time spent in the locking routine.

geoffkizer commented 6 years ago

@stephentoub

At one point I had a branch that completely eliminated the lock on the fast path when getting a connection that already existed, no max connection count, etc. (the lock was still needed when returning). I never submitted a PR because there wasn't a demonstrated need, and it looks like I deleted the branch. I can try to recreate it when I'm back.

I think we can get away with a simpler change here. The main problem seems to be EnsureReadAheadAndPollRead. We can just move it from GetConnectionAsync (where it runs under the lock) to SendWithRetryAsync. If it fails, we'll just call GetConnectionAsync again and acquire the lock again.

I think there's more we could do here to minimize lock contention entirely, but we can always revisit that later.
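As a rough illustration of that retry shape (simplified, hypothetical names — not the actual SocketsHttpHandler code): skip the poll when handing out a connection, attempt the send, and if a reused connection turns out to be dead, drop it and retry with another one:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a pooled connection that may have gone stale.
public sealed class StubConnection
{
    public bool IsAlive { get; set; } = true;

    public string Send(string request)
    {
        if (!IsAlive) throw new InvalidOperationException("connection closed by peer");
        return "response to " + request;
    }
}

public sealed class RetryingSender
{
    private readonly Queue<StubConnection> _pool = new Queue<StubConnection>();

    public void Return(StubConnection c) => _pool.Enqueue(c);

    // GetConnection no longer polls for liveness; a dead connection
    // surfaces later as a send failure instead.
    private StubConnection GetConnection() =>
        _pool.Count > 0 ? _pool.Dequeue() : new StubConnection();

    public string SendWithRetry(string request)
    {
        while (true)
        {
            var conn = GetConnection();
            try
            {
                return conn.Send(request); // may fail if the pooled connection was stale
            }
            catch (InvalidOperationException)
            {
                // Stale connection: discard it and retry with a fresh one.
            }
        }
    }
}
```

This keeps the expensive check off the common path entirely; the rare stale connection costs one extra round through GetConnection.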

geoffkizer commented 6 years ago

The test code I used above is here: https://github.com/geoffkizer/teststuff/tree/master/poollock