dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net
MIT License
35.44k stars 10.01k forks

Kestrel SSL/UseHttps Performance Issue #7081

Closed allevyMS closed 5 years ago

allevyMS commented 5 years ago

Describe the bug

I have been investigating throughput issues with our ASP.NET Core 2.2 web app and found that using SSL with Kestrel is a major performance bottleneck.

To Reproduce

I have created the following minimal app:

Program.cs

using System.Net;
using System.Security.Cryptography.X509Certificates;
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;

namespace aspnetcoredemo
{
    public class Program
    {
        public static void Main(string[] args)
        {
            CreateWebHostBuilder(args).Build().Run();
        }

        public static IWebHostBuilder CreateWebHostBuilder(string[] args) =>
            WebHost.CreateDefaultBuilder(args)
                .UseKestrel(options =>
                {
                    options.Listen(IPAddress.Any, 80);
                    options.Listen(IPAddress.Any, 443, listenOptions =>
                    {
                        listenOptions.UseHttps(new X509Certificate2("mypfxfile.pfx"));
                    });
                })
                .UseStartup<Startup>();
    }
}

Startup.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

namespace aspnetcoredemo
{
    public class Startup
    {
        // This method gets called by the runtime. Use this method to add services to the container.
        // For more information on how to configure your application, visit https://go.microsoft.com/fwlink/?LinkID=398940
        public void ConfigureServices(IServiceCollection services)
        {
        }

        // This method gets called by the runtime. Use this method to configure the HTTP request pipeline.
        public void Configure(IApplicationBuilder app, IHostingEnvironment env)
        {
            if (env.IsDevelopment())
            {
                app.UseDeveloperExceptionPage();
            }

            app.Run(async (context) =>
            {
                await context.Response.WriteAsync("Hello World!");
            });
        }
    }
}

I ran it using the following dotnet version:

Host (useful for support): Version: 2.2.1, Commit: 878dd11e62

ASP.NET Core version: 2.2.1

OS: PRETTY_NAME="Debian GNU/Linux 9 (stretch)" NAME="Debian GNU/Linux" VERSION_ID="9" VERSION="9 (stretch)" ID=Debian

I used Apache Bench to run a load test against both endpoints with 300 requests and concurrency set to 100, like so:

ab -k -n 300 -c 100 http://localhost:80/
ab -k -n 300 -c 100 https://localhost:443/

Results for port 80 without SSL:

Connection Times (ms)
              min  mean [+/-sd] median   max
Connect:        0    51    76.6      3   299
Processing:    98   530   259.2    602   995
Waiting:       98   484   229.9    598   995
Total:        102   580   244.1    602   997

Results for port 443 with SSL:

Connection Times (ms)
              min   mean  [+/-sd] median    max
Connect:        0  10344   8969.0   7400  43487
Processing:     1    815    892.2    702  12901
Waiting:        0    705    517.9    600   2307
Total:        993  11160   8959.1   8601  43787

As you can see, the results are pretty damning: SSL performs far worse than I would expect. These results are consistent with additional tests I have run (external to the host, using various load-testing approaches) and with metrics from our production and dev environments (we tested Kestrel both with and without SSL).

Expected behavior

Better performance using Kestrel with SSL

Additional context

I am running my webapp behind nginx as a reverse proxy but I still require internal SSL encryption. The current performance using Kestrel with SSL is hurting our production environment throughput in a big way.

yanrez commented 5 years ago

@karelz - let us know if this is the wrong place to post this issue; it's unclear whether the cause is in aspnetcore or in the underlying stack.

To clarify the impact: we have a multitenant service deployed to k8s, and we need to handle spikes of HTTP requests (thousands of simultaneous calls). These requests flow through nginx on ingress; nginx does SSL termination, but we also have to encrypt traffic between the ingress and the upstream, so our service has SSL configured in Kestrel. Due to this issue, we can't handle more than 60 requests per pod before nginx starts hitting its 5-second timeout for the SSL handshake with the upstream.

karelz commented 5 years ago

This may be in SslStream, which is owned by my team (in the CoreFX repo). Let's first get some measurements identifying the likely root cause; it may be in the layer that uses SslStream in Kestrel. We've had anecdotal feedback in the past that SslStream is slow, but never with a repro or hard evidence.

What is the perf difference between SSL and non-SSL? (A certain difference is expected.) If it is truly an SslStream bottleneck, then we should be able to remove the ASP.NET layer on the server side and show higher-than-expected overhead in SSL ...

@Eilon who is your perf guru who could help narrowing down the root-cause?

cc @davidfowl @stephentoub @geoffkizer

davidfowl commented 5 years ago

@halter73 @Tratcher

We need to collect a trace so we can narrow down what the problem might be. There's a tool here you can use to collect a trace on Linux and open it in PerfView:

https://github.com/dotnet/diagnostics/tree/master/src/Tools/dotnet-collect

Unfortunately, we don't have builds of this tool flowing yet, so you may have to build it from source.

cc @vancem

Tratcher commented 5 years ago

@sebastienros you've already got metrics for linux https, no?

halter73 commented 5 years ago

We do have metrics:

https://msit.powerbi.com/view?r=eyJrIjoiYTZjMTk3YjEtMzQ3Yi00NTI5LTg5ZDItNmUyMGRlOTkwMGRlIiwidCI6IjcyZjk4OGJmLTg2ZjEtNDFhZi05MWFiLTJkN2NkMDExZGI0NyIsImMiOjV9&pageName=ReportSection5ea9e281d572d9092cc7

https://msit.powerbi.com/view?r=eyJrIjoiYTZjMTk3YjEtMzQ3Yi00NTI5LTg5ZDItNmUyMGRlOTkwMGRlIiwidCI6IjcyZjk4OGJmLTg2ZjEtNDFhZi05MWFiLTJkN2NkMDExZGI0NyIsImMiOjV9&pageName=ReportSection5ea9e281d572d9092cc7&pageName=ReportSection30725cd056a647733762

Our plaintext numbers show about 3.6 million RPS on Linux without TLS and about 2.0 million RPS with TLS/SslStream. For our Json benchmark, the numbers are about 730k RPS without TLS and about 480k RPS with it.
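To put those figures in proportion, here is a quick back-of-the-envelope sketch (it only restates the RPS numbers quoted above):

```python
# Relative throughput cost of enabling TLS, from the RPS figures above.
def tls_overhead(plain_rps, tls_rps):
    """Fraction of throughput lost when TLS is enabled."""
    return 1 - tls_rps / plain_rps

plaintext_cost = tls_overhead(3_600_000, 2_000_000)  # ~0.44, i.e. ~44% slower
json_cost = tls_overhead(730_000, 480_000)           # ~0.34, i.e. ~34% slower
```

Note that these benchmark runs reuse connections, so this is the steady-state cost of encrypted I/O, not the handshake cost this issue is about.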

yanrez commented 5 years ago

Does the TLS/SslStream measurement assume an already-established SSL connection, or does each request include a handshake?

halter73 commented 5 years ago

We use wrk as our benchmark client, which establishes a fixed number of connections up front (we usually use 256 connections AFAIK) and then reuses those connections for all subsequent requests in that run unless there's an error or the server closes the connections.

ab -k should do the same thing, since "-k" enables HTTP keep-alive in ab.

halter73 commented 5 years ago

One big difference between our benchmarking approaches comes from your use of ab's "-n" flag to set the number of requests for the benchmarking session.

We make millions of requests when collecting even a single benchmark result. This means that the cost of the 256 TLS handshakes is heavily amortized over millions of requests in our TLS benchmarks.

Based on my understanding of ab, "ab -k -n 300 -c 100" will perform 100 TLS handshakes but only make 300 requests. That leaves you with only 3 requests per TLS handshake, which I would never expect to perform nearly as well as a similar benchmark without any TLS handshakes.
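That amortization argument can be sketched numerically (the handshake and per-request millisecond costs below are made-up placeholders, purely to illustrate the effect, not measurements):

```python
# Sketch of TLS handshake amortization; the millisecond costs are invented.
def amortized_cost_per_request(requests, connections, handshake_ms, request_ms):
    # One TLS handshake per connection, amortized over all requests made.
    total_ms = connections * handshake_ms + requests * request_ms
    return total_ms / requests

# ab -k -n 300 -c 100: only 3 requests per handshake, handshake dominates.
burst = amortized_cost_per_request(300, 100, handshake_ms=50, request_ms=5)        # ~21.7 ms

# A wrk-style run: 256 connections reused across a million requests.
steady = amortized_cost_per_request(1_000_000, 256, handshake_ms=50, request_ms=5)  # ~5.0 ms
```

With these placeholder costs the burst scenario pays roughly four times more per request, even though the per-request work is identical.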

sebastienros commented 5 years ago

Correct, we have never measured TLS handshakes specifically :/

halter73 commented 5 years ago

we have never measured TLS handshakes specifically :/

It's probably a good time to start measuring this so we can catch regressions. It shouldn't require any changes to the app being benchmarked: using a lua script so wrk sets a Connection: close request header should be sufficient to benchmark the handshake plus the relatively small cost of a single request.
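A minimal wrk script for that (a sketch assuming wrk's standard lua scripting API; the file name is arbitrary) might look like:

```lua
-- close.lua: force the server to close each connection, so every request
-- pays for a fresh TCP connection and TLS handshake
wrk.headers["Connection"] = "close"
```

Invoked along the lines of `wrk -c 256 -d 15s -s close.lua https://host/`.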

yanrez commented 5 years ago

I'll let @allevyMS respond with more details tomorrow. Our case of nginx on ingress with a few pods upstream, receiving a couple thousand calls, looks like it's represented pretty well by "ab -k -n 300 -c 100", since we end up doing a lot of handshakes and don't benefit from warmed-up connections. Could be an interesting gap to close in benchmarks and tracking improvements :)

karelz commented 5 years ago

Is there anything we can actually affect in the handshake? Unless we invoke it multiple times, or do something truly horrible, it is purely OS behavior and performance - the expected SSL overhead. We should compare it to other implementations (non-.NET); that would tell us if we're really behind.

halter73 commented 5 years ago

@yanrez Since you're using nginx for ingress, could you use nginx's keepalive directive with increased keepalive_requests and keepalive_timeout values to pool nginx-to-kestrel TLS connections and reduce the number of handshakes?

https://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive
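A sketch of what that could look like (the upstream name, address, and pool size are placeholders, and the upstream keepalive_requests/keepalive_timeout directives require a sufficiently recent nginx):

```nginx
upstream kestrel {
    server 10.0.0.5:443;   # placeholder upstream (Kestrel pod) address
    keepalive 64;          # idle keepalive connections cached per worker
}

server {
    listen 443 ssl;

    location / {
        proxy_pass https://kestrel;
        proxy_http_version 1.1;          # upstream keepalive requires HTTP/1.1
        proxy_set_header Connection "";  # clear the client's Connection header
    }
}
```

The idea is simply that pooled nginx-to-Kestrel connections pay the TLS handshake once and then get reused.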

allevyMS commented 5 years ago

@halter73 More info about our setup: we have 32 ingress controllers that serve traffic to many other apps in addition to ours. We have 6 app pods. We don't have continuous and sustained traffic to maintain the keepalive connections with a reasonable keepalive timeout.

The reason we started looking into this is that we have a use case where customers set up LogicApps that create 1k-2k concurrent requests in one burst, and we were seeing a 70% failure rate with 502s and 504s.

@karelz I set up Node.js with HTTPS on the same host and ran Apache Bench with the same inputs:

ab -k -n 300 -c 100 https://localhost:8443/

It seems to perform considerably better:

Connection Times (ms)

              min   mean  [+/-sd] median    max
Connect:     1199   2384    471.1   2502   3589
Processing:   594   1111    182.0   1100   1601
Waiting:      296    618    148.0    602    903
Total:       2704   3495    359.1   3614   5190

stephentoub commented 5 years ago

I'm curious; is there any way you could try this with a preview of .NET Core 3.0, making sure that OpenSSL 1.1+ is installed?

yanrez commented 5 years ago

@halter73 - have you had a chance to look into adding this use case to benchmarking? I assume this would give us a clear picture of how it performs across various versions.

halter73 commented 5 years ago

@yanrez I haven't looked into this. @sebastienros Is this something you could do?

sebastienros commented 5 years ago

I did some tests today to get numbers on our dedicated hardware. Using the Plaintext scenario from TechEmpower and setting a Connection: close header on the requests, I found that ASP.NET is faster than Node.js by the same ratio as with reused connections. However, I can also confirm your results: when doing the same over HTTPS, ASP.NET is much slower than Node.js.

I will continue my investigations and check with my colleagues on ways to resolve that.

yanrez commented 5 years ago

Thanks! It matches our observation that unless an SSL handshake is involved, ASP.NET Core is pleasantly fast.

allevyMS commented 5 years ago

Current workaround for us: I added a sidecar container to the pod running our app. The additional container runs nginx, which accepts incoming traffic over HTTPS and funnels it to the app container over HTTP on localhost.

This has improved our throughput and response times. For my test case we are down to sub-second response times.
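For reference, the sidecar idea amounts to an nginx config along these lines (the ports and certificate paths are placeholders for illustration, not our actual values):

```nginx
# nginx sidecar: terminate TLS here, forward plaintext to the app on localhost
server {
    listen 8443 ssl;
    ssl_certificate     /etc/nginx/tls/tls.crt;  # placeholder cert paths
    ssl_certificate_key /etc/nginx/tls/tls.key;

    location / {
        proxy_pass http://127.0.0.1:5000;  # app's plain-HTTP port (placeholder)
        proxy_http_version 1.1;
        proxy_set_header Connection "";    # allow keep-alive to the app
    }
}
```

Since nginx and the app share the pod's network namespace, the hop between them never leaves localhost.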

yanrez commented 5 years ago

Just checking whether you have any updates on benchmarking the issue and solving it? It would be nice to see whether you are able to systematically track the perf of SSL handshakes as .NET Core evolves, as well as have clarity on how things might improve when .NET Core 3 ships. I've noticed a few PRs @stephentoub published/merged that seem to target improving perf in this flow, but it's hard to tell the bigger picture of this effort.

sebastienros commented 5 years ago

I just added scenarios to our benchmarking. It will track Windows, Linux, HTTP, and HTTPS connection creations per second.

stephentoub commented 5 years ago

From what I saw in profiling, the bulk of the impact comes from https://github.com/dotnet/corefx/issues/35086.

analogrelay commented 5 years ago

It sounds like most of the impact here is in corefx and the right people are on the case there. Feel free to correct me if I'm misunderstanding :). I'll close this issue here.

yanrez commented 5 years ago

Is there another issue tracking this problem? I want to make sure it is still tracked

davidfowl commented 5 years ago

@yanrez https://github.com/dotnet/corefx/issues/35086