@karelz - let us know if this is the wrong place to post this issue; it's unclear whether the cause is in aspnetcore or in the underlying stack.
To clarify the impact: we have a multitenant service deployed into k8s, and we need to handle spikes of HTTP requests (thousands of simultaneous calls). These requests flow through nginx on ingress; nginx does SSL termination, but we also have to encrypt traffic between the ingress and the upstream, so our service has SSL configured in Kestrel. Due to this issue, we can't handle more than 60 requests per pod before nginx starts hitting its 5-second timeout for the SSL handshake with the upstream.
This may be in SslStream, which is owned by my team (in the CoreFX repo). Let's first get some measurements identifying the likely root cause; it may also be in the layer using SslStream in Kestrel. We've had anecdotal feedback in the past that SslStream is slow, but never with any repro or hard evidence.
What is the perf difference between the SSL and non-SSL results? (Some difference is expected.) If it is truly an SslStream bottleneck, then we should be able to remove the ASP.NET layer on the server side and show more-than-expected overhead in SSL ...
@Eilon who is your perf guru who could help narrow down the root cause?
cc @davidfowl @stephentoub @geoffkizer
@halter73 @Tratcher
We need to collect a trace so we can narrow down what the problem might be. There's a tool here you can use to collect a trace on Linux and open it up in PerfView:
https://github.com/dotnet/diagnostics/tree/master/src/Tools/dotnet-collect
Unfortunately, we don't have builds of this tool flowing yet, so you may have to build it from source.
cc @vancem
@sebastienros you've already got metrics for linux https, no?
We do have metrics:
Our plaintext numbers show about 3.6 million RPS on Linux without TLS and about 2.0 million RPS on Linux with TLS/SslStream. For our JSON benchmark, the numbers are about 730k RPS vs. 480k RPS without and with TLS, respectively.
Does the TLS/SslStream measurement assume an already-established SSL connection, or does each request include a handshake?
We use wrk as our benchmark client, which establishes a fixed number of connections up front (we usually use 256 connections AFAIK) and then reuses those connections for all subsequent requests in that run unless there's an error or the server closes the connection.
ab -k should do the same thing, since "-k" enables HTTP keep-alive with ab.
One big difference in the way we're benchmarking comes from your use of ab's "-n" flag to set the total number of requests for the benchmarking session.
We make millions of requests when collecting even a single benchmark result. This means the cost of the 256 TLS handshakes is heavily amortized over millions of requests in our TLS benchmarks.
Based on my understanding of ab, "ab -k -n 300 -c 100" will perform 100 TLS handshakes but only make 300 requests. That leaves you with only 3 requests per TLS handshake, which I would never expect to perform nearly as well as a similar benchmark without any TLS handshakes.
Correct, we have never measured TLS handshakes specifically :/
> we have never measured TLS handshakes specifically :/
It's probably a good time to start measuring this so we can catch regressions. It shouldn't require any changes to the app being benchmarked. Using a lua script so wrk sets a Connection: close request header should be sufficient to benchmark the handshake plus the relatively small cost of a single request.
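A minimal wrk script for that would be something like the following (the file name close.lua is just an example; wrk exposes the request headers through its global wrk table):

```lua
-- close.lua: ask the server to close the connection after each response,
-- so every request pays for a fresh TCP connect plus TLS handshake.
wrk.headers["Connection"] = "close"
```

It could then be run with something like wrk -c 256 -d 15s -s close.lua https://server/plaintext (connection count, duration, and URL are placeholders).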
I'll let @allevyMS respond with more details tomorrow. It looks like our case (nginx on ingress, a few pods upstream, and a couple thousand calls arriving in one burst) is represented pretty well by "ab -k -n 300 -c 100", since we end up doing a lot of handshakes and don't benefit from warmed-up connections. Could be an interesting gap to close in benchmarks and in tracking improvements :)
Is there anything we can actually affect in the handshake? Unless we invoke it multiple times, or do something truly horrible, it is purely OS behavior and performance, i.e. the expected SSL overhead. We should compare it to other (non-.NET) implementations; that would tell us whether we're really behind.
@yanrez Since you're using nginx for ingress, could you use nginx's keepalive directive with increased keepalive_requests and keepalive_timeout values to pool nginx-to-kestrel TLS connections and reduce the number of handshakes?
https://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive
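Roughly, the kind of configuration I mean is sketched below (the upstream name, address, and numbers are illustrative only; the upstream keepalive_requests/keepalive_timeout directives need a recent nginx, and in an ingress-nginx deployment these would normally be set through the controller's configuration rather than edited by hand):

```nginx
upstream kestrel_backend {
    server 10.0.0.5:5001;         # example upstream pod running Kestrel with TLS
    keepalive 64;                 # idle upstream connections cached per worker
    keepalive_requests 10000;     # requests allowed over a single upstream connection
    keepalive_timeout 60s;        # how long an idle upstream connection is kept open
}

server {
    location / {
        proxy_pass https://kestrel_backend;
        proxy_http_version 1.1;          # required for upstream keep-alive
        proxy_set_header Connection "";  # don't forward "Connection: close" upstream
    }
}
```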
@halter73 More info about our setup: we have 32 ingress controllers that serve traffic to many other apps in addition to ours, and we have 6 app pods. We don't have continuous, sustained traffic that would keep the keepalive connections warm with any reasonable keepalive timeout.
The reason we started looking into this is that we have a use case where customers set up LogicApps that create 1k-2k concurrent requests in one burst, and we saw a 70% failure rate with 502s and 504s.
@karelz I set up Node.js with HTTPS on the same host and ran Apache Bench with the same inputs: ab -k -n 300 -c 100 https://localhost:8443/
It seems to perform considerably better:
Connection Times (ms)
|  | min | mean | [+/-sd] | median | max |
| --- | --- | --- | --- | --- | --- |
| Connect: | 1199 | 2384 | 471.1 | 2502 | 3589 |
| Processing: | 594 | 1111 | 182.0 | 1100 | 1601 |
| Waiting: | 296 | 618 | 148.0 | 602 | 903 |
| Total: | 2704 | 3495 | 359.1 | 3614 | 5190 |
I'm curious: is there any way you could try this with a preview of .NET Core 3.0, making sure that OpenSSL 1.1+ is installed?
@halter73 - have you had a chance to look into adding this use case to the benchmarks? I assume this would give us a clear picture of how it performs across various versions.
@yanrez I haven't looked into this. @sebastienros Is this something you could do?
I did some tests today to get numbers on our dedicated hardware. Using the Plaintext scenario from TechEmpower and setting a Connection: close header on the requests, I found that ASP.NET is faster than Node.js by about the same ratio as with reused connections. However, I can also confirm your results: when doing the same over HTTPS, ASP.NET is much slower than Node.js.
I will continue my investigations and check with my colleagues on ways to resolve that.
Thanks! That matches our observation that unless an SSL handshake is involved, ASP.NET Core is pleasantly fast.
Current workaround for us: I added a sidecar container to the pod that runs our app. The additional container runs nginx; it accepts incoming traffic over HTTPS and then funnels it to the app container over HTTP on localhost.
This has improved our throughput and response times. For my test case we are down to sub-second response times.
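For anyone wanting to replicate this, a minimal sketch of the sidecar's nginx server block could look like the following; the certificate paths and the app's port are assumptions, not the actual values from our deployment:

```nginx
server {
    listen 443 ssl;                               # TLS is terminated in the sidecar
    ssl_certificate     /etc/nginx/tls/tls.crt;   # example path, e.g. mounted from a k8s secret
    ssl_certificate_key /etc/nginx/tls/tls.key;

    location / {
        proxy_pass http://127.0.0.1:5000;         # plain HTTP to the app container on localhost
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https; # tell the app the original scheme
    }
}
```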
Just checking whether you have any updates on benchmarking the issue and solving it. It would be nice to see whether you are able to systematically track the perf of SSL handshakes as .NET Core evolves, as well as to get clarity on how things might improve when .NET Core 3 ships. I've noticed a few PRs @stephentoub published/merged that seem to target improving perf in this flow, but it's hard to tell the bigger picture of this effort.
I just added scenarios to our benchmarks. They will track HTTP and HTTPS connection creations per second, on both Windows and Linux.
From what I saw in profiling, the bulk of the impact comes from https://github.com/dotnet/corefx/issues/35086.
It sounds like most of the impact here is in corefx and the right people are on the case there. Feel free to correct me if I'm misunderstanding :). I'll close this issue here.
Is there another issue tracking this problem? I want to make sure it is still being tracked.
Describe the bug
I have been investigating throughput issues with our ASP.NET Core 2.2 web app and found that using SSL with Kestrel is a major performance bottleneck.
To Reproduce
I have created the following minimal app:
Program.cs
Startup.cs
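(The original Program.cs and Startup.cs contents are not reproduced here. As a rough, hypothetical reconstruction consistent with the description below, i.e. a trivial endpoint listening on HTTP port 80 and HTTPS port 443, the app could look like this; the certificate file name and password are placeholders:)

```csharp
// Hypothetical minimal app for ASP.NET Core 2.2, not the original attachment.
using System.Net;
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Single trivial endpoint so the benchmark measures connection/TLS cost, not app work.
        app.Run(context => context.Response.WriteAsync("Hello World!"));
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        WebHost.CreateDefaultBuilder(args)
            .UseKestrel(options =>
            {
                options.Listen(IPAddress.Any, 80);                       // plain HTTP endpoint
                options.Listen(IPAddress.Any, 443, listenOptions =>
                    listenOptions.UseHttps("cert.pfx", "certPassword")); // TLS endpoint (SslStream underneath)
            })
            .UseStartup<Startup>()
            .Build()
            .Run();
    }
}
```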
I ran it using the following dotnet version:
Host (useful for support): Version: 2.2.1, Commit: 878dd11e62
ASP.NET Core version: 2.2.1
On OS: PRETTY_NAME="Debian GNU/Linux 9 (stretch)" NAME="Debian GNU/Linux" VERSION_ID="9" VERSION="9 (stretch)" ID=Debian
I used Apache Bench to run a load test on both endpoints, with 300 requests and concurrency set to 100, like so:
ab -k -n 300 -c 100 http://localhost:80/
ab -k -n 300 -c 100 https://localhost:443/
results for port 80 without SSL: Connection Times (ms)
results for port 443 with SSL: Connection Times (ms)
As you can see, the results are pretty damning for SSL, which performs quite a lot worse than I would expect. These results are consistent with additional tests I have run (external to the host and using various load-testing approaches) and with metrics from our production and dev environments (we tested Kestrel both with and without SSL).
Expected behavior
Better performance using Kestrel with SSL
Additional context
I am running my web app behind nginx as a reverse proxy, but I still require internal SSL encryption. The current performance of Kestrel with SSL is hurting our production environment throughput in a big way.