Closed by rianjs 6 years ago.
@sebastienros Yes, it's fully automated; see https://github.com/aspnet/benchmarks/blob/dev/README.md for more details.
@davidfowl I see a huge performance hit (~3x) after switching to Core 2.1, and especially to the managed Sockets transport. The test was performed on the same EC2 machines; I just replaced the Docker images (the code was identical, only the .NET Core and TPL NuGet packages were updated). CPU on the target machine was jumping to 100% (it looks like GC).
ASP.NET Core 2.0 was built from the image microsoft/aspnetcore-build:2.0.5-2.1.4. ASP.NET Core 2.1 was built with microsoft/dotnet:2.1-sdk and run on microsoft/dotnet:2.1.0-aspnetcore-runtime.
Test results:

asp.net core 2.0, run 1:

    ubuntu@ip-10-15-10-229:~$ wrk -t 2 -c 600 -d 1m -s ./firehose.lua --latency http://10.15.10.77
    Running 1m test @ http://10.15.10.77
      2 threads and 600 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   180.00ms   49.99ms 515.80ms   83.41%
        Req/Sec     1.81k     0.97k    3.73k    53.00%
      Latency Distribution
         50%  170.40ms
         75%  187.14ms
         90%  242.98ms
         99%  407.57ms
      200731 requests in 1.00m, 18.38MB read
    Requests/sec:   3344.61
    Transfer/sec:    313.56KB
asp.net core 2.0, run 2:

    ubuntu@ip-10-15-10-229:~$ wrk -t 2 -c 600 -d 1m -s ./firehose.lua --latency http://10.15.10.77
    Running 1m test @ http://10.15.10.77
      2 threads and 600 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   173.56ms   47.42ms 746.84ms   84.59%
        Req/Sec     1.85k     0.98k    3.98k    50.99%
      Latency Distribution
         50%  168.35ms
         75%  185.38ms
         90%  212.65ms
         99%  360.20ms
      206067 requests in 1.00m, 18.87MB read
    Requests/sec:   3428.89
    Transfer/sec:    321.46KB
asp.net core 2.1, run 1:

    ubuntu@ip-10-15-10-229:~$ wrk -t 2 -c 600 -d 1m -s ./firehose.lua --latency http://10.15.10.77
    Running 1m test @ http://10.15.10.77
      2 threads and 600 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   172.35ms   72.90ms 929.68ms   82.63%
        Req/Sec     1.77k     0.94k    4.01k    56.81%
      Latency Distribution
         50%  157.70ms
         75%  193.70ms
         90%  248.64ms
         99%  454.83ms
      81698 requests in 1.00m, 7.48MB read
      Socket errors: connect 0, read 0, write 0, timeout 600
    Requests/sec:   1361.06
    Transfer/sec:    127.60KB
asp.net core 2.1, run 2:

    ubuntu@ip-10-15-10-229:~$ wrk -t 2 -c 600 -d 1m -s ./firehose.lua --latency http://10.15.10.77
    Running 1m test @ http://10.15.10.77
      2 threads and 600 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   161.83ms   67.24ms 952.09ms   87.61%
        Req/Sec     1.84k     1.01k    4.34k    51.37%
      Latency Distribution
         50%  155.43ms
         75%  182.37ms
         90%  212.81ms
         99%  342.32ms
      94846 requests in 1.00m, 8.68MB read
      Socket errors: connect 0, read 0, write 0, timeout 4188
    Requests/sec:   1579.94
    Transfer/sec:    148.12KB
asp.net core 2.1 (Libuv), run 1:

    ubuntu@ip-10-15-10-229:~$ wrk -t 2 -c 600 -d 1m -s ./firehose.lua --latency http://10.15.10.77
    Running 1m test @ http://10.15.10.77
      2 threads and 600 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   193.50ms  191.16ms   1.98s    96.88%
        Req/Sec     1.96k     0.98k    3.78k    47.58%
      Latency Distribution
         50%  161.09ms
         75%  177.99ms
         90%  212.65ms
         99%    1.46s
      211659 requests in 1.00m, 19.38MB read
    Requests/sec:   3525.03
    Transfer/sec:    330.47KB
asp.net core 2.1 (Libuv), run 2:

    ubuntu@ip-10-15-10-229:~$ wrk -t 2 -c 600 -d 1m -s ./firehose.lua --latency http://10.15.10.77
    Running 1m test @ http://10.15.10.77
      2 threads and 600 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   194.79ms  191.99ms   1.96s    96.01%
        Req/Sec     1.95k     0.98k    4.41k    49.72%
      Latency Distribution
         50%  161.56ms
         75%  181.06ms
         90%  223.84ms
         99%    1.42s
      209424 requests in 1.00m, 19.17MB read
    Requests/sec:   3484.71
    Transfer/sec:    326.69KB
For those on this thread, the difference in results seems to come from timeouts on Sockets.
Is there a way we could repro your numbers, for instance by you sharing the source code? Also, what is the purpose of firehose.lua?
I would suggest that for now you use Libuv until we understand what the issue is with Sockets in your scenario.
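For reference, opting back into the Libuv transport in Kestrel 2.1 is a one-line change in the host builder. A minimal sketch, assuming the default `WebHost` setup and an existing `Startup` class:

```csharp
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;

public class Program
{
    public static void Main(string[] args) =>
        WebHost.CreateDefaultBuilder(args)
            // Opt out of the managed Sockets transport (the new default in 2.1)
            // and go back to the Libuv transport that 2.0 used.
            .UseLibuv()
            .UseStartup<Startup>()
            .Build()
            .Run();
}
```

Everything else (endpoints, limits, middleware) stays the same; only the transport underneath Kestrel changes.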
@sebastienros firehose.lua is a Lua script that sends randomly chosen data. How can I share my code with the Microsoft team privately?
Ideally a public repository with a simple repro: something that doesn't require a lot of setup and has a minimum of code, so we can identify the reason more easily. Otherwise, Skype it to me: sebastienros.
As an update, I got the source code that shows the issue.
@sebastienros any indication as to the scenario causing this? We're seeing similar performance degradation with the Sockets server, although even successfully completed requests return more slowly. When switching to Libuv, performance is restored. This is on 2.1.1.
I'm currently debugging a similar issue.
Upgraded my application from Kestrel 2.0.0 to 2.1.1 and have noticed a large increase in CPU usage on my webservers (at least 2x).
I'm currently in the process of specifying UseLibuv() for Kestrel 2.1.1 to see if it alleviates my problem.
@akatz0813 @MatthewLymer I have been able to reproduce this kind of degradation when the application itself uses the thread pool intensively, for instance with the TPL libraries, more specifically Dataflow (https://docs.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library). Switching to Libuv solved the issue. This is probably because these libraries can starve the thread pool and impact Kestrel's Sockets transport.
Are you doing anything like this in your applications?
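To illustrate the kind of pattern meant here — a hypothetical sketch, not code from any of the applications discussed — a Dataflow block doing blocking work with unbounded parallelism will claim as many thread pool threads as the pool grants, competing with Kestrel's Sockets transport, which also runs on the thread pool:

```csharp
using System.Threading;
using System.Threading.Tasks.Dataflow;

class StarvationSketch
{
    static void Main()
    {
        // Each posted item blocks a ThreadPool thread for a full second, and
        // unbounded parallelism lets the block occupy every thread the pool
        // will hand out, leaving little capacity for Kestrel's I/O callbacks.
        var block = new ActionBlock<int>(
            item => Thread.Sleep(1000), // synchronous, blocking work on a pool thread
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded
            });

        for (var i = 0; i < 10_000; i++)
        {
            block.Post(i);
        }

        block.Complete();
        block.Completion.Wait();
    }
}
```

The fix in such cases is usually to bound `MaxDegreeOfParallelism` or make the per-item work genuinely asynchronous, rather than to change transports.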
@sebastienros yes, we have a lot of async/await keywords in our app.
(Chart) Orange: Kestrel 2.1.1 with Sockets. Blue: Kestrel 2.1.1 with Libuv.
The impact was actually much more than I thought.
OS: Ubuntu 16.04. AWS instance: c5.xlarge (4 cores / 8 GB). Runtime: Microsoft.NETCore.App 2.1.2.
I am not talking about async/await, but the TPL. If you are not using the TPL, can you share your application with me? Or part of it that reproduces the issue?
Oh, then in that case I'm not using the TPL.
Would it be possible to give Kestrel its own thread pool to avoid contention with the rest of the application?
As for a reproduction, it would be tough to nail down, as the application is fairly large, though 99% of the app is just a bunch of HttpClient, Socket, and SslStream async calls.
Also, a lot of this we do in parallel with Task.WhenAll; some incoming HTTP requests result in the application processing up to 100 network calls in parallel.
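That fan-out shape (one request triggering up to ~100 parallel outbound calls) looks roughly like this. A schematic sketch only: `FetchAllAsync` and the URL list are hypothetical stand-ins, not names from the application discussed here:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class FanOutSketch
{
    // A single reused client; creating one HttpClient per call can exhaust sockets.
    private static readonly HttpClient Client = new HttpClient();

    // Start all outbound calls concurrently, then await them as a batch.
    public static Task<string[]> FetchAllAsync(IEnumerable<string> urls) =>
        Task.WhenAll(urls.Select(url => Client.GetStringAsync(url)));
}
```

Under load, each in-flight batch multiplies the number of concurrent I/O completions the thread pool has to service, which is why this pattern is sensitive to thread pool starvation.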
You can detect thread pool starvation by adding Ben.BlockingDetector and outputting its warning messages to the log; it should hopefully highlight the sources.
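Wiring it up is typically just a middleware registration early in the pipeline. A sketch, assuming the Ben.BlockingDetector NuGet package is referenced; the namespace and extension method name may vary by version:

```csharp
using Ben.Diagnostics;
using Microsoft.AspNetCore.Builder;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Logs a warning (with a stack trace) whenever a request thread blocks,
        // helping locate sync-over-async calls that starve the thread pool.
        app.UseBlockingDetection();

        // ... rest of the pipeline (MVC, static files, etc.)
    }
}
```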
Thank you @benaadams, this looks like a very useful tool and I'll check it out when I have time.
The rest of the application performs within our specifications when using Libuv, even after transitioning from CurlHttpHandler to SocketsHttpHandler in .NET Core 2.1.
As for the CPU usage itself, why would CPU usage increase if the Kestrel server is waiting for a task to start?
> As for the CPU usage itself, why would CPU usage increase if the Kestrel server is waiting for a task to start?
Are you creating Task.Delays faster than their timeouts and not cancelling them (so you end up with a large build-up of waiting timers)? Or creating HttpClients and not disposing of them (not using HttpClientFactory or a static HttpClient)?
I don't have any Task.Delays in my codebase. I am creating HttpClients, but they're instantiated in classes that are registered in the DI framework with singleton scope (and I dispose of them afterwards).
The only thing I can think of is that I have a single CPU-bound (synchronous) operation per request, created with Task.Run(() => ...), that I eventually wait on with await Task.WhenAll(...). I use the WhenAll to wait for both the CPU-bound and IO-bound operations to complete.
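As a sketch of that per-request shape — `ComputeSynchronously` and `CallBackendsAsync` are hypothetical stand-ins for the real work described above:

```csharp
using System.Threading.Tasks;

public class RequestProcessorSketch
{
    public async Task HandleRequestAsync()
    {
        // One synchronous CPU-bound computation pushed onto the ThreadPool...
        Task cpuBound = Task.Run(() => ComputeSynchronously());

        // ...running alongside the IO-bound network calls...
        Task ioBound = CallBackendsAsync();

        // ...with both awaited together before the response is produced.
        await Task.WhenAll(cpuBound, ioBound);
    }

    private void ComputeSynchronously() { /* CPU-heavy work */ }
    private Task CallBackendsAsync() => Task.CompletedTask; // stand-in for network I/O
}
```

The relevant detail for this thread is that the `Task.Run` consumes a thread pool thread for the whole duration of the synchronous work, on the same pool the Sockets transport depends on.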
@sebastienros we are not using TPL either.
@sebastienros any further thoughts on this?
@akatz0813 @MatthewLymer your problems are not related to this issue, which might be confusing. Would you mind filing a new issue individually so I can track the cases independently? I assume this current issue could then be closed.
As for making any progress, we would really need an application that reproduces this behavior. The difference in perf is quite big, more than what we see in our own perf tests comparing Libuv and Sockets even in the worst case, and we test many types of apps (MVC, database, Razor, APIs). Only the pure JSON serialization ones so far show better perf for Libuv, and not by much. We obviously don't have enough coverage, and being able to repro your issues locally would be really helpful, at least to understand where the difference is coming from and then give guidance to other customers who face the same problem.
Would you be able to provide me with an application, as simple as possible, that would show this behavior? Or at least the profile of the logic that is applied on these specifically slow APIs? I already got some explanation from @MatthewLymer but the more details, the easier it will be for me to repro and investigate.
If you want to take it privately to send me some code, please email me at first.last@microsoft.com.
@sebastienros we'll work on simplifying our application to the point where we can share it. Thank you!
@sebastienros I'll look at creating a simple repro for you.
I am closing this issue because the original question has been answered.
However @MatthewLymer @hmvs @akatz0813 I am filing a more appropriate issue for your cases.
I asked a question on StackOverflow this morning about implementing a CI build that would fail if performance regressed too much. It occurred to me that Kestrel might have something like this using Benchmark.NET.
There's obviously a bunch of performance tests in Kestrel. But is the process for making sure egregious perf regressions don't occur ad hoc or is it systematized? If it's systematized, how did you do it?