amberframework / amber

A Crystal web framework that makes building applications fast, simple, and enjoyable. Get started with quick prototyping, less bugs, and blazing fast performance.
https://amberframework.org

[Performance] Look into TechEmpower Benchmarks slowness #273

Closed drujensen closed 6 years ago

drujensen commented 7 years ago

Description

While the performance isn't bad in the latest benchmarks, it should be 10 times better. Our internal testing shows at least 1 million requests per second on a smaller 32-core system, whereas the 80-core system that the benchmark uses only managed 100k requests per second.

faustinoaq commented 7 years ago

TechEmpower issue here :point_right: https://github.com/TechEmpower/FrameworkBenchmarks/issues/2960

timuckun commented 7 years ago

I was running some benchmarks on my machine and could not get amber to deliver better performance than roda until I removed the logging middleware. The logging of every route was slowing it down considerably.

Just FYI.
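
For reference, dropping the logger is just a matter of removing its pipe from the pipeline in config/routes.cr. A minimal sketch, assuming the generated Amber layout and the Amber::Pipe::Logger/Error/Static pipes (names may differ between Amber versions):

Amber::Server.configure do |app|
  pipeline :web do
    # plug Amber::Pipe::Logger.new   # commented out: no per-request logging
    plug Amber::Pipe::Error.new
    plug Amber::Pipe::Static.new
  end

  routes :web do
    get "/", HomeController, :index
  end
end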

BTW I would suggest you update your docs to show a sample of how to return json in routes. I figured out by looking at the kemal docs but it wouldn't hurt to throw a paragraph in there.
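
For anyone landing here, a minimal sketch of one way to return JSON from a route; it assumes a generated app where the controller exposes the request context as `context`, and StatusController is a made-up example, so adapt the names to your app:

# config/routes.cr (inside the routes block)
get "/status", StatusController, :show

# src/controllers/status_controller.cr
require "json"

class StatusController < ApplicationController
  def show
    context.response.content_type = "application/json"
    {status: "ok", amber: "fast"}.to_json
  end
end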

jasdeepsingh commented 7 years ago

I can confirm this finding by @timuckun. Here's a before and an after:

With the logging middleware:

➜ crystal-play wrk -d 60 -t 20 -c 1015 http://localhost:3000
Running 1m test @ http://localhost:3000
  20 threads and 1015 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    38.00ms  188.70ms   1.99s    95.92%
    Req/Sec   219.90    185.29     1.65k    73.13%
  228142 requests in 1.00m, 279.36MB read
  Socket errors: connect 0, read 1037, write 0, timeout 15484
Requests/sec:   3795.87
Transfer/sec:      4.65MB

Without logging middleware:

➜ crystal-play wrk -d 60 -t 20 -c 1015 http://localhost:3000
Running 1m test @ http://localhost:3000
  20 threads and 1015 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   123.85ms   31.79ms 227.73ms   82.70%
    Req/Sec   353.22     64.48   848.00     79.17%
  422256 requests in 1.00m, 517.06MB read
  Socket errors: connect 0, read 1287, write 0, timeout 0
Requests/sec:   7026.36
Transfer/sec:      8.60MB

PS: Please note that I'm running these benchmarks on a fairly old MacBook Pro with decent specs (which is probably also running multiple other services in the background).

jasdeepsingh commented 7 years ago

Finally, a run with logging disabled & release optimizations:

➜ crystal-play wrk -d 60 -t 20 -c 1015 http://localhost:3000
Running 1m test @ http://localhost:3000
  20 threads and 1015 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    59.22ms   17.25ms 223.96ms   78.81%
    Req/Sec   788.51    182.21     2.52k    78.09%
  942161 requests in 1.00m, 1.13GB read
  Socket errors: connect 0, read 1125, write 0, timeout 0
Requests/sec:  15677.57
Transfer/sec:     19.20MB

faustinoaq commented 7 years ago

@jasdeepsingh and @timuckun We are using a minimal middleware setup on TFB:

Amber::Server.configure do |app|
  routes :web do
    get "/plaintext", BenchmarkController, :plaintext
    get "/json", BenchmarkController, :json
    get "/db", BenchmarkController, :db
    get "/queries", BenchmarkController, :queries
    get "/updates", BenchmarkController, :updates
    get "/fortunes", BenchmarkController, :fortunes
  end
end

Not even pipelines here: frameworks/Crystal/amber/config/routes.cr

I still think the big slowdown comes from the Broken pipe errors shown here:

Server amber: Unhandled exception in spawn:
Server amber: Error writing file: Broken pipe (Errno)

The only difference between the crystal-raw logs and the amber logs is those errors.

I think we can compare crystal-raw vs amber side by side and see what is going on. Maybe the db wrapper or some static controller somewhere sits between the plaintext controller and the browser request.
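
For the crystal-raw side of that comparison, the baseline is essentially a bare HTTP::Server. A rough sketch using the current stdlib API (not the exact TFB source):

require "http/server"

server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello, World!"
end

# reuse_port allows running one process per core on the same port
server.bind_tcp "0.0.0.0", 8080, reuse_port: true
server.listen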

faustinoaq commented 7 years ago

Also, if we find the solution to this, it would help @sdogruyol optimize Kemal too πŸ˜„

drujensen commented 7 years ago

@faustinoaq I agree but I don't think there is any difference between crystal-raw and amber. We ran both and they both get the same errors.

faustinoaq commented 7 years ago

@drujensen Oh, if crystal-raw is getting the same errors as amber, then multiple machines could be the clue. Did you try to run TFB on a 3-machine setup?

drujensen commented 7 years ago

@faustinoaq Yes, but I don't have access to the same-sized servers. At this point I can't duplicate the issue, so I have no way of resolving it.

elorest commented 7 years ago

I believe that crystal-raw got fewer broken pipe errors.


drujensen commented 7 years ago

@elorest Right, right. We are seeing the same errors, just fewer of them in crystal-raw. Amber is (obviously) doing more than raw does in each spawn, i.e. allocating a new controller per request, calling before and after filters, rendering templates. My guess is we are hitting some resource limit faster than crystal-raw.

I'm not sure this error is what is causing the slowness, though. It's possible that hitting the error creates some sort of delay, but I'm not convinced this is it yet.

eliasjpr commented 7 years ago

I was thinking that we could do some caching of the controllers; this would save the initialization and allocation of the class for subsequent requests. The idea is simple: basically keep controller instances in memory and just reset the context, which will set everything else in the controller.
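
A purely hypothetical sketch of that idea (this is not Amber's API): hold on to one controller instance per controller type and swap the context in, instead of allocating a fresh object for every request. It assumes the controller takes the context in its constructor and exposes a context setter, and it ignores the concurrency and nil questions raised in the next comment:

require "http/server"

# Stand-in for an Amber controller, only to make the sketch self-contained.
class FakeController
  property context : HTTP::Server::Context

  def initialize(@context : HTTP::Server::Context)
  end
end

# Hypothetical per-type cache: reuse the instance, reset its context per request.
class ControllerCache(T)
  @controller : T? = nil

  def fetch(context : HTTP::Server::Context) : T
    if controller = @controller
      controller.context = context  # reset per-request state here
      controller
    else
      @controller = T.new(context)
    end
  end
end

# Usage per request: controller = cache.fetch(http_context)
cache = ControllerCache(FakeController).new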

elorest commented 7 years ago

If we can verify any sort of speed increase then I'm in. This is what we had originally, though, and it was frustrating because everything was potentially nil.


eliasjpr commented 7 years ago

Yes, maybe you can come up with a proof of concept. I know you love speed 🚀 :)

faustinoaq commented 7 years ago

I'm thinking about something. Currently the GC is multithreaded, so every process is launching GC threads even when 100% of the 80 cores are busy.

From gitter/IRC:

its so interesting how much faster it is 22.5 vs 18s
well, on my laptop now switch.cr is 41 seconds, tail-call.cr is 38 seconds
thats like a 20% improvement
and your processor has a 100Mhz advantage on turbo frequency
those small ipc improvements really add up
btw, on my laptop both versions seems to jump from core to core all the time
sure but that doesn't happen for e.g. C versions
why? the GC is multithreaded
ah yeah, i had to use GOMAXPROC=1 or something to get good results from golang versions
oh wow
jsn-, try GC_MARKERS=1
do crystal have something like that, btw?
ah, ok
and try GC_NPROCS=1
you'll probablyt find it slower
i mean try them seperately btw
jsn- itll be interesting to see which is faster
From IRC (bridge bot) @FromIRC 17:11
https://github.com/ivmai/bdwgc/blob/master/doc/README.environment#L54
the note here is interesting
no, it does exactly that, now 1 core is pegged at 100% and the time is better, 35 seconds for tail-call
using which
GC_MARKERS=1
i bet using NPROCS is slower
user 0m35.715s
about the same
cool

Perhaps Kemal & Amber are conflicting with the multithreaded GC, so their results are lower than crystal-raw's.

Maybe we can set some variables in setup.sh here, here & here:

#!/bin/bash

fw_depends postgresql crystal

shards install

export GC_MARKERS=1

export GC_NPROCS=1

crystal build --release --no-debug src/amber.cr

for i in $(seq 1 $(nproc --all)); do
  ./amber &   # (or ./kemal & for the Kemal setup.sh)
done

wait
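
If it helps, the same idea can be sanity-checked locally on a single process; GC_MARKERS, GC_NPROCS and GC_PRINT_STATS are all bdwgc environment variables documented in its README.environment:

# one instance, single GC marker thread, GC stats printed to stderr
GC_MARKERS=1 GC_NPROCS=1 GC_PRINT_STATS=1 ./amber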

My guess is we are hitting some resource limit faster than crystal-raw

Maybe crystal-raw is getting such high RPS on the plaintext test because it doesn't have problems with the GC.

Meanwhile, Kemal & Amber use layers of abstraction, classes, and modules that can trigger the multithreaded GC earlier. Just imagine 80 processes each launching multiple GC threads across cores, all competing for CPU time; it seems insane.

WDYT @RX14 @sdogruyol ?

RX14 commented 7 years ago

The broken pipe errors are misleading: they happen when wrk closes the connections to the server as it finishes the benchmark, so they don't affect performance. crystal-raw has fewer of these errors simply because it has lower latency and is therefore servicing fewer requests at the moment of disconnect.

Playing with the GC would be fun, but really we need access to the real environment to do this kind of debugging.

drujensen commented 7 years ago

@faustinoaq I think you're right on track and agree that having 16 GC threads per process and then launching 80 processes is a possible culprit. I was originally under the impression that Crystal only launched one thread for GC.

@RX14 is right: until we can duplicate the issue, we won't be able to verify that reducing the GC threads will solve it.

If anyone can provide a large 80-processor box and has time to set up the TechEmpower environment, we might be able to duplicate the issue and then resolve the slowness.

RX14 commented 7 years ago

You should ask the folks at TechEmpower; they probably have a box that size.

drujensen commented 7 years ago

@RX14 Good idea. I've made the request to see if they will do a run for us since we can't duplicate the issue.

drujensen commented 7 years ago

@faustinoaq Can you make the PR with your recommended changes to the TechEmpower benchmark and let Nate know in the issue? https://github.com/TechEmpower/FrameworkBenchmarks/issues/2960

faustinoaq commented 7 years ago

@drujensen Done πŸ‘ https://github.com/TechEmpower/FrameworkBenchmarks/pull/3001

timuckun commented 7 years ago

You could apply for this https://www.cncf.io/announcement/2017/08/08/cloud-native-computing-foundation-expands-developer-access-demand-hardware-new-infrastructure-lab/

CNCF and Packet have partnered up to provide heavy-duty hardware for open source projects. It costs nothing.


faustinoaq commented 7 years ago

@timuckun Interesting, I didn't know about CNCF πŸ‘

eliasjpr commented 6 years ago

Should we close this?

faustinoaq commented 6 years ago

Should we close this?

@eliasjpr Well, amber is still pretty slow on plaintext and json :sweat_smile:

https://www.techempower.com/benchmarks/#section=data-r15&hw=ph&test=plaintext&l=hma7sv

elorest commented 6 years ago

@eliasjpr It's definitely not resolved. I'm not sure it can be, though, without the same environment to test in. In our tests it's still fast, even on powerful servers. We've tried everything we can think of to solve the slowness in the benchmarks. I'd recommend withdrawing until we have time to make solving the benchmarks a priority.

robacarp commented 6 years ago

I think I'd prefer to focus on feature completeness rather than top performance. Crystal is so much faster than the norm for web applications that I think we can afford to let the numbers slide a little while pursuing a well-rounded feature suite instead.

elorest commented 6 years ago

@robacarp I agree, although I think we should also focus on performance as we write code. Sometimes simply using one method for string manipulation over another can make a huge difference in speed over a few thousand requests. Unfortunately, this benchmark ends up being way slower for Amber than our own benchmarks (server to server), even on far weaker boxes. We've tried multiple things and wasted 40+ hours trying to get the configuration right so that the benchmark works well with Amber. Because of this, I think we should temporarily withdraw until we have time to focus solely on that.

faustinoaq commented 6 years ago

Hi, relevant issue here: https://github.com/crystal-lang/crystal-db/issues/77

faustinoaq commented 6 years ago

Interesting comments:

Although I don't generally put too much stock in benchmarks, I can't help but notice that on the TechEmpower framework benchmarks, Crystal's performance is very high relative to the competition except when a database query is involved.

For instance, on the JSON Serialization and Plaintext tests, which do not require a database connection, Crystal's highest-ranked implementation marks 88.4% and 79.6% respectively relative to the highest performers. However, on the four tests involving a database (Single Query, Multiple Queries, Fortunes, and Data Updates), Crystal marks 40.3%, 36.9%, 45.9%, and 17.8% respectively.

In particular, the Data Updates performance seems low, with several Ruby implementations outperforming the Crystal implementations. This leads me to believe there must be some low-hanging fruit to improve the performance of database connections in general.

Is anyone aware of optimizations other languages/frameworks are using to improve the database access?

Dart in particular seems to be benefitting from fast database access as it does poorly on JSON serialization and Plaintext but is extremely competitive on the Multiple Queries and Data Updates tests.

https://github.com/crystal-lang/crystal-db/issues/81#issue-301455105

faustinoaq commented 6 years ago

Hey I found this: https://github.com/crystal-lang/crystal/issues/5610 (Too many heap sections)

eliasjpr commented 6 years ago

Not sure what that means in the context of the tests. Can you elaborate please?

faustinoaq commented 6 years ago

Well, TFB is a high-load environment. Maybe amber isn't using all the available memory because too many heap sections are allocated. WDYT?

the loop got to 17690000 before crashing, and only used 8.6GB of RAM at its peak.

https://github.com/crystal-lang/crystal/issues/5610#issue-289976225

TFB has more RAM than 8.6GB.
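
If someone wants to check that hypothesis under load, Crystal exposes the collector's view of the heap through GC.stats; a small sketch (it assumes burning a fiber on periodic logging is acceptable):

# spawn somewhere at boot to log heap growth while wrk is hammering the server
spawn do
  loop do
    stats = GC.stats
    puts "heap=#{stats.heap_size} free=#{stats.free_bytes} since_gc=#{stats.bytes_since_gc}"
    sleep 5.seconds
  end
end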

RX14 commented 6 years ago

Did we get the error in the logs? no

I haven't a clue why this is being discussed.

faustinoaq commented 6 years ago

The Fortunes test requirements state: "Use of an in-memory cache of Fortune objects or rows by the application is not permitted."

Does this preclude low-level database driver optimizations such as server protocol request de-duplication? I.e. If two Fortunes queries arrive at the database driver at approximately the same time, the driver could choose to send a single query to the database and then serve both clients from a single server response....

...Except where noted, all database queries should be delivered to the database servers as-is and not coalesced or deduplicated at the database driver. In all cases where a database query is required, it is expected that the query will reach and execute on the database serverβ€”that is, the full round-trip effort of a query will be part of the workload.

^ Hey, a small update about the DB tests

https://groups.google.com/forum/#!topic/framework-benchmarks/za-Z4MfUUlQ

faustinoaq commented 6 years ago

(screenshot_20180516_154258: TFB results showing Amber's numbers close to crystal's)

Oh, yeah! It now looks like Amber is as fast as crystal on TFB :tada: :smile:

Ref: Results

Ref: Citrine Build

faustinoaq commented 6 years ago

Amber is fast, although it is still slow compared with "things" like h2o or ulib :sweat_smile:

A very interesting comment here: https://github.com/TechEmpower/FrameworkBenchmarks/pull/3751#issuecomment-390055572

Currently the best explanation I have is that the database connection pools are properly sized (well, it is still a work in progress). Otherwise the implementation uses pretty standard techniques:

Ref: https://github.com/TechEmpower/FrameworkBenchmarks/pull/3751
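
For context, crystal-db sizes its connection pool from query parameters on the database URI, so "properly sized" pools in a Crystal entry look roughly like this (host, credentials, and numbers here are illustrative, not the actual TFB configuration):

require "db"
require "pg"  # assumes the crystal-pg driver is listed in shard.yml

# Pool sizing comes from URI params understood by crystal-db:
# initial_pool_size, max_pool_size, max_idle_pool_size, checkout_timeout, ...
url = "postgres://benchmarkdbuser:benchmarkdbpass@tfb-database/hello_world" +
      "?initial_pool_size=28&max_pool_size=28&max_idle_pool_size=28"

DB.open(url) do |db|
  puts db.scalar("SELECT 1")  # just proving the pool hands out a connection
end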

drujensen commented 6 years ago

@faustinoaq this is great news and I'm interested to find out what changed in the Benchmarks that finally fixed this. @elorest and I knew that our local testing showed way better results than earlier rounds but we couldn't pinpoint why.

I think we can finally close this issue.

RX14 commented 6 years ago

I think it's just the change in hardware that's used in benchmarking?