Closed: drujensen closed this issue 6 years ago.
TechEmpower issue here :point_right: https://github.com/TechEmpower/FrameworkBenchmarks/issues/2960
I was running some benchmarks on my machine and could not get amber to deliver better performance than roda until I removed the logging middleware. The logging of every route was slowing it down considerably.
Just FYI.
BTW, I would suggest you update your docs to show a sample of how to return JSON in routes. I figured it out by looking at the Kemal docs, but it wouldn't hurt to add a paragraph about it.
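Until the docs cover it, here is a minimal framework-agnostic sketch using only Crystal's standard library (Amber's controller API differs in its details, so treat the handler shape as illustrative); the two steps are the same in Kemal or Amber: set the content type, then serialize to the response body.

```crystal
require "http/server"
require "json"

# Minimal JSON endpoint with only the standard library. The same two
# steps apply inside an Amber controller action or a Kemal route:
# set the content type, then write JSON to the response.
server = HTTP::Server.new do |context|
  context.response.content_type = "application/json"
  {message: "Hello, World!"}.to_json(context.response)
end

address = server.bind_tcp "127.0.0.1", 3000
puts "Listening on http://#{address}"
server.listen
```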
I can confirm this finding by @timuckun. Here's a before and an after:
With the logging middleware:
```
➜ crystal-play wrk -d 60 -t 20 -c 1015 http://localhost:3000
Running 1m test @ http://localhost:3000
  20 threads and 1015 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    38.00ms  188.70ms   1.99s    95.92%
    Req/Sec   219.90    185.29     1.65k    73.13%
  228142 requests in 1.00m, 279.36MB read
  Socket errors: connect 0, read 1037, write 0, timeout 15484
Requests/sec:   3795.87
Transfer/sec:      4.65MB
```
Without logging middleware:
```
➜ crystal-play wrk -d 60 -t 20 -c 1015 http://localhost:3000
Running 1m test @ http://localhost:3000
  20 threads and 1015 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   123.85ms   31.79ms  227.73ms   82.70%
    Req/Sec   353.22     64.48    848.00     79.17%
  422256 requests in 1.00m, 517.06MB read
  Socket errors: connect 0, read 1287, write 0, timeout 0
Requests/sec:   7026.36
Transfer/sec:      8.60MB
```
PS: Please note that I'm running these benchmarks on a quite old MacBook Pro with decent specs (which is probably also running multiple other services in the background).
Finally a run with logging disabled & release optimizations:
```
➜ crystal-play wrk -d 60 -t 20 -c 1015 http://localhost:3000
Running 1m test @ http://localhost:3000
  20 threads and 1015 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    59.22ms   17.25ms  223.96ms   78.81%
    Req/Sec   788.51    182.21     2.52k    78.09%
  942161 requests in 1.00m, 1.13GB read
  Socket errors: connect 0, read 1125, write 0, timeout 0
Requests/sec:  15677.57
Transfer/sec:     19.20MB
```
@jasdeepsingh and @timuckun We are using a minimal middleware on TFB:
```crystal
Amber::Server.configure do |app|
  routes :web do
    get "/plaintext", BenchmarkController, :plaintext
    get "/json", BenchmarkController, :json
    get "/db", BenchmarkController, :db
    get "/queries", BenchmarkController, :queries
    get "/updates", BenchmarkController, :updates
    get "/fortunes", BenchmarkController, :fortunes
  end
end
```
Not even pipelines here; see frameworks/Crystal/amber/config/routes.cr
I still think the big slowdown is the broken pipe errors shown here:

```
Server amber: Unhandled exception in spawn:
Server amber: Error writing file: Broken pipe (Errno)
```
The only difference between the crystal-raw logs and the amber logs is those errors.
I think we can compare crystal-raw and amber side by side and see what is going on. Maybe the DB wrapper or some static controller somewhere sits between the plaintext controller and the browser request.
Also, if we find the solution to this, it would help @sdogruyol optimize Kemal too :smile:
@faustinoaq I agree but I don't think there is any difference between crystal-raw and amber. We ran both and they both get the same errors.
@drujensen Oh, if crystal-raw is getting the same error as amber, then multiple machines will be the clue. Did you try to run TFB on a 3-machine setup?
@faustinoaq yes, but I don't have access to the same sized servers. At this point I can't duplicate the issue so I have no way of resolving it.
I believe that crystal-raw got less broken pipe errors.
@elorest right, right. We are seeing the same errors, just fewer of them in crystal-raw. Amber is (obviously) doing more than raw does in each spawn, i.e. allocating a new controller per request, calling before and after filters, and rendering templates. My guess is we are hitting some resource limit faster than crystal-raw does.
I'm not sure if this error is what is causing the slowness though. It's possible that hitting the error creates some sort of delay, but I'm not convinced this is it yet.
I was thinking that we could do some caching of the controller class; this would save the initialization and allocation of the class for subsequent requests. The idea is simple: keep controller instances in memory and just reset the context, which resets everything else in the controller.
If we can verify any sort of speed increase then I'm in. This is what we had originally, though, and it was frustrating because everything was potentially nil.
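The caching idea above can be sketched like this (all names here are illustrative, not Amber's actual internals): one long-lived controller instance per route, with only its context reset per request.

```crystal
# Hypothetical sketch of the controller-caching idea. Context,
# Controller, CONTROLLERS and dispatch are illustrative names only.
class Context
  property path : String = "/"
end

class Controller
  # A nilable context is the cost of this design: this is the
  # "everything was potentially nil" frustration mentioned above.
  property context : Context?

  def call : String
    "handled #{context.try(&.path)}"
  end
end

# Cache: one long-lived instance per controller, reused across requests.
CONTROLLERS = {} of String => Controller

def dispatch(path : String) : String
  controller = CONTROLLERS["bench"] ||= Controller.new
  controller.context = Context.new.tap { |ctx| ctx.path = path } # reset, don't reallocate
  controller.call
end

dispatch("/plaintext")
dispatch("/json") # the same Controller object serves every request
```

The trade-off is exactly what @elorest describes: because the cached instance outlives any single request, its per-request state has to be nilable and reset by hand.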
Yes, maybe you can come up with a proof of concept. I know you love the speed :)
I'm thinking about something: currently the GC is multithreaded, so every process launches GC threads even when 100% of the 80 cores are busy.
From gitter/IRC:
```
its so interesting how much faster it is 22.5 vs 18s
well, on my laptop now switch.cr is 41 seconds, tail-call.cr is 38 seconds
thats like a 20% improvement
and your processor has a 100Mhz advantage on turbo frequency
those small ipc improvements really add up
btw, on my laptop both versions seems to jump from core to core all the time
sure but that doesn't happen for e.g. C versions
why? the GC is multithreaded
ah yeah, i had to use GOMAXPROC=1 or something to get good results from golang versions
oh wow
jsn-, try GC_MARKERS=1
do crystal have something like that, btw?
ah, ok
and try GC_NPROCS=1
you'll probably find it slower
i mean try them separately
btw jsn- itll be interesting to see which is faster
From IRC (bridge bot) @FromIRC 17:11
https://github.com/ivmai/bdwgc/blob/master/doc/README.environment#L54
the note here is interesting
no, it does exactly that, now 1 core is pegged at 100% and the time is better, 35 seconds for tail-call
using which
GC_MARKERS=1
i bet using NPROCS is slower
user 0m35.715s
about the same
cool
```
Perhaps Kemal & Amber are conflicting with the multithreaded GC, so their results are lower than crystal-raw.
Maybe we can set some variables in setup.sh here, here & here:
```bash
#!/bin/bash
fw_depends postgresql crystal

shards install

# Limit the Boehm GC to a single marker thread per process
export GC_MARKERS=1
export GC_NPROCS=1

crystal build --release --no-debug src/amber.cr

# One server process per core (use ./kemal & for the Kemal benchmark)
for i in $(seq 1 $(nproc --all)); do
  ./amber &
done
wait
```
> My guess is we are hitting some resource limit faster than crystal-raw

Maybe crystal-raw is getting such high RPS on the plaintext test because it doesn't have problems with the GC. Meanwhile, Kemal & Amber use layers of abstraction, classes and modules that can invoke the multithreaded GC earlier. Just imagine 80 processes each launching multiple threads across cores, competing for CPU time; it seems insane.
WDYT @RX14 @sdogruyol ?
The broken pipe errors are misleading: they happen when wrk closes the connections to the server as it finishes the benchmark, and so they don't affect performance. crystal-raw has fewer of these errors simply because its latency is lower and it is therefore servicing fewer requests at the moment of disconnect.
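For completeness, a sketch of what swallowing those disconnects could look like (modern Crystal surfaces a write to a closed socket as `IO::Error`; at the time of this thread it surfaced as `Errno`, as in the log above):

```crystal
require "http/server"

# When wrk tears down its 1015 connections at the end of a run, any
# fiber still mid-write hits EPIPE. Rescuing around the write keeps the
# logs quiet; per the point above, the errors themselves were already
# harmless for the measured throughput.
server = HTTP::Server.new do |context|
  begin
    context.response.content_type = "text/plain"
    context.response.print "Hello, World!"
  rescue IO::Error
    # Client already disconnected (e.g. wrk finished the benchmark).
  end
end

server.bind_tcp "0.0.0.0", 3000
server.listen
```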
Playing with GC would be fun, but really we need access to the real environment to do these kinds of debugging.
@faustinoaq I think you're right on track and agree that having 16 GC threads per process and then launching 80 processes is a possible candidate. I was originally under the impression crystal only launched one thread for GC.
@RX14 is right, until we can duplicate the issue, we won't be able to verify that reducing the GC threads will solve it.
If anyone can provide a large 80 processor box and has time to setup the TechEmpower environment, we might be able to duplicate the issue and then resolve the slowness.
You should ask the folks at TechEmpower; they probably have a box that size.
@RX14 Good idea. I've made the request to see if they will do a run for us since we can't duplicate the issue.
@faustinoaq Can you make the PR with your recommended changes to the TechEmpower benchmark and let Nate know in the issue? https://github.com/TechEmpower/FrameworkBenchmarks/issues/2960
@drujensen Done :smile: https://github.com/TechEmpower/FrameworkBenchmarks/pull/3001
You could apply for this https://www.cncf.io/announcement/2017/08/08/cloud-native-computing-foundation-expands-developer-access-demand-hardware-new-infrastructure-lab/
CNCF and packet have partnered up to provide heavy duty hardware for open source projects. It costs nothing.
@timuckun Interesting, I didn't know about CNCF :smile:
Should we close this?
@eliasjpr Well, amber is still pretty slow on plaintext and json :sweat_smile:
https://www.techempower.com/benchmarks/#section=data-r15&hw=ph&test=plaintext&l=hma7sv
@eliasjpr it's definitely not resolved. Not sure it can be though, without the same environment to test in. In our tests it's still fast even on powerful servers. We've tried everything we can think of to solve the slowness in the benchmarks. I'd recommend withdrawing until we have time to make solving the benchmarks a priority.
I think I'd prefer to focus on feature completeness rather than top performing efficiency. Crystal is so much faster than the norm for web applications, I think we can afford to let the numbers slide a little while pursuing a well rounded feature suite instead.
@robacarp I agree, although I think we should also focus on performance as we write code. Sometimes simply using one method for string manipulation over another can make a huge difference to speed over a few thousand requests. Unfortunately this benchmark ends up being way slower for Amber than our own benchmarks (server to server) even on far weaker boxes. We've tried multiple things and wasted 40+ hours, trying to get the configuration right so that the benchmark will work well with amber. Because of this I think we should temporarily withdraw until we have time to focus solely on that.
Hi, relevant issue here: https://github.com/crystal-lang/crystal-db/issues/77
Interesting comments:
> Although I don't generally put too much stock in benchmarks, I can't help but notice that on the TechEmpower framework benchmarks, Crystal's performance is very high relative to the competition except when a database query is involved.
>
> For instance, on the JSON Serialization and Plaintext tests, which do not require a database connection, Crystal's highest implementation marks 88.4% and 79.6% respectively relative to the highest performers. However, on the four tests involving a database, Single Query, Multiple Queries, Fortunes and Data Updates, Crystal marks 40.3%, 36.9%, 45.9% and 17.8% respectively.
>
> In particular, the Data Updates performance seems low, with several Ruby implementations outperforming the Crystal implementations. This leads me to believe there must be some low hanging fruit to improve the performance of database connections in general.
>
> Is anyone aware of optimizations other languages/frameworks are using to improve the database access?
>
> Dart in particular seems to be benefitting from fast database access, as it does poorly on JSON Serialization and Plaintext but is extremely competitive on the Multiple Queries and Data Updates tests.
https://github.com/crystal-lang/crystal-db/issues/81#issue-301455105
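On the pool-sizing point from that issue: crystal-db exposes its connection-pool knobs as URI query parameters, so a TFB implementation can size the pool per process. A sketch (the host, credentials and numbers here are illustrative, not the benchmark's actual configuration):

```crystal
require "db"
# require "pg"  # the crystal-pg shard provides the postgres driver

# crystal-db pool options are passed in the connection URI:
# initial_pool_size, max_pool_size, max_idle_pool_size, checkout_timeout.
# With N worker processes, max_pool_size per process is typically around
# (db_max_connections / N) so the database isn't oversubscribed.
DB.open("postgres://benchmarkdbuser:secret@db-host/hello_world?initial_pool_size=8&max_pool_size=56&max_idle_pool_size=56&checkout_timeout=3.0") do |db|
  puts db.scalar("SELECT COUNT(*) FROM fortune")
end
```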
Hey I found this: https://github.com/crystal-lang/crystal/issues/5610 (Too many heap sections)
Not sure what that means in the context of the tests. Can you elaborate please?
Well, TFB is a high-load environment. Maybe amber isn't using all the available memory because too many heap sections are allocated. WDYT?
> the loop got to 17690000 before crashing, and only used 8.6GB of RAM at its peak.
https://github.com/crystal-lang/crystal/issues/5610#issue-289976225
TFB has more RAM than 8.6GB.
Did we get the error in the logs? No.
I haven't a clue why this is being discussed.
The Fortunes test requirements state: "Use of an in-memory cache of Fortune objects or rows by the application is not permitted."
Does this preclude low-level database driver optimizations such as server protocol request de-duplication? I.e. If two Fortunes queries arrive at the database driver at approximately the same time, the driver could choose to send a single query to the database and then serve both clients from a single server response....
> ...Except where noted, all database queries should be delivered to the database servers as-is and not coalesced or deduplicated at the database driver. In all cases where a database query is required, it is expected that the query will reach and execute on the database server—that is, the full round-trip effort of a query will be part of the workload.
^ Hey, a small update about the DB tests:
https://groups.google.com/forum/#!topic/framework-benchmarks/za-Z4MfUUlQ
Oh, yeah! Now looks like Amber is as fast as crystal on TFB :tada: :smile:
Ref: Results
Ref: Citrine Build
Amber is fast, although still slow compared with "things" like h2o or ulib :sweat_smile:
A very interesting comment here: https://github.com/TechEmpower/FrameworkBenchmarks/pull/3751#issuecomment-390055572
> Currently the best explanation I have is that the database connection pools are properly sized (well, it is still a work in progress). Otherwise the implementation uses pretty standard techniques:
>
> - SO_REUSEPORT
> - writev(), so all fortunes are sent with one system call
> - avoiding memory allocations and copies as much as possible
> - extensive usage of the memory pool functionality provided by libh2o
> - memcmp() as the string comparison function

Ref: https://github.com/TechEmpower/FrameworkBenchmarks/pull/3751
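Of those techniques, SO_REUSEPORT at least maps directly onto the Crystal setups discussed earlier in this thread: each of the per-core processes from the setup.sh loop binds its own listener on the same port and the kernel spreads accepted connections across them. A sketch using the standard library's `reuse_port` flag on `HTTP::Server#bind_tcp`:

```crystal
require "http/server"

# One process per core, all binding port 3000 with SO_REUSEPORT, lets
# the kernel load-balance accepted connections instead of funneling
# everything through a single accept loop.
server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello, World!"
end

server.bind_tcp "0.0.0.0", 3000, reuse_port: true
server.listen
```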
@faustinoaq this is great news and I'm interested to find out what changed in the Benchmarks that finally fixed this. @elorest and I knew that our local testing showed way better results than earlier rounds but we couldn't pinpoint why.
I think we can finally close this issue.
I think it's just the change in hardware that's used in benchmarking?
Description
While the performance isn't bad in the latest benchmarks, it should be 10 times better. Our internal testing shows at least 1 million requests per second on a smaller 32-core system, whereas the 80-core system that the benchmark uses only reached 100k requests per second.