TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

Amber, Kemal, Kemalyst results in question #2960

Closed drujensen closed 6 years ago

drujensen commented 7 years ago

Hi Nate,

We are baffled by the results for the 3 crystal projects and hoping you can help shed some light on the 100k RPS results. We were expecting 1.8 to 2.0 million RPS range for these projects.

We have done multiple runs on different-sized systems in AWS trying to replicate the slow response times for plaintext and json in preview 2. @sdogruyol has also run the benchmarks and can confirm similar findings.

I'm attaching the results on a c4.xlarge 8-core system, and all three projects are showing 3x the throughput of preview 2. We have also run the tests on a 64-core 1.2GHz system and saw 10x the requests per second (~1 million RPS) of preview 2.

We followed the instructions provided for setting up the Linux environment (Ubuntu 14.04) and cannot determine what difference would cause the slowness.

We think that preview 2 is not correctly reflecting the performance of these 3 projects. It's as if the multiple processes or port reuse were not working properly. What we find interesting is that crystal-raw did not encounter the same issue and had 2.5M RPS, which is in line with our expectations.
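For context, the port-reuse setup in question has each worker process bind the same port with SO_REUSEPORT so the kernel spreads accepted connections across processes. A minimal sketch using the present-day Crystal API (the 2017 setup scripts differed in detail):

```crystal
require "http/server"

# One of N worker processes; each binds the same port with
# reuse_port: true so the kernel load-balances accepted
# connections across all of them.
server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello, World!"
end

server.bind_tcp "0.0.0.0", 8080, reuse_port: true
server.listen
```

If only one process actually ends up serving the port, throughput collapses to roughly what a single core can do, which would be consistent with the ~100k RPS numbers.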

We are aware of socket errors when there is heavy load, but we are not seeing this slow down the performance reflected in preview 2.

Any help in solving this mystery is appreciated. If possible, can you rerun the plaintext test for the 3 projects and let us know if you are still seeing 100k RPS results?

Thanks, Dru

results.zip

drujensen commented 7 years ago

https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Crystal/amber/setup.sh#L13

paulcsmith commented 7 years ago

Looks like the db results for all the crystal frameworks are pretty slow. Is this because the crystal db driver is just super slow? Seems like it could/should be much higher.

marksiemers commented 7 years ago

@paulcsmith - See RX-14's comment above about that being a different issue (than framework slowness vs crystal-raw slowness)

That said, I'm looking into it - whether or not prepared statements will help.

But yes, we think either crystal-db or crystal-pg would be the place to address the performance issue.
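For what it's worth, crystal-db exposes parameterized queries that the pg driver can send as prepared statements, so the plan is parsed once per connection rather than per request. A rough sketch of the benchmark's single-row query (the connection string and table names follow TFB conventions and are assumptions here, not verified against the repo):

```crystal
require "db"
require "pg"

# Hypothetical credentials/host; crystal-pg prepares parameterized
# queries, so repeated lookups reuse the parsed plan.
DB.open("postgres://benchmarkdbuser:secret@tfb-database/hello_world") do |db|
  id = Random.rand(1..10_000)
  row = db.query_one "SELECT id, randomnumber FROM world WHERE id = $1",
    id, as: {Int32, Int32}
end
```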

If you have any thoughts, it merits discussion in crystal's IRC or gitter.

paulcsmith commented 7 years ago

Oh I see. I was linked directly to your comment from elsewhere and missed that discussion. Sorry to muddy up this thread! Thanks for linking me to RX14's answer

marksiemers commented 7 years ago

@nbrady-techempower I'm running some tests locally in an attempt to figure out how we can fix this.

The tool to see results based on JSON is really helpful, but I don't think amber gets parsed because it hasn't been part of the official results yet. Is there an easy way for me to fix that? (Hacking on the JS, or otherwise?)

results.json - http://sprunge.us/iQXe Pasting it here: https://www.techempower.com/benchmarks/#section=test

I can see crystal-raw and kemal results with no problem, but amber doesn't show up. It is a pain to "manually" compute the stats from the JSON; if we can use that tool, it would be a great help.

faustinoaq commented 7 years ago

@marksiemers A workaround would be renaming Amber to Kemalyst

faustinoaq commented 7 years ago

@marksiemers No, kemalyst doesn't work, another workaround is renaming Kemal to Kemalyst and Amber to Kemal.

marksiemers commented 7 years ago

Thanks faustinoaq, that should work for now.

@nbrady-techempower - earlier you posted a comparison output that looked like this: Kemal Comparison

Is the tool that produced that available to us?

NateBrady23 commented 7 years ago

@marksiemers No, it's not currently available, but I don't mind taking some screen caps for you if you drop me a few results files to compare.

marksiemers commented 7 years ago

Sure thing - it may not be until tomorrow. I don't want to send you too many, but maybe I can get the list down to 3-4.

marksiemers commented 7 years ago

@nbrady-techempower - I just manually compared some test runs (seeing 25-60% improvements in db benchmarks). Thanks for the screen grab offer, but I don't think it'll be necessary.

What I would like to do is run the tests in a way that more mimics your 3 server setup.

I'm most familiar with AWS, is there a guide on how to get the 3 server setup running "in the cloud"?

I'd like to so we can try to debug why adding that many cores, or adding that internal network latency, is causing the slowness.

In my local testing (a virtual box running on a 2-core Macbook pro), I got ~450K rps. results here: http://sprunge.us/cTAH

If you want to see the results from that test run for amber using this, use the json below, and anywhere you see "phoenix" - that is amber: https://www.zerobin.net/?216090c5938d187f#K0gaz8Jxp7MzbjteLW1lSBpT614LSSmdVq4+XiLyIMw=

NateBrady23 commented 7 years ago

A single machine setup where localhost is the client, server, and database will definitely get you inflated results. I can get faster results on a virtual box vm for most tests because there's no network latency.

The installation guide should work fine for a three machine setup: http://frameworkbenchmarks.readthedocs.io/en/latest/Development/Installation-Guide/

And these are the specs: http://frameworkbenchmarks.readthedocs.io/en/latest/Project-Information/Environment/

RX14 commented 7 years ago

But why would network latency affect these frameworks so much but not crystal-raw? All are evented and use the same core HTTP abstraction. It makes little sense.

NateBrady23 commented 7 years ago

Doesn't make sense to me either. I don't know what else to tell you guys.

marksiemers commented 7 years ago

@RX14 - I can't wrap my head around it either. I can only speculate that the network latency between the DB and the app server causes a build up of database connections which creates memory bloat and/or GC overhead for the frameworks but somehow not for crystal-raw.

For amber, the connection would be held by the ORM through the models that exist - one of the ways it diverges from crystal raw.

In any case, I think our only option at this point is to guess and check on a 3 server setup. My thoughts on possible guesses:

  1. Mess with the connection pool settings, relative to the number of processors - perhaps this will save memory bloat and GC overhead (saw 25%-60% performance increase on db benchmarks*)
  2. Mess with the GC_MARKER number, relative to number of processors - it might take the right balance
  3. 100% gut everything DB related from kemal and amber, and see if there is still a discrepancy on the 3 server setup for plaintext and JSON.

*Not in an isolated environment, running a virtualbox on my laptop: https://github.com/marksiemers/FrameworkBenchmarks/commit/50811f9f8712b2015ef96c073fbdde8f3125a82a
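For concreteness, guesses 1 and 2 above amount to environment and connection-URL settings along these lines (values are illustrative, not the exact numbers from any branch):

```shell
# Boehm GC marker threads per process (the default scales with core count)
export GC_MARKERS=2

# crystal-db pool settings are passed as query parameters on the URL
export DATABASE_URL="postgres://benchmarkdbuser:secret@tfb-database/hello_world?initial_pool_size=8&max_pool_size=20"
```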

marksiemers commented 7 years ago

@nbrady-techempower Thank you for your help and patience.

Do you have a schedule for the next preview run and next official run?

RX14 commented 7 years ago

@marksiemers forget ENTIRELY about the DB. The frameworks lag in the plaintext test which never even touches the DB.

RX14 commented 7 years ago

I propose 3 tests that the techempower guys could run to help us troubleshoot this:

NateBrady23 commented 7 years ago

@RX14 It's unlikely that we'll be able to do those for you soon, since our SC environment is currently doing continuous benchmarking. We're trying to get a preview out asap while working on some other things internally. If we do bring the continuous run to a stop at some point, I'll see if I can squeeze that in for you but you would be better off trying to set up a similar environment.

marksiemers commented 7 years ago

@nbrady-techempower - For the 3 tests proposed by RX14, do you need anything from us, in case the opportunity opens up?

NateBrady23 commented 7 years ago

@marksiemers no, but as I pointed out earlier in this thread, amber is not passing in that environment and the logs didn't provide an immediate clue as to why. You're also welcome to make a pull request to change the ports; that would show up in a run over the weekend.

marksiemers commented 7 years ago

Unfortunately, I haven't been able to reproduce that error either.

The good news is that kemal is experiencing the same issue, so if we solve it for kemal, hopefully the same fix will work for amber.

marksiemers commented 7 years ago

@nbrady-techempower - Does the PR have to be merged in, or will each open PR be run this weekend?

If each open PR is run, is it ok to open multiple PRs for the crystal frameworks? To test different hypotheses about how to fix the discrepancy between crystal-raw, kemal, and amber.

NateBrady23 commented 7 years ago

@marksiemers Unfortunately it has to be merged in, and a continuous run takes about 4 days to fully complete.

NateBrady23 commented 7 years ago

Here are the kemal results from the weekend. They look very much in line with the results I posted last week. Looks like the port change did not make any difference.

image

marksiemers commented 7 years ago

@nbrady-techempower - Thank you for the prompt update. If you have any info to confirm or correct my hypothesis below, please weigh in.

Also, is there any time this week to try any of @RX14 suggestions of htop and/or not running the db tests for kemal and amber?

@RX14 - I know you're going to kill me if I say database one more time, but in the absence of other ideas, what about this hypothesis:

  1. At least one of the database tests runs before the plaintext and json tests.
  2. If there is no db connection limit, then each running process (80 of them) tries to create an "unlimited" number of connections to postgres. Postgres gets slammed and takes a long time to return any data.
  3. Those connections remain in memory, not idle but waiting for a response. This either eats all the memory and/or causes hell for the GC.
  4. For some reason (the ORM for amber, get(path, &block) for kemal), these memory problems occur only in the frameworks because those connections stay in scope.

Admittedly, this is grabbing at straws, but I'm not sure what else to look for.

AFAIK, setting a connection limit in the database url does no harm (with local testing).

@RX14 @sdogruyol @drujensen Given that, does anyone think it is worth a shot (tweaking db connection settings)?
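If it helps, crystal-db already accepts pool limits as URL parameters, so capping connections is a one-line change. A sketch with illustrative numbers (20 per process x 80 processes = 1,600 connections at most):

```crystal
require "db"
require "pg"

# Illustrative values: cap each process at 20 connections so 80
# processes can never hold more than 1600 against Postgres, and
# time out a pool checkout after 3 seconds instead of queueing forever.
url = "postgres://benchmarkdbuser:secret@tfb-database/hello_world" \
      "?initial_pool_size=8&max_pool_size=20&checkout_timeout=3.0"
db = DB.open(url)
```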

NateBrady23 commented 7 years ago

@marksiemers @RX14 Do me a favor if you could. Fork the repo and create a branch for each different test configuration you'd like to test against. Don't open any PR's here for them, just point me to the repo with all the different configurations, and I'll see if I can get you some results for each branch. Please don't modify anything but the tests you'd like to run. The current run should finish tomorrow or Wednesday, and I'll see what I can do for you.

elorest commented 7 years ago

@nbrady-techempower @RX14 @drujensen @marksiemers

One thing that stands out to me when reading through the backlog of comments and things that have been tried and assumed: if this were a GC issue, we should also see the same errors while benchmarking on our own servers. @drujensen and I tested on a 36-core server and didn't have any issues.

Also, since each core is running its own process, having more cores shouldn't increase garbage collection issues across all processes, unless of course they somehow used all the memory or something.

It's weird that adding a tiny bit of local network latency would drop the results so dramatically, since it doesn't with crystal raw.

Just things to think about that might help.

marksiemers commented 7 years ago

@nbrady-techempower - Sure thing. What timeframe are you expecting?

NateBrady23 commented 7 years ago

@marksiemers I can fit it in on Wednesday morning probably.

marksiemers commented 6 years ago

EDIT: See more recent comment with the updated branch list.

@nbrady-techempower Here is the fork: https://github.com/marksiemers/FrameworkBenchmarks

There are three branches to test if you have time:

I tested all of them locally in verify mode, and everything passed, so hopefully, there won't be any issues running the benchmarks.

drujensen commented 6 years ago

I don't know if we want to set GC_MARKERS to 40. Wouldn't that launch 40 threads per process, so 40x40=1600 threads? Or is this setting somehow shared between processes?

https://github.com/marksiemers/FrameworkBenchmarks/blob/use-half-cpus/frameworks/Crystal/amber/setup.sh#L12

What if we set this to 2? 40x2 = 80.

marksiemers commented 6 years ago

@drujensen - I don't know what we want to do at all. What we've tried up to this point hasn't worked. I'm trying not to out-smart myself, just trying to gather data about what does and doesn't work.

That said, if you think that will make a difference, I've given you access to that fork. You can add a branch or modify use-half-cpus

drujensen commented 6 years ago

@marksiemers OK, I updated GC_MARKERS=2 for all three frameworks. I also changed crystal-raw to only use half the processors so we have something to compare against.

NateBrady23 commented 6 years ago

@drujensen @marksiemers The current run still has ~11 hours left. If you need to get anything else in to that repo, you've got some time. I'll get you some numbers tomorrow morning.

marksiemers commented 6 years ago

@RX14 - Any additional ideas?

Both kemal and amber use radix for routing while crystal-raw uses a case statement. It hasn't caused issues in our testing, but maybe it makes a difference in the TechEmpower environment.

faustinoaq commented 6 years ago

@marksiemers you're right, I compared round 15 preview 2 results again and again and I have concluded:

Database tests

amber and kemal are very similar to crystal-raw on the database tests. I think that's because the database is the main overhead, so no problem here 👍 the results are acceptable

screenshot_20171115_135312

screenshot_20171115_135330

screenshot_20171115_135350

screenshot_20171115_135511

screenshot_20171115_135530

Plaintext & Json

However, on the highest-RPS tests (plaintext, json) crystal-raw has a huge lead over amber & kemal. Maybe the current radix routing algorithm implementation by @luislavena is causing some overhead when resolving simple paths like /plaintext & /json. I think we can use a handler with context.request.path == "/plaintext" instead of the radix algorithm to catch those tests in both kemal & amber. WDYT @sdogruyol ?
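A sketch of that idea as an HTTP::Handler placed ahead of the framework's router (the handler name is made up; the stdlib handler chain and call_next are real Crystal APIs):

```crystal
require "http/server"

# Hypothetical handler that short-circuits the two hot benchmark
# paths before the radix-based router ever runs.
class FastPathHandler
  include HTTP::Handler

  def call(context)
    case context.request.path
    when "/plaintext"
      context.response.content_type = "text/plain"
      context.response.print "Hello, World!"
    when "/json"
      context.response.content_type = "application/json"
      context.response.print %({"message":"Hello, World!"})
    else
      call_next(context) # everything else falls through to the router
    end
  end
end
```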

screenshot_20171115_135226

screenshot_20171115_135250

drujensen commented 6 years ago

@faustinoaq why do you think it's the radix algorithm? Is it blocking or doing something that makes you suspect it is causing this issue?

marksiemers commented 6 years ago

EDIT: Or you can be smart like @faustinoaq and benchmark case vs radix

@drujensen - Since we can't reproduce the issue, we don't have any "evidence" of what is causing it.

I've been trying to approach it more from a deduction strategy rather than debugging - since we don't really have debugging tools available at this point. Or, if you like, the Gregory House approach - just start treatment and see if it gets better.

All we can do is look at what is different between crystal-raw and the same with amber and kemal.

For plaintext and json, that list isn't too big. Routing is one. For radix, the cost in time for me to learn exactly how that shard works and speculate about what might go wrong is a relatively high cost.

The cost of coding a handler that responds before the router is touched is relatively low - and the outcome won't be hypothetical - it will be real (well as real as benchmarks can be).

@faustinoaq - Do you already have in mind how to code this?

faustinoaq commented 6 years ago

@drujensen Yeah, I compared radix vs the switch statement (aka case) used by crystal-raw here, and the results say radix is almost 80x slower than a case statement at finding a simple path like /plaintext

require "benchmark"
require "radix"

Tree = Radix::Tree(Symbol).new
Tree.add "/plaintext", :plaintext

def radix(path)
  Tree.find(path).found?
end

def switch(path)
  case path
  when "/plaintext" then true
  end
end

pp radix("/plaintext")  # => true
pp switch("/plaintext") # => true

Benchmark.ips do |x|
  x.report("radix") do
    radix("/plaintext")
  end
  x.report("switch") do
    switch("/plaintext")
  end
end
$ crystal build --release --no-debug -s radix_vs_switch.cr
Parse:                             00:00:00.0019190 (   0.19MB)
Semantic (top level):              00:00:00.2499810 (  28.12MB)
Semantic (new):                    00:00:00.0012620 (  28.12MB)
Semantic (type declarations):      00:00:00.0216570 (  36.12MB)
Semantic (abstract def check):     00:00:00.0026250 (  36.12MB)
Semantic (ivars initializers):     00:00:00.0032640 (  36.12MB)
Semantic (cvars initializers):     00:00:00.0168870 (  36.12MB)
Semantic (main):                   00:00:00.2823110 (  68.12MB)
Semantic (cleanup):                00:00:00.0010590 (  68.12MB)
Semantic (recursive struct check): 00:00:00.0008610 (  68.12MB)
Codegen (crystal):                 00:00:00.3289690 (  68.18MB)
Codegen (bc+obj):                  00:00:18.9583230 (  68.18MB)
Codegen (linking):                 00:00:00.1888350 (  68.18MB)

Codegen (bc+obj):
 - no previous .o files were reused
$ ./radix_vs_switch
radix("/plaintext") # => true
switch("/plaintext") # => true
 radix   2.12M (471.24ns) (± 6.11%) 87.19× slower
switch 185.03M (   5.4ns) (±11.41%)       fastest
$ ./radix_vs_switch
radix("/plaintext") # => true
switch("/plaintext") # => true
 radix   1.91M (524.01ns) (±14.57%) 100.87× slower
switch  192.5M (  5.19ns) (±11.91%)        fastest
$ ./radix_vs_switch
radix("/plaintext") # => true
switch("/plaintext") # => true
 radix   2.17M (460.29ns) (±17.94%) 83.85× slower
switch 182.18M (  5.49ns) (±10.95%)       fastest

See more result 👉 https://gist.github.com/faustinoaq/896f821f47711c6c059c5eca23030a42

drujensen commented 6 years ago

@faustinoaq @marksiemers OK, that makes sense. I didn't realize it was that much slower. I understand where you're heading and think the test has some validity. If we find that radix is the cause, we can look at fixing it or replacing it.

marksiemers commented 6 years ago

Assuming this is our smoking gun, this is a great lesson learned.

In fact, it may inform some design decisions with amber - like our static routes - we may want to bypass or intercept a request before it hits the router.

What is still bizarre is that we didn't see the same slowdown in our tests.

@faustinoaq - What hardware did you run your tests on? Can you reproduce the TFB slowness on that machine?

luislavena commented 6 years ago

Interesting findings @faustinoaq.

I think comparing a simple case statement with any URL-parsing/matching mechanism will not be fair, whether that mechanism is powered by Radix or any other library (🍎 != 🍊).

For example, Radix allocates several instances of Char::Reader during lookup which might add GC pressure.

In my tests, the difference between Radix and case statements gets reduced by disabling GC (GC.disable).

Definitely there is room for improvement in Radix, as the information shows here.

Cheers.

marksiemers commented 6 years ago

For the record, when I add all the routes for the benchmarks and build with the --release flag, I'm getting a 200x+ difference:

$ shards build --release
...
$ ./bin/benchmarks
radix("/plaintext") # => true
switch("/plaintext") # => true
 radix   1.94M (516.29ns) (± 3.13%) 220.68× slower
switch 427.44M (  2.34ns) (± 8.93%)        fastest
radix("/json") # => true
switch("/json") # => true
 radix   1.71M (585.75ns) (± 3.79%) 248.90× slower
switch 424.92M (  2.35ns) (± 9.49%)        fastest

$ ./bin/benchmarks
radix("/plaintext") # => true
switch("/plaintext") # => true
 radix   1.94M ( 515.4ns) (± 2.90%) 220.19× slower
switch 427.22M (  2.34ns) (± 8.62%)        fastest
radix("/json") # => true
switch("/json") # => true
 radix   2.08M (479.87ns) (± 3.61%) 205.02× slower
switch 427.24M (  2.34ns) (± 8.77%)        fastest

$ ./bin/benchmarks
radix("/plaintext") # => true
switch("/plaintext") # => true
 radix   1.96M (511.16ns) (± 4.04%) 217.59× slower
switch 425.67M (  2.35ns) (± 9.32%)        fastest
radix("/json") # => true
switch("/json") # => true
 radix   2.08M (480.62ns) (± 4.09%) 205.62× slower
switch 427.82M (  2.34ns) (± 8.83%)       fastest

paulcsmith commented 6 years ago

I'm curious how much of a difference this would make in practice. Radix still pulls off a couple million lookups per second, about half a microsecond each, so it seems like it should make little difference to each request. But maybe it's the object allocations that are the issue. I think it is definitely worth testing out

faustinoaq commented 6 years ago

@luislavena Thank you for your comment! You're right about 🍎 != 🍊; it is very unfair to compare radix to a case statement. Of course radix has cool features like /path/:id and /path/*, etc.

Do you think we can improve radix for simple paths without the special symbols : or * ?

something like:

unless path.includes?(':') || path.includes?('*')
  case path
  when "/simple"
    # handle the static route directly, skipping the tree lookup
  end
end
marksiemers commented 6 years ago

@luislavena

I think comparing a simple case statement with any URL-parsing/matching mechanism will not be fair

Agreed, and I think this particular set of paths is a worst-case scenario for a radix tree:

Tree = Radix::Tree(Symbol).new
Tree.add "/json", :json
Tree.add "/plaintext", :plaintext
Tree.add "/db", :db
Tree.add "/queries", :queries
Tree.add "/fortunes", :fortunes
Tree.add "/updates", :updates

Radix will remain very valuable - especially for a full blown RESTful app with hundreds of routes.

Optimizing for this kind of scenario (simple paths, very high RPS), it seems it is not the best tool.

I'm guessing the compiler optimizations can do a lot with those case statements - likely keeping everything on the stack, whereas the Radix tree would end up on the heap, right?

I have a local repo with the tests, would you like access to it - just in case (no pun intended), it can inform some optimizations in radix?

luislavena commented 6 years ago

Do you think we can improve radix for simple paths without symbols : or * ?

I don't think that will be possible. Radix was originally extracted from an internal project that deals with 280+ routes (mostly GET and POST) and showed great performance compared with a linear approach.

The original design was aimed to build big trees and compare, character by character and return earlier when required.

Adding a separate structure to deal with non-dynamic parts might be tricky, not to mention the cases where /something (static) and /something/:id (dynamic) need to co-exist.
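For readers unfamiliar with the shard, this is the coexistence being described; both routes live in one tree and the lookup disambiguates them (a sketch against the radix shard's public API):

```crystal
require "radix"

tree = Radix::Tree(Symbol).new
tree.add "/something", :static
tree.add "/something/:id", :dynamic

static = tree.find("/something")
dynamic = tree.find("/something/42")

static.payload       # the static route matches exactly
dynamic.payload      # the dynamic route matches and captures the segment
dynamic.params["id"] # => "42"
```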

@marksiemers feel free to submit any code samples or research around Radix directly to the project, here:

https://github.com/luislavena/radix

Cheers.

eliasjpr commented 6 years ago

@luislavena maybe we can extend the current Radix tree implementation to be an Adaptive Radix Tree; see https://github.com/armon/libart and https://db.in.tum.de/~leis/papers/ART.pdf

ART should perform faster than, or as fast as, a hash table without the limitations.

luislavena commented 6 years ago

Thank you @eliasjpr, will take a look to the paper.

Please note that neither a Radix tree nor an ART will, by default, handle dynamic values that need to be extracted from the tree. That is one of the main differences between a simple Radix implementation and a Radix routing library like the one I shared.

Cheers.