Closed: drujensen closed this issue 6 years ago.
Looks like the db results for all the crystal ones are pretty slow. Is this because the crystal db driver is just super slow? Seems like it could/should be much higher
@paulcsmith - See RX-14's comment above about that being a different issue (than framework slowness vs crystal-raw slowness)
That said, I'm looking into it - whether or not prepared statements will help.
But yes, we think either crystal-db or crystal-pg would be the place to address the performance issue.
If you have any thoughts, it merits discussion in crystal's IRC or gitter.
Oh I see. I was linked directly to your comment from elsewhere and missed that discussion. Sorry to muddy up this thread! Thanks for linking me to RX14's answer
@nbrady-techempower I'm running some tests locally in an attempt to figure out how we can fix this.
The tool to see results based on JSON is really helpful, but I don't think amber gets parsed because it hasn't been part of the official results yet. Is there an easy way for me to fix that? (Hacking on the JS, or otherwise?)
results.json - http://sprunge.us/iQXe (pasting it into the viewer here: https://www.techempower.com/benchmarks/#section=test)
I can see crystal-raw and kemal results with no problem, but amber doesn't show up. It is a pain to "manually" compute the stats from the JSON; if we can use that tool, that would be a great help.
@marksiemers A workaround would be renaming Amber to Kemalyst
@marksiemers No, kemalyst doesn't work; another workaround is renaming Kemal to Kemalyst and Amber to Kemal.
Thanks faustinoaq, that should work for now.
@nbrady-techempower - earlier you posted a comparison output that looked like this:
Is the tool that produced that available to us?
@marksiemers No, it's not currently available, but I don't mind taking some screen caps for you if you drop me a few results files to compare.
Sure thing - it may not be until tomorrow. I don't want to send you too many, but I'll try to get the list down to 3-4.
@nbrady-techempower - I just manually compared some test runs (seeing 25-60% improvements in db benchmarks). Thanks for the screen grab offer, but I don't think it'll be necessary.
What I would like to do is run the tests in a way that more mimics your 3 server setup.
I'm most familiar with AWS, is there a guide on how to get the 3 server setup running "in the cloud"?
I'd like to, so we can try to debug why either adding that many cores or adding that internal network latency is causing the slowness.
In my local testing (a VirtualBox VM running on a 2-core MacBook Pro), I got ~450K RPS. Results here: http://sprunge.us/cTAH
If you want to see the results from that test run for amber using that tool, use the JSON below; anywhere you see "phoenix", that is amber: https://www.zerobin.net/?216090c5938d187f#K0gaz8Jxp7MzbjteLW1lSBpT614LSSmdVq4+XiLyIMw=
A single machine setup where localhost is the client, server, and database will definitely get you inflated results. I can get faster results on a virtual box vm for most tests because there's no network latency.
The installation guide should work fine for a three machine setup: http://frameworkbenchmarks.readthedocs.io/en/latest/Development/Installation-Guide/
And these are the specs: http://frameworkbenchmarks.readthedocs.io/en/latest/Project-Information/Environment/
But why would network latency affect these frameworks so much but not crystal? All are evented and use the same core HTTP abstraction. It makes little sense.
Doesn't make sense to me either. I don't know what else to tell you guys.
@RX14 - I can't wrap my head around it either. I can only speculate that the network latency between the DB and the app server causes a build up of database connections which creates memory bloat and/or GC overhead for the frameworks but somehow not for crystal-raw.
For amber, the connection would be held by the ORM through the models that exist - one of the ways it diverges from crystal raw.
In any case, I think our only option at this point is to guess and check on a 3 server setup. My thoughts on possible guesses:
* Not in an isolated environment, running a VirtualBox VM on my laptop: https://github.com/marksiemers/FrameworkBenchmarks/commit/50811f9f8712b2015ef96c073fbdde8f3125a82a
@nbrady-techempower Thank you for your help and patience.
Do you have a schedule for the next preview run and next official run?
@marksiemers forget ENTIRELY about the DB. The frameworks lag in the plaintext test which never even touches the DB.
I propose 3 tests that the techempower guys could run to help us troubleshoot this:
@RX14 It's unlikely that we'll be able to do those for you soon, since our SC environment is currently doing continuous benchmarking. We're trying to get a preview out asap while working on some other things internally. If we do bring the continuous run to a stop at some point, I'll see if I can squeeze that in for you but you would be better off trying to set up a similar environment.
@nbrady-techempower - For the 3 tests proposed by RX14, do you need anything from us, in case the opportunity opens up?
@marksiemers No, but as I pointed out earlier in this thread, amber is not passing in that environment and the logs didn't provide an immediate clue as to why. You're also welcome to make a pull request to change the ports; that would show up in a run over the weekend.
I also haven't been able to reproduce that error, unfortunately.
The good news is that kemal is experiencing the same issue, so if we solve it for kemal, hopefully it will be the same fix for amber.
@nbrady-techempower - Does the PR have to be merged in, or will each open PR be run this weekend?
If each open PR is run, is it ok to open multiple PRs for the crystal frameworks? To test different hypotheses about how to fix the discrepancy between crystal-raw, kemal, and amber.
@marksiemers Unfortunately it has to be merged in, and a continuous run takes about 4 days to fully complete.
Here are the kemal results from the weekend. They look very much in line with the results I posted last week. Looks like the port did not make any difference.
@nbrady-techempower - Thank you for the prompt update. If you have any info to confirm or correct my hypothesis below, please weigh in.
Also, is there any time this week to try any of @RX14's suggestions of htop and/or not running the db tests for kemal and amber?
@RX14 - I know you're going to kill me if I say database one more time, but in the absence of other ideas, what about this hypothesis: the database connections stay referenced by the route handlers (e.g. get(path, &block) for kemal), and these memory problems occur only in the frameworks because those connections stay in scope. Admittedly, this is grabbing at straws, but I'm not sure what else to look for.
AFAIK, setting a connection limit in the database URL does no harm (in local testing).
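For illustration, here is a minimal sketch of what capping the pool would look like with crystal-db/crystal-pg (the host, credentials, and pool values are placeholders, not the benchmark configuration):

require "db"
require "pg"

# Hypothetical connection URL: initial_pool_size and max_pool_size are
# crystal-db pool options; the numbers are only an example.
DB.open "postgres://user:pass@db-host/hello_world?initial_pool_size=8&max_pool_size=56" do |db|
  # Trivial query just to confirm the pool hands out a working connection.
  pp db.query_one("SELECT 1", as: Int32)
end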
@RX14 @sdogruyol @drujensen Given that, does anyone think it is worth a shot (tweaking db connection settings)?
@marksiemers @RX14 Do me a favor if you could. Fork the repo and create a branch for each different test configuration you'd like to test against. Don't open any PR's here for them, just point me to the repo with all the different configurations, and I'll see if I can get you some results for each branch. Please don't modify anything but the tests you'd like to run. The current run should finish tomorrow or Wednesday, and I'll see what I can do for you.
@nbrady-techempower @RX14 @drujensen @marksiemers
One thing that stands out to me when reading through the backlog of comments and the things that have been tried and assumed is this: if this were a GC issue, we should also see the same errors while benchmarking on comparable servers. @drujensen and I tested on a 36-core server and didn't have any issues.
Also, since each core is running its own process, having more cores shouldn't increase the garbage collection issues across all processes, unless of course they somehow used all the memory or something.
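For reference, the crystal entries all follow roughly this per-core shape (a sketch under that assumption, not the actual benchmark code; the port and response body are illustrative):

require "http/server"

# One process per CPU core, all binding the same port with SO_REUSEPORT,
# so the kernel spreads incoming connections across the processes.
System.cpu_count.times do
  Process.fork do
    server = HTTP::Server.new do |context|
      context.response.content_type = "text/plain"
      context.response.print "Hello, World!"
    end
    server.bind_tcp "0.0.0.0", 8080, reuse_port: true
    server.listen
  end
end

sleep # keep the parent alive while the child processes serve requests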
It's weird that adding a tiny bit of local network latency would drop the results so dramatically, since it doesn't with crystal raw.
Just things to think about that might help.
@nbrady-techempower - Sure thing. What timeframe are you expecting?
@marksiemers I can fit it in on Wednesday morning probably.
EDIT: See more recent comment with the updated branch list.
@nbrady-techempower Here is the fork: https://github.com/marksiemers/FrameworkBenchmarks
There are three branches to test if you have time:
I tested all of them locally in verify mode, and everything passed, so hopefully, there won't be any issues running the benchmarks.
I don't know if we want to set the GC_MARKERS to 40. Wouldn't that launch 40 threads per process so 40x40=1600 threads? Or is this setting somehow shared between processes?
What if we set this to 2? 40x2 = 80.
@drujensen - I don't know what we want to do at all. What we've tried up to this point hasn't worked. I'm trying not to out-smart myself, and just trying to gather data about what does and doesn't work.
That said, if you think that will make a difference, I've given you access to that fork. You can add a branch or modify use-half-cpus
@marksiemers ok, i updated the GC_MARKERS=2 for all three frameworks. I also changed Crystal Raw to only use half the processors so we have something to compare against.
@drujensen @marksiemers The current run still has ~11 hours left. If you need to get anything else in to that repo, you've got some time. I'll get you some numbers tomorrow morning.
@RX14 - Any additional ideas?
Both kemal and amber use radix for routing while crystal-raw uses a case statement. It hasn't caused issues in our testing, but maybe it makes a difference in the TechEmpower environment.
@marksiemers you're right, I compared round 15 preview 2 results again and again and I have concluded:
amber and kemal are very similar to crystal-raw on the database tests. I think that's because the database is the main overhead. No problem here 👍 the results are acceptable.
However, on the highest-RPS tests (plaintext, json) crystal-raw has a huge lead over amber & kemal. Maybe the current radix router algorithm implementation by @luislavena is causing some overhead resolving simple paths like /plaintext & /json. I think we can use a handler that checks context.request.path == "/plaintext" instead of the radix algorithm to catch those tests in both kemal & amber (see the sketch below). WDYT @sdogruyol?
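A minimal sketch of that idea, assuming the stdlib HTTP::Handler interface both frameworks build on (the class name and response body here are illustrative, not the actual benchmark code):

require "http/server"

# Short-circuits the hot benchmark path before the radix-based router runs.
class PlaintextShortcut
  include HTTP::Handler

  def call(context)
    if context.request.path == "/plaintext"
      context.response.content_type = "text/plain"
      context.response.print "Hello, World!"
    else
      call_next(context) # everything else falls through to the framework's router
    end
  end
end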
@faustinoaq why do you think it's the radix algorithm? Is there something blocking or doing something that makes you suspect it is causing this issue?
EDIT: Or you can be smart like @faustinoaq and benchmark case vs radix.
@drujensen - Since we can't reproduce the issue, we don't have any "evidence" of what is causing it.
I've been trying to approach it more from a deduction strategy rather than debugging - since we don't really have debugging tools available at this point. Or, if you like, the Gregory House approach - just start treatment and see if it gets better.
All we can do is look at what is different between crystal-raw and the same tests implemented with amber and kemal.
For plaintext and json, that list isn't too big, and routing is on it. For radix, the time it would cost me to learn exactly how that shard works and then speculate about what might go wrong is relatively high.
The cost of coding a handler that responds before the router is touched is relatively low, and the outcome won't be hypothetical - it will be real (well, as real as benchmarks can be).
@faustinoaq - Do you already have in mind how to code this?
@drujensen Yeah, I compared radix vs the switch statement (aka case) used by crystal-raw here, and the results say radix is almost 80x slower than a case statement at finding a simple path like /plaintext:
require "benchmark"
require "radix"
Tree = Radix::Tree(Symbol).new
Tree.add "/plaintext", :plaintext
def radix(path)
  Tree.find(path).found?
end

def switch(path)
  case path
  when "/plaintext" then true
  end
end

pp radix("/plaintext")  # => true
pp switch("/plaintext") # => true

Benchmark.ips do |x|
  x.report("radix") do
    radix("/plaintext")
  end
  x.report("switch") do
    switch("/plaintext")
  end
end
$ crystal build --release --no-debug -s radix_vs_switch.cr
Parse: 00:00:00.0019190 ( 0.19MB)
Semantic (top level): 00:00:00.2499810 ( 28.12MB)
Semantic (new): 00:00:00.0012620 ( 28.12MB)
Semantic (type declarations): 00:00:00.0216570 ( 36.12MB)
Semantic (abstract def check): 00:00:00.0026250 ( 36.12MB)
Semantic (ivars initializers): 00:00:00.0032640 ( 36.12MB)
Semantic (cvars initializers): 00:00:00.0168870 ( 36.12MB)
Semantic (main): 00:00:00.2823110 ( 68.12MB)
Semantic (cleanup): 00:00:00.0010590 ( 68.12MB)
Semantic (recursive struct check): 00:00:00.0008610 ( 68.12MB)
Codegen (crystal): 00:00:00.3289690 ( 68.18MB)
Codegen (bc+obj): 00:00:18.9583230 ( 68.18MB)
Codegen (linking): 00:00:00.1888350 ( 68.18MB)
Codegen (bc+obj):
- no previous .o files were reused
$ ./radix_vs_switch
radix("/plaintext") # => true
switch("/plaintext") # => true
radix 2.12M (471.24ns) (± 6.11%) 87.19× slower
switch 185.03M ( 5.4ns) (±11.41%) fastest
$ ./radix_vs_switch
radix("/plaintext") # => true
switch("/plaintext") # => true
radix 1.91M (524.01ns) (±14.57%) 100.87× slower
switch 192.5M ( 5.19ns) (±11.91%) fastest
$ ./radix_vs_switch
radix("/plaintext") # => true
switch("/plaintext") # => true
radix 2.17M (460.29ns) (±17.94%) 83.85× slower
switch 182.18M ( 5.49ns) (±10.95%) fastest
See more results 👉 https://gist.github.com/faustinoaq/896f821f47711c6c059c5eca23030a42
@faustinoaq @marksiemers Ok, that makes sense. I didn't realize it was that much slower. I understand where you're heading and think the test has some validity. If we find that radix is the cause, we can look at fixing it or replacing it.
Assuming this is our smoking gun, this is a great lesson learned.
In fact, it may inform some design decisions with amber - like our static routes - we may want to bypass or intercept a request before it hits the router.
What is still bizarre is that we didn't see the same slowdown in our tests.
@faustinoaq - What hardware did you run your tests on? Can you reproduce the TFB slowness on that machine?
Interesting findings @faustinoaq.
I think comparing a simple case statement with any URL-parsing/matching mechanism will not be fair, whether that mechanism is powered by Radix or any other library (🍎 != 🍊).
For example, Radix allocates several instances of Char::Reader during lookup which might add GC pressure. In my tests, the difference between Radix and case statements gets reduced by disabling GC (GC.disable).
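A minimal sketch of that check, reusing the radix/switch helpers from the benchmark snippet above and simply turning the collector off before measuring:

# Disable the Boehm GC so allocations made during Radix lookups don't
# trigger collections in the middle of the run.
GC.disable

Benchmark.ips do |x|
  x.report("radix (GC off)") { radix("/plaintext") }
  x.report("switch (GC off)") { switch("/plaintext") }
end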
Definitely there is room for improvement in Radix, as the information here shows.
Cheers.
For the record, when I add all the routes for the benchmarks and build with the --release flag, I'm getting a 200x+ difference:
$ shards build --release
...
$ ./bin/benchmarks
radix("/plaintext") # => true
switch("/plaintext") # => true
radix 1.94M (516.29ns) (± 3.13%) 220.68× slower
switch 427.44M ( 2.34ns) (± 8.93%) fastest
radix("/json") # => true
switch("/json") # => true
radix 1.71M (585.75ns) (± 3.79%) 248.90× slower
switch 424.92M ( 2.35ns) (± 9.49%) fastest
$ ./bin/benchmarks
radix("/plaintext") # => true
switch("/plaintext") # => true
radix 1.94M ( 515.4ns) (± 2.90%) 220.19× slower
switch 427.22M ( 2.34ns) (± 8.62%) fastest
radix("/json") # => true
switch("/json") # => true
radix 2.08M (479.87ns) (± 3.61%) 205.02× slower
switch 427.24M ( 2.34ns) (± 8.77%) fastest
$ ./bin/benchmarks
radix("/plaintext") # => true
switch("/plaintext") # => true
radix 1.96M (511.16ns) (± 4.04%) 217.59× slower
switch 425.67M ( 2.35ns) (± 9.32%) fastest
radix("/json") # => true
switch("/json") # => true
radix 2.08M (480.62ns) (± 4.09%) 205.62× slower
switch 427.82M ( 2.34ns) (± 8.83%) fastest
I'm curious how much of a difference this would make in practice. Radix still pulls off a couple million lookups per second, at about half a microsecond each, so it seems like it should make little difference per request. But maybe it's the object allocations that are the issue. I think it is definitely worth testing out.
@luislavena Thank you for your comment! You're right about 🍎 != 🍊, it is very unfair to compare radix to a case statement. Of course radix has cool features like /path/:id and /path/*, etc. Do you think we can improve radix for simple paths without symbols like : or *?
something like:
unless found_special_symbols_in_path
  case path
  when "/simple"
    # do something
  end
end
@luislavena
I think comparing a simple case statement with any URL-parsing/matching mechanism will not be fair
Agreed, and I think this particular set of paths is a worst-case scenario for a radix tree:
Tree = Radix::Tree(Symbol).new
Tree.add "/json", :json
Tree.add "/plaintext", :plaintext
Tree.add "/db", :db
Tree.add "/queries", :queries
Tree.add "/fortunes", :fortunes
Tree.add "/updates", :updates
Radix will remain very valuable - especially for a full-blown RESTful app with hundreds of routes.
Optimizing for this kind of scenario (simple paths, very high RPS), it seems it is not the best tool.
I'm guessing the compiler optimizations can do a lot with those case statements - likely everything stays on the stack, whereas the Radix tree lookups end up allocating on the heap, right?
I have a local repo with the tests, would you like access to it - just in case (no pun intended), it can inform some optimizations in radix?
Do you think we can improve radix for simple paths without symbols : or *?
I don't think that will be possible. Radix was originally extracted from an internal project that deals with 280+ routes (mostly GET and POST) and showed great performance compared with a linear approach.
The original design was aimed at building big trees, comparing character by character, and returning early when possible.
Adding a separate structure to deal with non-dynamic parts might be tricky, not to mention the cases where /something (static) and /something/:id (dynamic) need to co-exist.
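For context, a small example of that co-existence with the radix shard's current API (the paths and payloads here are just illustrative):

require "radix"

tree = Radix::Tree(Symbol).new
tree.add "/something", :static
tree.add "/something/:id", :dynamic

static = tree.find("/something")
pp static.payload          # => :static

dynamic = tree.find("/something/42")
pp dynamic.payload         # => :dynamic
pp dynamic.params["id"]    # => "42"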
@marksiemers feel free to submit any code samples or research around Radix directly to the project, here:
https://github.com/luislavena/radix
Cheers.
@luislavena maybe we can extend the current Radix tree implementation to be an Adaptive Radix Tree; see https://github.com/armon/libart and https://db.in.tum.de/~leis/papers/ART.pdf
ART should perform as fast as, or faster than, a hash table without the limitations.
Thank you @eliasjpr, will take a look at the paper.
Please note that neither a Radix tree nor ART will, by default, handle dynamic values that need to be extracted from the tree. That is one of the main differences between a simple Radix implementation and a Radix routing library like the one I shared.
Cheers.
Hi Nate,
We are baffled by the results for the 3 crystal projects and hoping you can help shed some light on the 100k RPS results. We were expecting 1.8 to 2.0 million RPS range for these projects.
We have done multiple runs on different-sized systems in AWS trying to replicate the slow response times for plaintext and json in preview 2. @sdogruyol has also run the benchmarks and can confirm similar findings.
I'm attaching the results on a c4.xlarge 8-core system, and all three projects are showing 3x the results of preview 2. We have also run the tests on a 64-core 1.2GHz system with 10x the requests per second (~1 million RPS) of preview 2.
We followed the instructions provided for setting up the Linux environment (Ubuntu 14.04) and cannot determine what the difference is that would cause the slowness.
We think that preview 2 is not correctly reflecting the performance of these 3 projects. It's as if the multiple processes or the reuse_port option were not working properly. What we find interesting is that crystal-raw did not encounter the same issue and had 2.5M RPS, which is in line with our expectations. We are aware of socket errors when there is heavy load, but we are not seeing this slow down the performance to the degree reflected in preview 2.
Any help in solving this mystery is appreciated. If possible, can you rerun the plaintext test for the 3 projects and let us know if you are still seeing 100k RPS results?
Thanks, Dru
results.zip