TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

Amber, Kemal, Kemalyst results in question #2960

Closed drujensen closed 6 years ago

drujensen commented 7 years ago

Hi Nate,

We are baffled by the results for the 3 Crystal projects and are hoping you can help shed some light on the 100k RPS numbers. We were expecting results in the 1.8 to 2.0 million RPS range for these projects.

We have done multiple runs on different-sized systems in AWS trying to replicate the slow response times for plaintext and json in preview 2. @sdogruyol has also run the benchmarks and can confirm similar findings.

I'm attaching the results from a c4.xlarge 8-core system, and all three projects are showing 3x the results of preview 2. We have also run the tests on a 64-core 1.2GHz system and saw 10x the requests per second (~1 million RPS) compared to preview 2.

We followed the instructions provided for setting up the Linux environment (Ubuntu 14.04) and cannot determine what difference would cause the slowness.

We think that preview 2 is not correctly reflecting the performance of these 3 projects. It's as if the multiple processes or the reuse port were not working properly. What we find interesting is that crystal-raw did not encounter the same issue and had 2.5m RPS, which is in line with our expectations.
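
For reference, by "multiple processes / reuse port" we mean the usual pattern of one server process per core, each binding the same port with SO_REUSEPORT. A minimal sketch (illustrative only, not the actual benchmark setup code):

require "http/server"

# Each worker binds the same port with reuse_port, so the kernel spreads
# accepted connections across the processes.
def run_server
  server = HTTP::Server.new do |context|
    context.response.content_type = "text/plain"
    context.response.print "Hello, World!"
  end
  server.bind_tcp "0.0.0.0", 8080, reuse_port: true
  server.listen
end

workers = System.cpu_count
(workers - 1).times { Process.fork { run_server } }
run_server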

We are aware of socket errors under heavy load, but we are not seeing them slow down performance to the degree reflected in preview 2.

Any help in solving this mystery is appreciated. If possible, can you rerun plaintext for the 3 projects and let us know if you are still seeing 100k RPS results?

Thanks, Dru

results.zip

sdogruyol commented 7 years ago

Thanks for this @drujensen :+1: I've also done some small runs and can second this issue, unfortunately I don't have any logs :cry:

NateBrady23 commented 7 years ago

Hi @drujensen! We are investigating some issues now with r15-p2 and are doing some reruns. I'll let you know as soon as I see some results and we will go from there.

NateBrady23 commented 7 years ago

@drujensen @sdogruyol Looks like we found the issue. We've started a new run and amber has completed but it will be some time before we get a full picture for R15-P3. Here are the results:

"plaintext": {
      "amber": [
        {
          "latencyAvg": "23.70ms", 
          "latencyMax": "114.14ms", 
          "latencyStdev": "12.06ms", 
          "totalRequests": 1448302, 
          "startTime": 1504802515, 
          "endTime": 1504802530
        }, 
        {
          "latencyAvg": "87.41ms", 
          "latencyMax": "404.84ms", 
          "latencyStdev": "50.07ms", 
          "totalRequests": 1629658, 
          "startTime": 1504802532, 
          "endTime": 1504802547
        }, 
        {
          "latencyAvg": "336.33ms", 
          "latencyMax": "1.01s", 
          "latencyStdev": "180.85ms", 
          "totalRequests": 1621857, 
          "startTime": 1504802549, 
          "endTime": 1504802564
        }, 
        {
          "latencyAvg": "1.30s", 
          "latencyMax": "4.72s", 
          "latencyStdev": "694.03ms", 
          "totalRequests": 1596471, 
          "startTime": 1504802566, 
          "endTime": 1504802581
        }
      ], 
    "db": {
      "amber": [
        {
          "latencyAvg": "582.47us", 
          "latencyMax": "5.86ms", 
          "latencyStdev": "589.15us", 
          "totalRequests": 252698, 
          "startTime": 1504801882, 
          "endTime": 1504801897
        }, 
        {
          "latencyAvg": "603.34us", 
          "latencyMax": "7.06ms", 
          "latencyStdev": "664.24us", 
          "totalRequests": 506598, 
          "startTime": 1504801899, 
          "endTime": 1504801914
        }, 
        {
          "latencyAvg": "748.65us", 
          "latencyMax": "23.61ms", 
          "latencyStdev": "1.12ms", 
          "totalRequests": 912180, 
          "startTime": 1504801916, 
          "endTime": 1504801931
        }, 
        {
          "latencyAvg": "3.60ms", 
          "latencyMax": "87.26ms", 
          "latencyStdev": "7.44ms", 
          "totalRequests": 950311, 
          "startTime": 1504801933, 
          "endTime": 1504801949
        }, 
        {
          "latencyAvg": "22.57ms", 
          "latencyMax": "198.01ms", 
          "latencyStdev": "38.84ms", 
          "totalRequests": 965665, 
          "startTime": 1504801951, 
          "endTime": 1504801966
        }, 
        {
          "latencyAvg": "93.45ms", 
          "latencyMax": "656.78ms", 
          "latencyStdev": "141.04ms", 
          "totalRequests": 998851, 
          "startTime": 1504801968, 
          "endTime": 1504801983
        }
      ], 
    "update": {
      "amber": [
        {
          "latencyAvg": "18.10ms", 
          "latencyMax": "278.91ms", 
          "latencyStdev": "19.11ms", 
          "totalRequests": 269955, 
          "startTime": 1504802265, 
          "endTime": 1504802280
        }, 
        {
          "latencyAvg": "65.79ms", 
          "latencyMax": "323.45ms", 
          "latencyStdev": "37.01ms", 
          "totalRequests": 59538, 
          "startTime": 1504802282, 
          "endTime": 1504802297
        }, 
        {
          "latencyAvg": "125.15ms", 
          "latencyMax": "435.86ms", 
          "latencyStdev": "51.19ms", 
          "totalRequests": 30681, 
          "startTime": 1504802299, 
          "endTime": 1504802314
        }, 
        {
          "latencyAvg": "184.44ms", 
          "latencyMax": "507.47ms", 
          "latencyStdev": "61.51ms", 
          "totalRequests": 20706, 
          "startTime": 1504802316, 
          "endTime": 1504802331
        }, 
        {
          "latencyAvg": "245.74ms", 
          "latencyMax": "607.97ms", 
          "latencyStdev": "71.19ms", 
          "totalRequests": 15493, 
          "startTime": 1504802333, 
          "endTime": 1504802348
        }
      ], 
    "json": {
      "amber": [
        {
          "latencyAvg": "129.93us", 
          "latencyMax": "2.97ms", 
          "latencyStdev": "74.34us", 
          "totalRequests": 984697, 
          "startTime": 1504802016, 
          "endTime": 1504802031
        }, 
        {
          "latencyAvg": "133.28us", 
          "latencyMax": "2.86ms", 
          "latencyStdev": "77.50us", 
          "totalRequests": 1916676, 
          "startTime": 1504802033, 
          "endTime": 1504802048
        }, 
        {
          "latencyAvg": "186.28us", 
          "latencyMax": "8.57ms", 
          "latencyStdev": "118.22us", 
          "totalRequests": 2707365, 
          "startTime": 1504802050, 
          "endTime": 1504802065
        }, 
        {
          "latencyAvg": "615.00us", 
          "latencyMax": "9.95ms", 
          "latencyStdev": "260.15us", 
          "totalRequests": 1582804, 
          "startTime": 1504802067, 
          "endTime": 1504802082
        }, 
        {
          "latencyAvg": "1.28ms", 
          "latencyMax": "16.54ms", 
          "latencyStdev": "849.91us", 
          "totalRequests": 1610007, 
          "startTime": 1504802084, 
          "endTime": 1504802099
        }, 
        {
          "latencyAvg": "2.33ms", 
          "latencyMax": "35.07ms", 
          "latencyStdev": "1.06ms", 
          "totalRequests": 1668241, 
          "startTime": 1504802101, 
          "endTime": 1504802116
        }
      ], 
    "query": {
      "amber": [
        {
          "latencyAvg": "95.62ms", 
          "latencyMax": "575.93ms", 
          "latencyStdev": "144.30ms", 
          "totalRequests": 1000349, 
          "startTime": 1504802149, 
          "endTime": 1504802164
        }, 
        {
          "latencyAvg": "110.83ms", 
          "latencyMax": "620.93ms", 
          "latencyStdev": "160.90ms", 
          "totalRequests": 200564, 
          "startTime": 1504802166, 
          "endTime": 1504802181
        }, 
        {
          "latencyAvg": "121.41ms", 
          "latencyMax": "691.91ms", 
          "latencyStdev": "164.79ms", 
          "totalRequests": 99223, 
          "startTime": 1504802183, 
          "endTime": 1504802198
        }, 
        {
          "latencyAvg": "131.78ms", 
          "latencyMax": "674.72ms", 
          "latencyStdev": "169.81ms", 
          "totalRequests": 64963, 
          "startTime": 1504802200, 
          "endTime": 1504802215
        }, 
        {
          "latencyAvg": "146.27ms", 
          "latencyMax": "781.06ms", 
          "latencyStdev": "174.94ms", 
          "totalRequests": 48586, 
          "startTime": 1504802217, 
          "endTime": 1504802233
        }
      ], 

Sorry to put the scare in everyone, but this is exactly why we do preview runs!

drujensen commented 7 years ago

Hi Nate,

Thanks for looking into this. Unfortunately, this still looks the same to me. It is showing ~100k requests per second, but I'm expecting 1.8m per second given my own benchmarking. We are getting ~400k per second on an 8-core system.

I’m not sure where to go from here. Any insight would be helpful.

Thanks, Dru

elorest commented 7 years ago

Nate,

To reiterate what @drujensen said above: we're getting 400k r/s after installing the TechEmpower benchmarks and setting them up per the instructions. It seems strange that our 8-core VPS gets faster benchmark numbers than your 80-core server, especially since in our tests we're only about 20% slower than crystal-raw, which your benchmarks show at 2.5 million.

Thanks for your feedback,

Isaac Sloan

NateBrady23 commented 7 years ago

Sorry guys, I misunderstood. Could you paste the benchmark.cfg you are using in the framework root? Are you using a 3-machine setup? Our app server, client, and DB are on separate machines. And have you also tested crystal in your environment?

faustinoaq commented 7 years ago

Hey guys @drujensen @elorest Can you read this comment again?

He is showing us good results for plaintext, at least much better than before and closer to crystal-raw:

"plaintext": {
      "amber": [
        {
          "latencyAvg": "23.70ms", 
          "latencyMax": "114.14ms", 
          "latencyStdev": "12.06ms", 
          "totalRequests": 1448302,   # => 1.44M
          "startTime": 1504802515, 
          "endTime": 1504802530
        }, 
...
        {
          "latencyAvg": "1.30s", 
          "latencyMax": "4.72s", 
          "latencyStdev": "694.03ms", 
          "totalRequests": 1596471,   # => 1.59M
          "startTime": 1504802566, 
          "endTime": 1504802581
        }
      ], 

Also json is good:

    "json": {
      "amber": [
        {
          "latencyAvg": "129.93us", 
          "latencyMax": "2.97ms", 
          "latencyStdev": "74.34us", 
          "totalRequests": 984697,   # => 984K
          "startTime": 1504802016, 
          "endTime": 1504802031
        },
... 
        {
          "latencyAvg": "2.33ms", 
          "latencyMax": "35.07ms", 
          "latencyStdev": "1.06ms", 
          "totalRequests": 1668241,   # => 1.66M
          "startTime": 1504802101, 
          "endTime": 1504802116
        }
      ], 
...

Thank you @nbrady-techempower for your support! :heart:

sdogruyol commented 7 years ago

@faustinoaq that's total requests

1.44M / 15 ~= 100K RPS
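
Sanity-checking that arithmetic in Crystal (the 15-second window is endTime - startTime from the results above):

# totalRequests over a 15-second wrk window
total_requests = 1_448_302
duration = 1504802530 - 1504802515 # => 15 seconds
puts total_requests / duration     # => roughly 96_500 requests per second, i.e. ~100K RPS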

faustinoaq commented 7 years ago

@sdogruyol My bad :sweat_smile:

NateBrady23 commented 7 years ago

@elorest @drujensen Can you confirm whether you're on a 1- or 3-machine setup and, specifically, what your ULIMIT max is? We've set a very high ULIMIT on these machines, and that specifically helped the performance of a lot of frameworks. After looking closely at the wrk results you zipped up and the ones being generated from our environment, it seems like you may have a single-machine setup with reduced latency and a smaller ULIMIT, which stunted crystal-raw.

drujensen commented 7 years ago

Nate,

We only used a single machine and the default ulimit.

We will rerun on two boxes, since we are not concerned with the 4 db tests, and raise the ulimit to the max.

I doubt that crystal-raw was somehow limited by ulimit, but we will confirm.

The question is why running on one box would be faster than two, and why we are seeing 4x faster results on 8 cores vs 80 cores, but we will confirm.

elorest commented 7 years ago

We ran on a single VPS with other machines forwarded to 127.0.0.1 in the hosts file. Ulimit said unlimited.

NateBrady23 commented 7 years ago

@drujensen Running wrk from a client against a server on a different machine simulates a real request over the internet. Hitting localhost with wrk on the same machine is going to give you much better results.

I don't think there's such a thing as ulimit=unlimited.

Also, as another note, the client running wrk has 8 HT cores. The server hosting your application has 80 HT cores.

faustinoaq commented 7 years ago

Running wrk from a client against a server on a different machine simulates a real request over the internet.

@nbrady-techempower I think :point_up_2: is reasonable, but the thing here is that the plaintext and json results for Kemal and Amber seem to be too far off from the crystal-raw results.

So, if crystal-raw is getting 1.5M RPS, we are expecting at least 50% of that (say 700K RPS), not 10% as preview 2 is showing us (100K RPS). I think we can find the problem :smile:

BTW, thank you a lot for maintaining this awesome project! :heart:

NateBrady23 commented 7 years ago

@faustinoaq I understand what you're saying. I can give you other environment information as you need it. There must be something that crystal-raw is taking advantage of in that environment. I'm not sure what without digging deep into each framework, which unfortunately I can't do at the moment. But let me know if there's anything else I can get you.

drujensen commented 6 years ago

@nbrady-techempower One theory we have is that we have too many GC threads running. Crystal by default launches 16 threads per process.

Crystal has a configuration setting to reduce the number of GC threads, but unfortunately we don't have a way of checking whether this is the issue, since we can't reproduce it ourselves.
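
My working assumption (not confirmed) is that the relevant knob is bdwgc's GC_MARKERS environment variable, since Crystal links against bdwgc and its parallel marker pool is what accounts for those 16 threads. A tiny snippet just to surface what a worker process sees:

# Hedged illustration only; GC_MARKERS being the setting in question is an assumption.
# e.g. each worker could be launched with:  GC_MARKERS=1 ./server
if markers = ENV["GC_MARKERS"]?
  puts "GC marker threads limited to #{markers}"
else
  puts "GC_MARKERS not set; bdwgc will choose its own marker thread count"
end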

Is there a way we can make the changes and then request a run of amber and kemal to determine if these changes fix the slowness before another round?

NateBrady23 commented 6 years ago

@drujensen Absolutely! We actually have some downtime at the moment because Server Central is moving our environment to a different location, so feel free to open a PR with those changes as there will be a few more preview runs before official results are released.

drujensen commented 6 years ago

@nbrady-techempower Thanks for your quick response! Can you run this PR against the servers and provide results?

foliot commented 6 years ago

I notice that for Kemal and Amber, it looks like you are converting the result to a JSON string and returning that, which then has to be written to the response IO later. The standard Crystal version writes directly to the output IO instead of constructing an intermediate string. Could that explain the discrepancy?

elorest commented 6 years ago

@foliot You might be on to something. I could especially see the queries one being affected by that.

amber

  def queries
    response.content_type = JSON
    queries = params["queries"]
    queries = queries.to_i? || 1
    queries = queries.clamp(1..500)

    results = (1..queries).map do
      if world = World.find rand(1..ID_MAXIMUM)
        {id: world.id, randomNumber: world.randomNumber}
      end
    end

    results.to_json
  end

vs crystal

  when "/queries"
    response.status_code = 200
    response.headers["Content-Type"] = "application/json"

    JSON.build(response) do |json|
      json.array do
        sanitized_query_count(request).times do
          random_world.to_json(json)
        end
      end
    end

elorest commented 6 years ago

Although I'm pretty sure the hello world routes are also being slow and breaking pipes.

elorest commented 6 years ago

@drujensen what do you think?

faustinoaq commented 6 years ago

@elorest See https://github.com/TechEmpower/FrameworkBenchmarks/pull/2891

@sdogruyol tried it on Kemal before

RX14 commented 6 years ago

Broken pipe is a misleading error; it happens when wrk shuts down and closes the TCP connection while Crystal is still writing a response. It's not going to affect performance.

foliot commented 6 years ago

@faustinoaq Ah yeah, I thought I remembered @sdogruyol trying that. It looks like the problem was that he wrote the JSON to the IO in the handler while also returning a value that the framework would write to the IO, and writing to the IO twice caused some sort of problem.

Couldn't you return a wrapper object with a to_str(io) method that would do nothing but call results.to_json(io)? Then there should be no intermediate string created. Unless, of course, the framework also creates an intermediate string from the return value instead of writing it straight to IO...

You could standardize this by making various response classes such as JSONResponse and PlainTextResponse so you would just have to return JSONResponse.new results. You could also have a content_type method on the response object so you wouldn't have to set that manually in the handler.
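
A minimal sketch of that idea (JSONResponse is hypothetical, not an existing Amber/Kemal class, and it assumes the framework calls to_s(io) on the return value with the response IO):

require "json"

# Hypothetical wrapper: serializes straight to the response IO, so no
# intermediate String is built for the body.
struct JSONResponse(T)
  def initialize(@value : T)
  end

  def content_type : String
    "application/json"
  end

  def to_s(io : IO) : Nil
    @value.to_json(io)
  end
end

# In a handler (sketch):
#   JSONResponse.new({id: world.id, randomNumber: world.randomNumber})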

foliot commented 6 years ago

Now that I look at it again, it doesn't look like it's the JSON benchmarks where Kemal is struggling.. Maybe the IO stuff isn't much of a performance impact after all.

EDIT: Never mind.. that is one of the ones it's struggling on. I'm misreading things.

elorest commented 6 years ago

There's a ticket where we're working on that. We do have an issue, though: we can't write to the response if we care about handler fall-through, because once you write the response it can't be modified. If this were the issue, we could probably find a way around it for the benchmarks.

faustinoaq commented 6 years ago

Unless, of course, the framework also creates an intermediate string from the return value instead of writing it straight to IO...

@foliot You can avoid the return value by using Nil as the return type:

def foo : Nil
  "foo"
end

pp foo # => nil

https://carc.in/#/r/2zd4

Update: I misread the comment 😅

There is also the NoReturn type, but it is used only for low-level stuff.

faustinoaq commented 6 years ago

Another interesting reference: @RX14's comments about the GC on IRC:

<RX14> if you have 1 process with 80 threads you have 1 GC which pauses every thread
<RX14> does it's work
<RX14> and then finishes
<RX14> if you have 80 processes you have 80 seperate GCs
<c-c> ok, I'm not entirely lost
<RX14> which means much higher overhead
<RX14> especially when the off-the-shelf GC we uses uses concurrent garbage collection
<RX14> so suddenly your singlethreaded process is trying to use 16 threads to GC 1 thread
<RX14> which is... suboptimal
<RX14> it muddies the scheduler

NateBrady23 commented 6 years ago

Hey guys, wanted to let you know we have a run going. However, amber did fail. I haven't been able to dig into the logs, so here's the output if you want to take a look: http://sprunge.us/PXUM

Also, here's a quick snapshot of Kemalyst vs last round: [image: Kemalyst results compared to the previous round]

I'll drop some of the raw json data here a little later today

drujensen commented 6 years ago

@nbrady-techempower Thanks for the results. This didn't seem to help.

@faustinoaq @RX14 I think at this point we need to consider reducing the number of processes we launch. Currently we are launching one process per core. WDYT about launching 1/2 to 3/4 of that, so one process for every 2 cores?
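
A sketch of just the worker-count math (illustrative, not the current setup code):

# Illustrative: one worker process per two cores instead of one per core.
total_cores  = System.cpu_count              # e.g. 80 on the TFB app server
worker_count = Math.max(1, total_cores // 2)
puts "would launch #{worker_count} worker processes"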

RX14 commented 6 years ago

@nbrady-techempower The results of crystal-raw compared to the previous round would be great.

We know that Crystal itself performs well; perhaps you should focus on reducing the garbage allocation throughput inside your frameworks first.

NateBrady23 commented 6 years ago

Here's a link to the results.json for the run still in progress. http://sprunge.us/HNHc

RX14 commented 6 years ago

@nbrady-techempower Can you please post a link to the page on the website where you can enter the results JSON and visualize it? I seem to have lost my link.

NateBrady23 commented 6 years ago

@RX14 https://www.techempower.com/benchmarks/#section=test

RX14 commented 6 years ago

@nbrady-techempower it'd be great if there was a link to that somewhere on the site :)

msmith-techempower commented 6 years ago

@RX14 Fair point

NateBrady23 commented 6 years ago

@RX14 I added it to the README here, which is the best I can do at the moment. I'll make a note of that for the site.

RX14 commented 6 years ago

Looks like crystal-raw got a pretty good boost there, but all the frameworks still perform terribly. That's very weird; perhaps some more in-person debugging is required. A basic check would be to run only the kemal test and check htop to see whether all the CPU cores are utilised. Apart from that, I can't think what the problem could be that makes it perform 20x slower...

elorest commented 6 years ago

@RX14 It's interesting, because if I run the crystal, kemal, and amber tests locally they all come out about the same, with crystal-raw being about 5% faster. It seems that benchmarking from another computer over the network is what changes that somehow. Not sure why crystal-raw would handle it fine but amber and kemal would blow up, though.

marksiemers commented 6 years ago

Two points:

  1. Making the issue reproducible
  2. Database calls

Making the issue reproducible

@elorest @drujensen @sdogruyol What is holding us back from creating the same environment so we can get a feedback loop on fixing things? Is it because it is custom hardware rather than an AWS instance?

@nbrady-techempower - Are there tests planned for AWS (or other providers) - that would be easier to reproduce on our own?

Database calls

TL;DR If this wasn't already done, using prepared statements for the database queries should really help performance on those tests.
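
For reference, a minimal sketch of a prepared single-query lookup with crystal-db/crystal-pg (the connection string, table, and column names are assumptions based on the TFB schema, not taken from this thread; bind arguments use prepared statements by default, and the prepared_statements option just makes that explicit):

require "json"
require "db"
require "pg"

DB.open("postgres://benchmarkdbuser:benchmarkdbpass@localhost/hello_world?prepared_statements=true") do |db|
  id = rand(1..10_000)
  world_id, random_number = db.query_one(
    "SELECT id, randomnumber FROM world WHERE id = $1", id,
    as: {Int32, Int32}
  )
  puts({id: world_id, randomNumber: random_number}.to_json)
end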

@nbrady-techempower - Are we allowed to use any indexing/tuning strategies that we want on the database?

@RX14 @elorest @drujensen @faustinoaq @sdogruyol crystal-raw is way faster until a database is involved. It is in the 80%+ range, then drops into the 40% range for reads, then 13.5% for writes. I think this needs to be addressed.

A few questions about this:

  1. Does this mean crystal-raw could be 2x-20x faster for the database tests?
  2. Has the database been tuned in any way? Prepared statements, indexing, etc.?
  3. If these numbers were to be improved, would it have to be in crystal-pg or crystal-db? Does anyone have time to set up a MySQL test?
    
    Single Query:
    crystal | 77,224 | 100.0%(40.8%) 
    kemal (postgresql) | 75,806 | 98.2%(40.0%)

    Multiple Queries:
    kemal (postgresql) | 3,781 | 100.0%(41.2%)
    crystal | 3,745 | 99.0%(40.8%)

    Fortunes:
    crystal | 88,041 | 100.0%(49.4%)
    kemal (postgresql) | 82,572 | 93.8%(46.4%)

    Data Updates:
    crystal | 1,155 | 100.0%(13.5%)
    kemal (postgresql) | 1,141 | 98.8%(13.3%)

With no db involved, crystal-raw is 2x-20x faster, as @RX14 mentioned:

    Plaintext:
    crystal | 2,134,425 | 100.0%(83.2%)
    kemal (postgresql) | 111,917 | 5.2%(4.4%)

    JSON Serialization:
    crystal | 528,217 | 100.0%(85.5%)
    kemal (postgresql) | 192,474 | 36.4%(31.1%)

msmith-techempower commented 6 years ago

Are we allowed to use any indexing/tuning strategies that we want on the database?

No

Has the database been tuned in any way? Prepared statements, indexing, etc.?

Only primary keys on the index column for each table. Here is the setup for the MySQL database and tables, for reference.

Are there tests planned for AWS (or other providers) - that would be easier to reproduce on our own?

Microsoft Azure, though I am not sure of the specs off hand

marksiemers commented 6 years ago

@msmith-techempower - It looks like we can use prepared statements though:

Use of prepared statements for SQL database tests (e.g., for MySQL) is encouraged but not required.

Source: https://www.techempower.com/benchmarks/#section=code

Is that the only thing allowed? Is it allowed for postgres? What about postgresql functions?

NateBrady23 commented 6 years ago

@marksiemers As far as the specifications go, we can't predict everything someone will come up with to get an edge, but we've done our best to capture everything we don't allow. If you have questions about anything in particular, or if something seems unclear please let us know.

As far as Azure, like @msmith-techempower we don't have specs for those environments off-hand, and to be honest, we're not sure when exactly those will be live.

In the meantime I'm happy to get you any information about our SC environment that we haven't already captured.

RX14 commented 6 years ago

@marksiemers There are 2 issues here: optimizing crystal-raw, and reproducing the problems the frameworks have keeping up with crystal-raw. Please keep these issues separate. Optimizations for crystal-raw would be appreciated (I think the bottleneck is much more around the unoptimized DB driver and connection pool, along with having a connection pool for each of 80 processes), but it's off-topic and distracting for this thread.

It looks like we need to devise some tests for TechEmpower to run on their hardware, since they're the only ones who can reproduce this. I offered my suggestion: a quick test to see whether all CPU cores are fully utilized.

marksiemers commented 6 years ago

Just ran the benchmarks locally for crystal-raw, kemal, and amber.

results.json here: http://sprunge.us/iQXe

All passed successfully. (based on commit d7ab76a27e7f830923f8729673ed6ec7fa09f662)

Plaintext:
crystal-raw: 419,614 (100%)
amber:       278,086 (66.2%)
kemal:       245,115 (58.4%)

JSON
crystal-raw: 268,346 (100%)
amber:       206,898 (77.1%)
kemal:       182,925 (68.1%)

db
crystal-raw: 18,625 (100%)
amber:       17,783 (95.4%)
kemal:       16,646 (89.3%)

This is in line with our expectation of at least 50% of crystal-raw's performance.

I looked through the code and the setup.sh for crystal-raw and amber. Nothing stands out to me.

It looks like with Server Central these are 40-core machines for the app server. Not that it should have an effect on this crystal-vs-framework discrepancy, but if we are thinking about not using all cores, 40 should be the starting point.

drujensen commented 6 years ago

@marksiemers That is correct and is why our expectation is ~1m requests per second for Amber/Kemal on plaintext and json. We have seen ~1m requests per second using Amber on smaller boxes.

@RX14 is right in that both frameworks are allocating more memory than crystal-raw, and we could do better at memory management in both frameworks. It may be that the cause is the GC blocking to perform cleanup more often, but I would have expected the same poor performance on smaller boxes, which is not the case.

faustinoaq commented 6 years ago

@drujensen I think the combination of cross-box latency and GC overhead could be the clue to the Crystal frameworks' slowdown.

Maybe filling all cores isn't a good idea; maybe the OS also needs some free resources to perform other operations.

drujensen commented 6 years ago

@faustinoaq I think it's worth a try. Let's divide the number of processes by 2 and see if that makes an improvement. Do you want to attempt making the change?

marksiemers commented 6 years ago

Let's binary search the right number of cores!

It is still weird that it doesn't reproduce, but definitely worth a shot.