TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

wrk is the bottleneck for plaintext test and json test #5207

Open Xudong-Huang opened 4 years ago

Xudong-Huang commented 4 years ago

As we can see, the top results of the plaintext and json tests are so close that wrk has become the bottleneck of those tests.

I think we should saturate the server, not the client, so that we can see the real capability of the server.

To achieve that, should we give more CPU cores to wrk?

Another solution could be to simply run wrk and the server on the same host, so that the whole system is kept busy.

sebastienros commented 4 years ago

An option we have tried successfully is to use the database server as a second load generator for plaintext and json. It requires synchronization, but it works, and showed that ulib is more than twice as fast as the current numbers.

zloster commented 4 years ago

@Xudong-Huang See https://github.com/TechEmpower/FrameworkBenchmarks/issues/3538 and https://github.com/TechEmpower/FrameworkBenchmarks/issues/4480. 10 Gbps network is basically the bottleneck currently.

Xudong-Huang commented 4 years ago

@zloster Thanks for the info. I just tested in the Azure cloud and noticed that the wrk server is at 100% CPU while the HTTP server is only at about 50%.

fredrikwidlund commented 4 years ago

In any case the benchmark is unable to actually benchmark the server implementations. In the plaintext/physical benchmark there does seem to be a hard limit (network?), which means that the differences between the top candidates are probably down to things like varying response length. In the json benchmark, last I looked, wrk seemed to be the bottleneck.

bhauer commented 4 years ago

We are still discussing the load generation bottleneck. We are considering a few different options:

  1. Where possible, "up-rank" the load generator hardware to compensate. In Azure, for example, this should be a fairly easy matter of using a larger instance type for the load generator than we use for the application and database servers. This is not as feasible in the physical hardware environment.
  2. "Down-rank" the application server (and possibly the database server). In Azure, this is functionally equivalent to the above option. This may also be possible in the physical environment by constraining how many CPU cores are made available for the application server. We could elect to do the same for the database server, but we might not because we're not especially interested in measuring the performance of database servers.
  3. Investigate if higher-performance load generators are available.
  4. Double up the load generation by using the available database server as a second load generator, running another instance of wrk. This would only apply to the Plaintext and JSON test types, so it would not impact database performance. This would have to be coordinated by the toolset and would need some process to merge the results from the two instances of wrk.

talawahtech commented 4 years ago

My 2 cents from running into this issue.

1) I searched high and low, but was never able to find a more efficient load generator than wrk. I even tried some naive optimizations to the wrk source code like disabling parsing but saw very little improvement (maybe 5-10%). If a better load generator is found I would definitely love to know about it.

2) I ended up needing at least a 2:1 ratio between the client and server specs in order for the client to not be the bottleneck for the top json and plaintext tests.

sebastienros commented 4 years ago

@talawahtech The wrk limitation on plaintext probably comes from the fact that pipelining is handled by a Lua script. In its early versions this was a feature that was handled natively. Earlier this year the script was optimized, and when the switch was made I could see much better perf, though still bottlenecked. Based on that, I think we should try to update wrk to add back native pipeline support.

However this won't solve the json scenario. Our current approach on the ASP.NET team is to use two client machines (as mentioned by @bhauer too), which in this case solves every non-database scenario ... except for Ulib, which surprisingly still brings two clients to their knees.

As for the simplest solution, I think it's definitely to decrease the number of available cores on the server, as Docker supports it out of the box. But I haven't gathered numbers yet.

talawahtech commented 4 years ago

@sebastienros I may be mistaken, but I am not seeing a version of wrk that has native support for pipelining. The example that they provide in their scripts directory is pretty much the same approach as TechEmpower.

According to the docs, since the work involved in setting up the pipelined request happens in the init() function (which is only called once), the performance impact should be minimal.
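The point about init() can be sketched outside of Lua: the pipelined payload is built once up front, and every subsequent send just reuses the same buffer. The names and pipeline depth below are illustrative, not wrk internals:

```python
# Sketch: the setup cost of pipelining is paid once, as in wrk's init().
# PIPELINE_DEPTH and the request template are illustrative values.
PIPELINE_DEPTH = 16

request = (
    "GET /plaintext HTTP/1.1\r\n"
    "Host: server\r\n"
    "\r\n"
)

# Done once, analogous to init(): concatenate N requests into one buffer.
pipelined = (request * PIPELINE_DEPTH).encode("ascii")

def next_payload():
    # Done per send, analogous to request(): return the prebuilt buffer as-is.
    return pipelined
```

Since next_payload() does no string work at all, the per-request overhead of script-driven pipelining should indeed be small.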

talawahtech commented 4 years ago

@sebastienros how do you guys aggregate the results from the two different machines running wrk? If it is something that you can share I would love to take a look.

sebastienros commented 4 years ago

Before lua support was added to wrk, it could do pipelining. Take a look at these PRs: https://github.com/wg/wrk/pull/57 https://github.com/wg/wrk/pull/36

sebastienros commented 4 years ago

We have a jobs queue on both machines, and a "driver" app that orchestrates them; once the two jobs are ready to start, we send the command simultaneously, then aggregate the results from wrk. The obvious issue is that we need to trust that both results were issued in the same time frame. So it's not as precise as having a single instance, but the results were consistent when I tried it on different target frameworks.
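A minimal sketch of the merge step, assuming each wrk instance reports a request count, requests/sec, an average latency, and a max latency (the field names are my own, not those of the actual driver): throughput adds, average latency is weighted by request count, and max latency is the max of the two.

```python
# Hypothetical merge of results from two simultaneous wrk runs.
def merge_wrk_results(a, b):
    total_requests = a["requests"] + b["requests"]
    return {
        "requests": total_requests,
        # Throughput from two independent clients simply adds.
        "rps": a["rps"] + b["rps"],
        # Average latency is weighted by how many requests each client made.
        "avg_latency_ms": (
            a["avg_latency_ms"] * a["requests"] + b["avg_latency_ms"] * b["requests"]
        ) / total_requests,
        # Worst-case latency is the worse of the two.
        "max_latency_ms": max(a["max_latency_ms"], b["max_latency_ms"]),
    }

client1 = {"requests": 1_000_000, "rps": 66_000.0, "avg_latency_ms": 1.9, "max_latency_ms": 35.0}
client2 = {"requests": 900_000, "rps": 60_000.0, "avg_latency_ms": 2.1, "max_latency_ms": 41.0}
merged = merge_wrk_results(client1, client2)
```

Note that percentiles cannot be merged this way; recovering a combined latency distribution would require the raw histograms from both clients.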

zloster commented 4 years ago

As for the simplest solution, I think it's definitely to decrease the number of available cores on the server, as Docker supports it out of the box. But I haven't gathered numbers yet.

There are several tools on Linux that provide this functionality. numactl is one of them. A quick search gave me this short overview.

fredrikwidlund commented 4 years ago

One important point is that the node creating the load (currently wrk) should have more resources than the server.

Lowering the number of cores should work, but care should be taken to choose the same cores each time, and perhaps to avoid sharing the same physical CPU among logical CPUs (hyperthreads), and so forth.

Multiple worker nodes make a lot of sense. To avoid overlap timing issues one could add a margin of one second or so when starting and stopping, and exclude the margin from the result.
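The margin idea can be sketched as follows, assuming each client records per-second request counts; the one-second window at each end, where the clients may not yet (or no longer) overlap, is simply dropped before aggregating. The numbers are made up for illustration:

```python
MARGIN_SECONDS = 1  # ramp-up/ramp-down window to exclude at each end

def trimmed_rps(per_second_counts, margin=MARGIN_SECONDS):
    """Average requests/sec over the run, excluding the margin at both ends."""
    steady = per_second_counts[margin:-margin or None]
    return sum(steady) / len(steady)

# Two clients that did not start at exactly the same instant: the first and
# last seconds are partial, but the middle of each run is steady state.
client_a = [10_000, 65_000, 66_000, 64_000, 65_000, 20_000]
client_b = [5_000, 60_000, 61_000, 59_000, 60_000, 30_000]
combined = trimmed_rps(client_a) + trimmed_rps(client_b)
```

Trimming makes the combined number insensitive to a sub-second skew in start/stop times, at the cost of discarding two seconds of data per client.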

I measured libreactor way back and needed two load-generating nodes with the same spec as the server to saturate it, so 2:1 makes sense.

talawahtech commented 3 years ago

FYI Fredrik also created a high-performance benchmarking tool called pounce. I've only done some preliminary testing, but I saw a 20-25% improvement compared to wrk.

dralley commented 2 years ago

I would recommend using rewrk

For other reasons as well -- wrk has flaws in the way it does latency measurements, which cause results to be incorrect.

wrk's model, which is similar to the model found in many current load generators, computes the latency for a given request as the time from the sending of the first byte of the request to the time the complete response was received.

While this model correctly measures the actual completion time of individual requests, it exhibits a strong Coordinated Omission effect, through which most of the high latency artifacts exhibited by the measured server will be ignored. Since each connection will only begin to send a request after receiving a response, high latency responses result in the load generator coordinating with the server to avoid measurement during high latency periods.

wrk2 fixed that problem (and does a thorough job of explaining it; I pulled that quote from there), but rewrk also takes it into consideration while adding support for HTTP/2 and many other benefits.
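The correction wrk2 applies can be illustrated with a toy constant-rate model: latency is measured from the time each request was *scheduled* to be sent, not from when the stalled connection actually sent it. All numbers below are made up; one request is intended every 10 ms and the server stalls once for 100 ms:

```python
# Toy illustration of coordinated omission on a single closed-loop connection.
INTERVAL_MS = 10.0  # intended constant rate: one request every 10 ms

# The first request stalls for 100 ms; the connection can only send the
# next requests after it completes, so their actual send times slip.
actual_send = [0.0, 100.0, 101.0, 102.0]
completion = [100.0, 101.0, 102.0, 103.0]

# wrk-style (naive): completion minus actual send time.
# Only the first request appears slow; the stall's effect on the others is hidden.
naive = [done - sent for sent, done in zip(actual_send, completion)]

# wrk2-style (corrected): completion minus *intended* send time.
# The stall is charged to every request it delayed.
intended_send = [i * INTERVAL_MS for i in range(len(completion))]
corrected = [done - sched for sched, done in zip(intended_send, completion)]
```

Under the naive model three of the four requests look like 1 ms responses; under the corrected model they carry 91, 82, and 73 ms of queueing delay, which is what a client sending at a fixed rate would actually have experienced.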