Woof, wrk2 seems to be the same as wrk but with more output and the potential to test throughput as opposed to raw load (as we do now).
Pinging @hamiltont and @bhauer to get some opinions.
In terms of portability, it looks pretty simple at a high level. The same values we are searching for are present in the HUGE output. The raw files would become MUCH larger, but this would give us more data to parse.
I will spin a branch today and see how possible this is and get some data to @hamiltont to see what he thinks (he generally is the one to tell me whether the pile of numbers I'm looking at make any sense or not).
I'm all for it. Not for Round 10; we have already made the Round 10 hurdle way too high and we're still struggling to clear it. But for Round 11, I'd like to be able to capture and render more detailed information.
Looks really fascinating, thanks for linking to this project. I don't have time to dig through all of the thread on this coordinated omission stuff, but here are some unsorted thoughts:
It sounds like wrk2 focuses on measuring latency at a fixed request rate (the --rate parameter) whereas wrk focuses on stress testing. Can others confirm/deny this? If this is true, it's a pretty big change from wrk.
Larger raw files should be of no concern (as long as we're not talking 1-2 GB extra per test). IMO, we should always opt to collect more data, and especially so for new kinds of data. Each new kind of data is a synergistic increase in dataset value, and I'd much rather see 20 GB of extra data for each round than force someone to run the round again because we didn't collect enough data for their use case.
Great - it's partly about more detailed data, but also better data. Reporting a percentile for response time rather than the average would be a step forward.
It will certainly be interesting to pore over the additional detail, for example "data that shows the tool with the highest sustained throughput at a 99.9%ile response time of X ms". A lot of the test results look like they would fit a USL (universal scalability law) model; a lot don't though, suggesting that the selected concurrency levels may need reviewing.
In terms of test types (throughput vs. "raw load", or load vs. stress test): workloads for volume perf tests can be split into two over-simplified types which map back to queuing theory - open and closed queuing circuits (Gunther 2005, @DrQz). Open == arrival rate == your "throughput/raw load"; closed == concurrency == your "stress".
If most of these frameworks are facing the public internet then they will likely have an open arrival rate applied to them that is independent of their ability to handle it; requests continue to arrive at the same rate however badly the SUT performs. With a closed concurrency workload the request arrivals back off when the SUT degrades, which is one major source of the co-ordinated omission issue.
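For concreteness, here is a minimal, hypothetical simulation (not part of the benchmark toolset; the stall model and numbers are made up) contrasting the two: the open-loop generator measures from each request's scheduled arrival time, while the closed-loop generator only issues the next request after the previous one returns, so a server stall collapses into a single long sample.

```python
# Hypothetical sketch: open vs. closed load generation during a 2-second
# server stall. Parameters are invented purely for illustration.

STALL_START, STALL_END = 1.0, 3.0   # server is unresponsive in this window
NORMAL = 0.001                       # 1 ms normal service time

def complete(send_time):
    """Completion time for a request sent at send_time (seconds)."""
    if send_time < STALL_START:
        return send_time + NORMAL
    if send_time < STALL_END:
        return STALL_END + NORMAL    # queued behind the stall
    return send_time + NORMAL

# Open workload: requests arrive every 10 ms for 5 s regardless of the SUT;
# latency is measured from the intended arrival time.
open_lat, t = [], 0.0
while t < 5.0:
    open_lat.append(complete(t) - t)
    t += 0.01

# Closed workload: one synchronous client; the next request is only sent
# after the previous response arrives, so arrivals back off and the backlog
# an open arrival stream would have experienced is never measured.
closed_lat, t = [], 0.0
while t < 5.0:
    done = complete(t)
    closed_lat.append(done - t)
    t = done

for name, lat in (("open", open_lat), ("closed", closed_lat)):
    lat = sorted(lat)
    p999 = lat[int(0.999 * (len(lat) - 1))]
    print(f"{name:6s} samples={len(lat):5d} p99.9={p999:.3f}s max={lat[-1]:.3f}s")
```

Running it shows the open measurement reporting a multi-second 99.9%ile while the closed measurement reports about a millisecond, despite the identical server behaviour.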
@giltene may be able to comment on the readiness or not of wrk2.
Regarding the load vs. stress testing question: wrk2 can be trivially used in stress test mode by setting --rate to e.g. 100000000. The same exact stress level as wrk will then occur, and the latency reports will show abysmal behavior (you will basically see latencies as long as the test run length). While some would contend that such latency output is meaningless (and I agree), it's no more meaningless than what wrk currently outputs. Any true saturation test will, by definition, report meaningless latency distribution numbers. If it doesn't, it has not really achieved saturation, and is effectively a rate test running at some unknown but very real (loader-limited) rate. wrk2 simply lets you specify what that rate will be.
The "proper" way to use wrk2 is probably to slew rates across runs to determine the sustainable throughout the tested system can sustain at some acceptable latency levels.
Regarding the comparative question (about comparing wrk and wrk2 for setups that do not exhibit CO): I think that's fundamentally impossible, because by its very nature (all-out, pedal-to-the-metal saturation test) wrk will currently experience CO in all runs. That doesn't make it invalid as a stress tester. It just makes the latencies it reports invalid.
Regarding readiness: wrk2 is about 4 days old, and I haven't stress tested it much myself. Looking for others to do that. I also hope to push its work back into wrk over time, as optional flags.
@giltene Your response answers everything I needed to know (and clarified some other questions I've been pondering), thanks!!
Based on this thread, I also say it's worth giving wrk2 a shot once R10's out. The main issue then is planning how to replace wrk with wrk2.
Currently TFB runs tests using this pattern:
wrk -d 5 -c 8 --timeout 8 -t 8 ...
wrk -d 15 -c 256 --timeout 256 -t 256 ...
for c in {8,16,32,64,128,256} run wrk -d 15 -c $c --timeout $c -t $c
All of the benchmark results are focused on max throughput, and I think it's important to maintain that primary metric going forward. If I'm understanding @giltene correctly, we can do something such as this:
wrk -d 5 -c 8 --timeout 8 -t 8 ...
wrk -d 15 -c 256 --timeout 256 -t 256 ...
for c in {8,16,32,64,128,256}:
# (1) Find max throughput (ignore useless latency values)
max_throughput = wrk2 -d 15 -c $c --timeout $c -t $c --rate 100000000000
# (2) Find latency distribution at max throughput
latency_distribution_at_max = wrk2 -d 15 -c $c --timeout $c -t $c --rate max_throughput
# (3) [Optional] Find throughput and latency distribution at a few standard latency levels
for acceptable_latency in {1, 10, 100, 1000}:
tput_and_latency_distrib = wrk2 -d 15 -c $c --timeout $c -t $c --rate acceptable_latency
(1) and (2) are the data that we are currently trying to collect using wrk. Hopefully (1) will not change between wrk and wrk2, while (2) will be much more correct and detailed. (3) is new data that we are not currently collecting. Put another way, the results website won't require any modifications as it already expects (1) and (2), while the additional data (3) will just be available in the raw files for anyone interested.
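For concreteness, a hypothetical Python sketch of steps (1) and (2) (the flags mirror the pseudocode above; the binary name and the parsing of a `Requests/sec:` summary line are assumptions that would need verifying against real wrk2 output):

```python
import re
import subprocess

def run_wrk2(url, connections, rate, duration=15):
    """Run wrk2 once and return its stdout (hypothetical wrapper)."""
    cmd = ["wrk2",                       # binary name/path is an assumption
           "-d", str(duration), "-c", str(connections),
           "-t", str(connections), "--timeout", str(connections),
           "--latency", "--rate", str(rate), url]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def requests_per_sec(output):
    """Pull the summary throughput out of the wrk/wrk2 text output."""
    m = re.search(r"Requests/sec:\s*([\d.]+)", output)
    return float(m.group(1)) if m else 0.0

url = "http://tfb-server:8080/plaintext"   # placeholder URL
for c in (8, 16, 32, 64, 128, 256):
    # (1) Saturate: an effectively unbounded --rate makes wrk2 behave like wrk.
    max_throughput = requests_per_sec(run_wrk2(url, c, rate=100_000_000_000))
    # (2) Re-run at that measured rate to get a meaningful latency distribution.
    latency_run_output = run_wrk2(url, c, rate=int(max_throughput))
    print(c, max_throughput)
```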
Luckily, this type of change is fairly simple for us. Post R10, if @msmith-techempower is up for launching more runs on the official hardware, we can get a branch with just the wrk-to-wrk2 changes, run a test on master and another on the wrk2 branch, and verify that the change looks good to everyone.
Feedback+recommendations on pseudocode?
@ceeaspb Where are these test results of which you speak: "that look like they would fit the universal scalability law"?
I would consider modifying the test methodology.
The problem I have with the current methodology is that it reports a number that has no real relevance to what the servers can actually bear under load. E.g. what value is there in knowing that net processed almost 200K plaintext ops/sec if some of those ops took over 14 seconds (in a 15 second test), and 3% of requests failed with a timeout? What can someone learn from that about the throughput that net can handle in comparison to some other server that did 30K ops per second with some operations taking 7 seconds, and where 2% failed with a timeout? Would anyone deploy either server with those loads per node in mind? What does the data tell us about the behavior of these servers at 20K ops/sec? What does it tell us about the rate these servers can actually handle in a setup that you would actually capacity plan for?
My recommendation would be to modify the "score" from "peak number of widgets per second seen passing through the system (regardless of latency or failure rates)" to "peak number of widgets per second under which a required service level was maintained for a sustained (many-minutes) period of time".
The service level should be stated with e.g. these steps in mind:
- 90% (e.g. 10 msec)
- 99.9% (e.g. 200 msec)
- Max (e.g. 2 seconds)
Actual levels can vary per operation. E.g. the above may be good numbers for plaintext and json, but DB stuff may need to be more forgiving.
A scripting system can slew the rate on wrk2 tests to find the breaking point for the above criteria, and establish the stable level (you'll want a run that passes, and a run at a high rate that fails). An initial set of short "ranging" runs can be used to guess at the "knee" before starting the longer tests.
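A hypothetical sketch of what that scripting loop could look like (reusing the `run_wrk2` wrapper from the sketch earlier in the thread; the SLO check is left as a stub because it depends on how the --latency output is parsed):

```python
def meets_slo(output):
    """Stub: return True if the parsed latency distribution meets the service
    level (e.g. p90 <= 10 ms, p99.9 <= 200 ms, max <= 2 s)."""
    raise NotImplementedError

def highest_sustainable_rate(url, connections, lo, hi, runs=8):
    """Binary-search the request rate between lo and hi for the highest rate
    that still meets the SLO; lo/hi could come from short 'ranging' runs.
    Assumes SLO failures are monotonic in rate (a simplification)."""
    best = None
    for _ in range(runs):
        mid = (lo + hi) // 2
        output = run_wrk2(url, connections, rate=mid, duration=300)  # sustained run
        if meets_slo(output):
            best, lo = mid, mid + 1   # passed: try a higher rate
        else:
            hi = mid - 1              # failed: back off
    return best
```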
A different (yet similar), more visual approach would be to produce runs (for each test type, e.g. plaintext, json, etc.) at multiple predetermined throughputs on all frameworks, and to capture the percentile spectrum for each framework, under each test type, under each fixed-throughput load. These captured percentile spectrum logs (in the percentile format HdrHistogram outputs) can then be used to plot any set of frameworks against each other on a common chart (for a given test type and throughput, or for arbitrary sets). You could even provide a page where you select the result sets to plot together, providing a convenient, visual comparison of how the various frameworks behave under the given load. You can see an example of such a chart in wrk2's README.md. And you can find an example page with Javascript to read and plot a set of charts here: http://hdrhistogram.github.io/HdrHistogram/plotFiles.html
I realize that this is a significant change to the testing methodology, but I think that it would give much more valuable information for the consumers of your test round information.
I would consider modifying the test methodology
You are absolutely not alone on this, but I think the goal for this issue should be focused firmly on replacing wrk with wrk2 to improve latency accuracy. As you say, modifying the methodology is a significant change, and that requires a lot more consensus than modifying the load generator. While I personally think your proposal is a big step in the right direction, I've created issue #1227 to track it so we can get some broader input.
That being said, I'd still like to benefit from your experience here if you can overlook the methodology elephant -- I am quite intentionally trying to introduce the smallest changes necessary to replace wrk with wrk2, but I may have made some errors in my translation.
The benchmark pseudocode script above would work to document the latency behavior at the "detected" max throughput. We should expect that throughput to be similar to what wrk shows, and we should also expect the latency behavior to be very bad, at least at connection counts that are high enough to saturate the server (this is where wrk2's latency reporting may differ significantly from wrk's).
In addition to (or instead of some of) your "standard latency levels", I'd add a couple more: at 80% and at 50% of the max throughput. Those may show healthier latency behaviors than the max.
Remember to include the --latency flag in the tests... It would be a shame to not collect the detailed percentile distributions.
@DrQz http://www.techempower.com/blog/2014/05/01/framework-benchmarks-round-9/
In this context I'm more interested in the ones that don't fit it, as that likely points to defects in the test process.
Eg. "Plaintext" test - nearly all the frameworks' throughput data points are constant or degrading, negative returns from incoherency, across the test range. We don't have data points for linear scaling or contention. On 12 Nov 2014 05:03, "Dr. Neil Gunther" notifications@github.com wrote:
@hamiltont re the pseudo code:
for c in {8,16,32,64,128,256}:
the "plaintext" test has a different set of settings, I guess (I am not familiar with the codebase): https://github.com/TechEmpower/FrameworkBenchmarks/blob/e7e77296b7e1da9057f75c069e968368f745f91c/toolset/benchmark/framework_test.py#L518
item in [256,1024,4096,16384]
vs. the other tests' concurrency levels: https://github.com/TechEmpower/FrameworkBenchmarks/blob/79c60e5bf0b85c0a594c9f0148a47d49a64940e9/benchmark.cfg.example#L18
The problem I mentioned with the plaintext test is that a large proportion of the frameworks have their max throughput at the lowest connection count of 256 (http://www.techempower.com/benchmarks/#section=data-r9&hw=peak&test=plaintext). This may explain the poor average latencies of >1 second to return a simple response for most of the frameworks.
So, ideally the pseudocode procedure would ensure:
1) that there are at least some data points that demonstrate linear or near-linear scalability; otherwise the peak may be at a lower connection count. One approach for the "plaintext" test: use a similar range to the other tests, e.g. 8, 32, 128, 256, 512.
2) that if the maximum throughput is at the last data point (highest connections, e.g. cpoll_cppsp and netty on json serialization), the framework may not have reached its peak throughput and may not have reached levels of load that provoke contention and coherency behaviour; if so, the test could continue until that happens.
@ceeaspb As they stand, I don't see what these Plaintext data have to do with the USL. Every web app "framework" listed is a completely different context (like the difference between, say, an Oracle RDBMS and memcached), not incremental load points. So, where are you seeing coherency degradation?
For the USL to apply, you need to plot the steady-state throughput X(N) as a function of the number of load generators (N) for a particular single framework. In other words, there needs to be a whole series of throughput values for any and all of the listed frameworks. What is shown (as far as I can see) is just a single throughput value (for each framework) corresponding to some unknown (to me) number of load generators.
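For reference, the USL being invoked here has the standard form below (the textbook formula, not anything fitted to this benchmark's data):

```latex
X(N) = \frac{N \, X(1)}{1 + \sigma\,(N - 1) + \kappa\, N (N - 1)}
```

where σ is the contention (serialization) coefficient and κ the coherency (pairwise-exchange) coefficient.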
@DrQz Apologies - could you click on the "Data table" tab? It shows the throughput, X(N), at different connection counts, N (I think!). E.g.:
| Framework | 256 | 1,024 | 4,096 | 16,384 | Best | Cls | Lng | Plt | FE | Aos | IA | Errors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| activeweb | 339,627 | 302,862 | 272,321 | 261,618 | 339,627 | | | | | | | |
So for "activeweb" N's lowest value is 256 with an X(N) of 339k, the highest ("best") value in the range of N tested. Preferably there would be data points for lower values of N here to see (my point is there are missing data points).
@ceeaspb OK, I see now. Big 10/4 on that. Wonder why?
In case it's useful to anyone here trying to debug the issue: from the USL perspective, that kind of negative scalability is associated not with contention for shared resources (i.e., queueing delay) but with point-to-point exchange of data updates or messages between non-local or distributed resources (e.g., caches) in order to reach some kind of consistency or data coherency. See Figure D. The associated quadratic growth in the delay to reach consistency causes the throughput to degrade like 1/N (N being the number of load generators). This, of course, was your point.
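A tiny numerical illustration of that retrograde behaviour, using the USL form quoted above with made-up parameters (nothing here is fitted to the benchmark data):

```python
# Made-up parameters, purely to show the shape: throughput rises, peaks, and
# then degrades roughly like 1/N once the coherency (kappa) term dominates.

def usl_throughput(n, x1=1500.0, sigma=0.02, kappa=0.0005):
    return n * x1 / (1 + sigma * (n - 1) + kappa * n * (n - 1))

for n in (1, 8, 32, 128, 256, 1024, 4096):
    print(f"N={n:5d}  X(N)={usl_throughput(n):10.0f}")
```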
Although the following is not really relevant to the discussion at hand, I figured I'd provide it for some context.
The plaintext test type was conceived to satisfy several maintainers and fans of high-performance platforms who collectively wanted to demonstrate the capacity of those platforms. Especially desired were higher concurrency levels and HTTP pipelining.
For the sake of total suite execution time, all tests presently run for 15 seconds. Beyond that, a good deal of additional time is consumed in inter-test cooldowns and process cycling. We are steadily working toward a continuous benchmarking configuration which would afford us the luxury of increasing the test execution times (in part because suite-breaking changes would be revealed very quickly by the continuously-running environment).
Since the addition of the plaintext test type, we have seen the contribution of plaintext test implementations from mid-range and low-performance frameworks, several of which suffer serious capacity issues at high concurrency levels. The current results data are rendered somewhat weaker since many frameworks being tested reach a "panicked" or broken state due to load beyond their capacity. In fact, just last night, @msmith-techempower discovered that the run he kicked off is scrap because the plaintext test type ran first, leaving many implementations unable to respond to any other tests.
Frameworks that perform best at the 256 concurrency level are likely suffering queuing of requests at higher concurrency levels, reducing their overall throughput as concurrency increases. In the bar chart, we simply render the best performance achieved by that framework from the concurrency levels we sampled.
To date, our project has not been intended to exercise the queuing mechanism of web servers, but rather to measure (roughly) the highest throughput possible for each framework as a high-water mark. For this reason, I have historically argued against the value of adding higher concurrency levels to more computationally intensive tests such as Fortunes; doing so would end up exercising the web servers' request queues. To be clear: I am not suggesting there is zero value in exercising servers' request queues, but rather that the marginal utility is lower than that of other enhancements in our backlog, especially when we consider that adding more concurrency levels increases total suite execution time (at least when using wrk).
What we have been missing in our load generation tool is a means to dynamically increase the concurrency during a load simulation until specified thresholds have been met. We have discussed the desire to do this elsewhere (I believe in a GitHub issue, but I have not searched).
I am not certain if wrk2 provides this capability, but what I would like is the ability for the load generator to continue to increase concurrency gradually until a specified percentage of requests exceed a specified latency threshold.
For example, I would like to know how many concurrent requests a framework can process if my business requirement is that 99.9% of requests must be fulfilled in 100ms or less. Starting at 256 concurrency, the load generator may be satisfied and ramp up to 272 concurrency, and so on.
Perhaps this is precisely what comments above are describing, but again I'm not certain.
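If it helps make the idea concrete, a rough, hypothetical sketch of that ramp (as far as I know neither wrk nor wrk2 does this internally, so the loop would live in the toolset; `run_wrk2` is the wrapper sketched earlier in the thread and the percentile check is a stub):

```python
def p999_within(output, limit_ms):
    """Stub: True if 99.9% of requests completed within limit_ms, parsed from
    the load generator's latency output."""
    raise NotImplementedError

def max_concurrency_meeting_slo(url, start=256, step=16, ceiling=16384):
    """Ramp concurrency gradually and return the last level at which 99.9% of
    requests were fulfilled in 100 ms or less."""
    best, c = None, start
    while c <= ceiling:
        output = run_wrk2(url, c, rate=100_000_000_000)  # unconstrained rate
        if not p999_within(output, 100.0):
            break
        best = c
        c += step
    return best
```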
All that said, to set expectations: while this is definitely a very worthy topic to continue discussing, it is not something we are considering for Round 10.
@bhauer thanks for your additional info. I have been reading the tickets on difficulties stopping test etc with interest.
I will continue discussion on the topics in separate issues or on the google group, so this can get back to just being about whether and when (not round 10!) to try wrk2.
@ceeaspb if you have any advice on the halting problems, feel free to reach out on IRC. We're on freenode at #techempower-fwbm discussing it daily, @msmith-techempower especially
Also, feel free to skim our convo on the problems at botbot.me/irc.freenode.net/techempower-fwbm/
Hello folks, FYI wrk 4.0.0 has been released with fairly dramatic improvements to performance and stats accuracy. I've eliminated sampling and now record all latencies below the configured timeout, and CO correction takes a similar approach to the one used by HdrHistogram.
I'd encourage you to take advantage of the Lua scripting support to generate your output files in a more useful format, perhaps JSON. It's much easier than parsing the human-friendly output and you can get any percentiles you desire: https://github.com/wg/wrk/blob/master/scripts/report.lua
@wg This sounds great. Thanks for letting us know! Hopefully we can find some time between Round 10 and 11 to experiment with wrk 4.
I've changed this to a Round 12 milestone.
Resolved by #1864
consider using https://github.com/giltene/wrk2 for CO corrected percentiles, related to #31 and #854.