Hi @orfjackal. Thanks for the suggestion!
I feel what you propose would be best served by a new iteration of Wrk (or perhaps the better word would be permutation, given how the author of Wrk likes to develop enhancements). Such a new iteration/permutation could provide the ability to ramp up concurrency automatically until a target latency is reached, backing off as necessary if it overshoots. That would be slick.
However, while I don't necessarily disagree with Gil, I also feel that criticisms of throughput benchmarks as useless are over-stated. Our data confirms: inability to meet reasonable service-level expectations is strongly correlated with low throughput. Similarly, high throughput is strongly correlated with low latency.
If a reader has a throughput target for her system, our exercise of framework fundamentals won't answer her question fully even if we do set latency targets for our test types. She will need to build a prototype of her application and measure that in order to know whether her target will be met. What our results, as they exist today, can provide such a reader is insight into which platforms and frameworks are fast and which are not, and therefore which are suitable starting points for her application prototype.
It is true that a throughput test exercises a server and application in an unnatural state of complete resource saturation (be it network or CPU). I agree that it would be interesting to see each framework's ability/inability to meet a variety of service-level targets after having established, for example, a fictional need to reply with a "Hello World" in less than 1,000ms. On the other hand, the quantity of additional test runs and captured data to pull that off is fairly large.
Stepping back for a moment, as much as it is fun to see the platforms and frameworks in the top slots jockeying for podium positions, this project isn't really about that fine-grained detail. We started this project because we had experienced a dramatic differential in performance when comparing some frameworks on real projects. By dramatic, I don't mean a 50% performance differential, but 500% or more.
We wanted to measure that differential more objectively and were (pleasantly/unpleasantly) surprised at how wide the differential really was.
The data informally establishes rough tiers of frameworks. For example: high-performance, moderate-performance, and low-performance. Is Undertow really faster (or put even more obtusely, better) than Spray because it eked out 7% more JSON requests per second in Round 6? In light of how dramatically both Undertow and Spray devastate Sinatra--just as an example--the answer is: no, not really.
When I've been asked how to interpret the results I caution against getting too mired in small percentage differences here and there. These will fluctuate a bit due to environment randomness and then a little bit more as test implementations are improved in time by subject-matter experts. But on the whole, the rough tiers have remained largely undisturbed for six rounds, and I suspect they will remain so for the foreseeable future. The most exciting developments will be when a framework jumps wildly upward on the charts or takes a big dive as a result of a new version or a test implementation change. But those will be comparatively rare.
I prefer to encourage people to evaluate the results with a union of their preferences (e.g., "Must be a JVM language"); performance requirements in rough terms (e.g., "performance is important" or "performance is desirable" or "performance would be nice, but not important"); and a review of the matching test implementations' source code (which I feel give good insight into what it's like to actually work with each framework).
My opinion is that I like your proposal but unless it can be made fairly easy to implement, it's a "nice to have." I feel throughput benchmarks, especially when combined with latency data and error reports as we have collected via Wrk, are a viable measure of performance. If we invest effort into going to the "next level" of detail, I would prefer that next level be in creating a more realistic and holistic test type.
All that said, if anyone has the time and interest to attack this proposal, I'm all ears! Please, continue this discussion!
FWIW, I completely agree with the basic argument that the OP (Gil Tene) was making: measuring throughput alone is of limited value, and measuring latency via averages is also of limited value. Certainly better than nothing, but far from optimal!
When measuring latency, one wants to look at the distribution of response times over a sliding window and examine a [high] percentile of said distribution. I.e., it would be wrong to back off just because there's a single slow response out of many tens of thousands of fast responses.
The trouble, then, is that one now has several heuristics to tune: the size of the sliding window, which percentile of the distribution to examine, and what latency value should count as acceptable before backing off.
As a first step, I would recommend ignoring the backoff issue. This eliminates the need for a sliding window or an acceptable latency value (used for backoff, I mean). The benchmark suite would then measure multiple high percentiles of the latency distribution and go from there; I'd recommend 95, 99, and 99.9 (the ones we cared about most at Google, FWIW).
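For illustration, here is a minimal Python sketch, assuming raw per-request latency samples are available (the data below is randomly generated, not from the benchmark suite), of extracting the 95th/99th/99.9th percentiles with a nearest-rank calculation:

```python
# A minimal sketch (not the project's tooling) of extracting high percentiles
# from a set of per-request latencies. The samples are random stand-ins.
import random

def percentile(sorted_samples, p):
    """Nearest-rank percentile: value at rank round(p/100 * n), at least 1."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, int(round(p / 100.0 * len(sorted_samples))))
    return sorted_samples[rank - 1]

# Stand-in data: 100,000 latencies (ms) drawn from an exponential distribution.
latencies_ms = sorted(random.expovariate(1 / 5.0) for _ in range(100_000))

for p in (95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p):.2f} ms")
```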
If it turns out that most of the frameworks have lousy high-percentile latency, then maybe it's worth doing the more elaborate thing with backoff, thresholds, etc.; but this is all a little complex, as not all services are user-facing things designed for low latency. Proponents of a framework designed for throughput rather than latency would be understandably irritated with the benchmark suite for hamstringing them.
My two cents.
Here is a relevant presentation about how to and how not to measure latency: http://www.infoq.com/presentations/latency-pitfalls
In my consulting work I have spent an enormous amount of time reviewing the results of performance tests. I think that it's pretty common to see tests that are near-useless because of methodological problems. The most common problems being:
I'm a big fan of the TechEmpower benchmarks because they are open, reproducible, and pragmatic. I run the benchmarks on my own physical and virtualized servers and I've learned a lot from them. One potential risk with this test design is that an unstable system might appear healthy for 60 seconds and then fail after a few minutes because of a shortage of ephemeral ports or some other resource.
As for Gil Tene's point about latency, I think that this is best considered in the same way as test runs that have error results. If you look at the Round 6 results for JSON on physical hardware then you see that almost every framework has no errors, and that the slower frameworks (with one exception) have average latencies under one second with SD under one second. I'm comfortable describing these as "healthy results" within the context of user-facing web applications (not ideal, I'm a latency Nazi, but I'm a realist too). For most of the test results, it's clear that the performance knee occurs before we reach 256-thread concurrency.
But if you look at the Plaintext results, which go out to concurrency of 16,384 threads, you see a very different picture:
I repeated the servlet plaintext test on my VPS a dozen times, and then again with a duration of five minutes. I found that the peak request rate for the five-minute duration was 30% lower than for the one-minute duration. That suggests that the one-minute servlet plaintext tests, in my environment, are not tests of a stable system. I don't know for certain whether the same is true in other test runs.
I agree that the real value of this test bed is in identifying the rough tiers of frameworks, and that micro-differences really aren't important or meaningful. That doesn't mean that the latency suggestion is just a nice-to-have. Test results from an unstable or broken system aren't very meaningful. I suspect that some of the Round 6 plaintext results fall into that category, and I think that there is great value in distinguishing between unstable and stable results. I think this is an interesting question that's bigger than simply latency.
One way to ask the stability question is to measure whether these request rates can be sustained for longer durations. Another way is to validate whether results make sense when measured against a model like Gunther's Universal Scalability Law (see http://tinyurl.com/kx7g3j7).
I've taken the Round 6 TechEmpower i7 test results for servlet JSON and plaintext and fitted both to the USL in Excel. You can see that whilst the servlet JSON data looks sensible (http://postimg.org/image/ze5geh2rb/), the servlet plaintext data looks "bad" (http://postimg.org/image/ale83dott/).
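For anyone who wants to repeat that exercise outside Excel, here is a small sketch of fitting throughput measurements to the USL with SciPy's curve_fit; the (concurrency, requests/sec) values in it are placeholders, not the actual Round 6 data:

```python
# Sketch of checking measured throughput against Gunther's Universal
# Scalability Law, X(N) = lam * N / (1 + sigma*(N-1) + kappa*N*(N-1)),
# using SciPy. The data points below are placeholders, not Round 6 numbers.
import numpy as np
from scipy.optimize import curve_fit

def usl(n, lam, sigma, kappa):
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

concurrency = np.array([1, 8, 16, 32, 64, 128, 256], dtype=float)
throughput = np.array([9_000, 60_000, 95_000, 120_000, 130_000, 128_000, 125_000],
                      dtype=float)

params, _ = curve_fit(usl, concurrency, throughput,
                      p0=[throughput[0], 0.01, 0.0001],
                      bounds=([0, 0, 0], [np.inf, 1, 1]))
lam, sigma, kappa = params
print(f"lambda={lam:.1f}  sigma (contention)={sigma:.4f}  kappa (coherency)={kappa:.6f}")

# Large residuals between the model and the measurements are one hint that
# the system under test was not in a stable state during the run.
print("residuals:", (throughput - usl(concurrency, *params)).round(1))
```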
I'm not sure whether either of these approaches is feasible within your current workflow, but I think it's important to separate the "good tests" from the "bad."
@pbooth Thank you for your thoughts on this! It's great to get input from experts in the area at hand.
I really like the point you make about the stability of the environment. We have in fact observed some behavior indicative of environment instability, leading to tactics such as restarting the database servers between tests. We'd like to improve stability as much as feasible.
To date, we've had configuration instability of our own making to deal with. In each round we have made one or more tweaks to the environment that have necessitated some trial and error before the data collected appeared sufficiently sound to share. For instance, in Round 2 we changed the load simulation tool from WeigHTTP to Wrk. In Round 6, we added HTTP pipelining and higher request concurrency.
For Round 7 (no ETA, sorry!), we aim to add a physical (i7) Windows environment to the hardware. But I assume that at some point, we'll stop tweaking the configuration and get into a rhythm.
To minimize the time spent dealing with configuration changes, starting with Round 7, we plan to adopt a less forgiving round-over-round progression. In other words, we want to minimize the amount of time we spend auditing results, effectively spreading that responsibility among the community. If specific tests' data looks incorrect, rather than defer posting the results, we'd rather acknowledge the problem and collectively aim to resolve it for the next Round.
Assuming we gain efficiency from that, I think we could return to longer test runs, at least for especially volatile test types such as Plaintext. We have been driven to reduce the duration of test runs as the breadth has grown in an effort to keep the whole run something that can complete overnight. But if we aren't worried about having to repeat overnight tests several days in a row, a longer runtime isn't really a concern.
I am curious if anyone reading has tips for resetting the state of a machine in between tests. We presently have a brief pause between tests, but as @pbooth points out, there is still the potential for state leaking across tests, such as sockets lingering on ephemeral ports. Are there tools or tactics that would, for example, allow us to reset the TCP stack in between tests?
I'd like to hear other recommendations as well!
@bhauer Regarding running out of ports: I'd suggest just configuring the machine properly in the first place (the ephemeral port range and the various backlog parameters) so that you don't run out of ports. Then, if you think you're getting into some bad state with stale connections, just run netstat in a loop until it's all cleared out.
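As one hedged illustration of that advice, a harness could poll for lingering TIME_WAIT sockets between tests, for example by reading /proc/net/tcp on Linux; the threshold and timeout below are arbitrary placeholders:

```python
# Linux-specific sketch of the "run netstat in a loop" idea: between test
# runs, count sockets in TIME_WAIT by reading /proc/net/tcp (and tcp6) and
# wait until they have drained before starting the next test.
import time

def count_time_wait():
    """Count sockets whose state field is 06 (TIME_WAIT) in /proc/net/tcp*."""
    count = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    if len(fields) > 3 and fields[3] == "06":
                        count += 1
        except FileNotFoundError:
            pass
    return count

def wait_for_tcp_drain(threshold=100, timeout_s=300, poll_s=5):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        remaining = count_time_wait()
        if remaining <= threshold:
            return True
        print(f"{remaining} sockets still in TIME_WAIT; waiting...")
        time.sleep(poll_s)
    return False

if __name__ == "__main__":
    print("drained" if wait_for_tcp_drain() else "gave up waiting")
```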
Closing this issue due to inactivity. Reopen this issue if this is something we should continue to consider.
For future readers, also see #854, #1220, #926, #1227
Quoted from Gil Tene's message at https://groups.google.com/d/msg/mechanical-sympathy/ukY80mtJXEg/3NaVonuzSGEJ
How about adding something like that? Right now the benchmarks report max latencies of even over ten seconds, which would be completely unacceptable for a typical web site. Measuring sustainable throughput would give more useful information to those evaluating different frameworks.
For example, measure the throughput that can be achieved while keeping max latency below X ms. The measurements should be done at various max latency limits, so that the people reading the benchmark results may themselves decide what is acceptable latency to them (give them a configurable option and/or improve the data visualization). The measurement could perhaps be implemented by slowly increasing the number of concurrent requests (starting from 1) and recording the max latency and throughput at each step.
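A rough sketch of how such a measurement loop might drive Wrk follows; it is only an illustration, with a placeholder URL, arbitrary step sizes and durations, and output parsing that assumes wrk's standard report format:

```python
# Step up wrk's connection count, record throughput and max latency at each
# step, and report the best throughput achieved while max latency stayed
# under the limit. Parsing assumes wrk's standard report format
# ("Latency <avg> <stdev> <max> ..." and "Requests/sec: <n>").
import re
import subprocess

URL = "http://localhost:8080/json"   # placeholder target
LATENCY_LIMIT_MS = 1000.0            # the "X ms" from the proposal

def to_ms(token):
    units = {"us": 0.001, "ms": 1.0, "s": 1000.0, "m": 60000.0}
    value, unit = re.match(r"([\d.]+)(us|ms|s|m)", token).groups()
    return float(value) * units[unit]

def run_wrk(connections, duration="30s", threads=8):
    threads = min(threads, connections)  # wrk requires connections >= threads
    out = subprocess.run(
        ["wrk", "-t", str(threads), "-c", str(connections), "-d", duration, URL],
        capture_output=True, text=True, check=True).stdout
    max_latency_ms = to_ms(out.split("Latency")[1].split()[2])
    requests_per_sec = float(re.search(r"Requests/sec:\s+([\d.]+)", out).group(1))
    return requests_per_sec, max_latency_ms

best = 0.0
for connections in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    rps, max_ms = run_wrk(connections)
    print(f"c={connections}: {rps:.0f} req/s, max latency {max_ms:.0f} ms")
    if max_ms <= LATENCY_LIMIT_MS:
        best = max(best, rps)

print(f"Throughput sustained under {LATENCY_LIMIT_MS:.0f} ms max latency: {best:.0f} req/s")
```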