TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

Socket errors not accounted for #117

Closed bhauer closed 11 years ago

bhauer commented 11 years ago

In our extraction of numeric values from the raw results from the Wrk load simulation tool, we are not accounting for socket errors. These tend to appear when the platform or framework is saturated. Not dealing with them could mean the resulting request per second and latency numbers are not correct.

See for example the 256-concurrency run for Go 1.0.3 on EC2:

https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/results/ec2/20130404200504/json/go/raw

Making 100000 requests to http://10.253.68.28:8080/json
  4 threads and 256 connections
  Thread Stats Avg Stdev Max +/- Stdev
    Latency 7.81ms 6.15ms 26.54ms 50.00%
    Req/Sec 2.94k 2.58k 7.00k 46.75%
  100000 requests in 7.26s, 13.35MB read
  Socket errors: connect 0, read 0, write 0, timeout 53
Requests/sec: 13769.23
Transfer/sec: 1.84MB
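
As an aside, here is a minimal sketch of what accounting for these errors during extraction could look like, assuming raw output formatted exactly like the sample above (illustrative only, not the toolset's actual parser):

```python
import re

def parse_wrk_output(raw: str) -> dict:
    """Extract requests/sec and socket-error counts from raw Wrk output."""
    stats = {"requests_per_sec": None, "errors": {}}

    rps = re.search(r"Requests/sec:\s*([\d.]+)", raw)
    if rps:
        stats["requests_per_sec"] = float(rps.group(1))

    # The "Socket errors" line only appears when Wrk observed errors.
    errs = re.search(
        r"Socket errors: connect (\d+), read (\d+), write (\d+), timeout (\d+)", raw)
    if errs:
        stats["errors"] = dict(
            zip(("connect", "read", "write", "timeout"), map(int, errs.groups())))

    return stats
```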

At first glance, I suspect this behavior could be avoided by increasing the timeout period for requests. To that end, I have submitted a few questions related to timeout to @wg on the Wrk repo: https://github.com/wg/wrk/issues/28

I've updated the blog entry to mention this problem at the top of the entry. Once we can work around the timeouts (presumably by increasing the timeout period generously), we will need to re-run any affected tests.

jameslyo commented 11 years ago

Another solution (more expensive, but not requiring a change to Wrk) might be to increase concurrency to a point where the average latency for the framework begins to skew (or begins to show timeout behavior), and then measure the throughput and latency at that level (just before breakdown).

That said, the cleanest change would be what you're proposing -- which is to just let the timeout be much much longer...

bhauer commented 11 years ago

Thanks @jameslyo for pointing out this issue in the first place. I'll wait to hear what the Wrk author has to say about timeouts and we'll proceed from there.

bhauer commented 11 years ago

Hi @jameslyo. If you have not already had a chance, take a look at the response from @wg: https://github.com/wg/wrk/issues/28 .

If I understand it correctly, the requests that triggered the timeout warnings are allowed to proceed to completion. That is, in the report above, 53 requests took longer than Wrk's built-in 2-second timeout, but those connections were not closed or reset, and no data was lost.

I then asked why the maximum latency doesn't show a number higher than 2 seconds in that case, and it sounds like the maximum latency being reported is not the definitive maximum but rather the highest value among the random samples taken. Reviewing it again this morning, I'm frankly still a little fuzzy on this.

Your thoughts?

jameslyo commented 11 years ago

Well... after reading it, I suppose I understand not wanting to bookkeep every connection for performance reasons (it IS a load generator, after all), but at the same time, it makes the result really hard to understand (or even be meaningful). I sort of wish latency could be measured per connection, or at least logged cheaply and calculated after the fact; that would allow for distributions (medians, 90th percentile, etc.), which everyone would love too. But I don't know whether the goals of the Wrk tool are the same as ours.

If I read that right, one connection is sampled every 100ms. The run took 7.26s, so it only sampled about 72 requests per thread (4 threads, ~288 samples) to compute the latency stats. That's not a representative distribution at all. It's really too bad the sampling is based on time rather than on the number of requests, because it means fast-performing servers get poor sampling accuracy, which makes comparing Go with Cake on latency even less meaningful. (Comparing Go at 16 concurrency with Go at 256, on the other hand, is at least apples to apples.)
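
To make the arithmetic explicit, here is a back-of-the-envelope sketch using the numbers from the Go run above and my reading of the sampling interval:

```python
run_duration_s = 7.26       # from the Wrk summary above
sample_interval_s = 0.100   # one connection sampled per thread every 100 ms (my reading of wg/wrk#28)
threads = 4
total_requests = 100_000

samples_per_thread = int(run_duration_s / sample_interval_s)   # ~72
total_samples = samples_per_thread * threads                    # ~288

# Under 0.3% of requests ever contribute to the latency statistics.
sampling_fraction = total_samples / total_requests
print(f"{total_samples} samples, {sampling_fraction:.2%} of all requests")
```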

The only way to normalize this (without modifying the tool) is to estimate the number of requests being sampled and normalize that. Said another way: run all tests for as close to the same length of time as possible, which means guessing how long x requests will take and solving for the x that makes each run last about as long as the slowest test. That would at least give you similar sample sizes and make the comparison more valid.
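
A rough sketch of that estimate, assuming you already have a ballpark requests-per-second figure for each framework from a prior run (the target duration and numbers below are just examples):

```python
def requests_for_duration(estimated_rps: float, target_duration_s: float) -> int:
    """Solve x in: x requests / estimated_rps ~= target_duration_s."""
    return round(estimated_rps * target_duration_s)

# Example: make every run last roughly as long as the slowest test (say, 60 s),
# so each run yields a similar number of latency samples.
slowest_test_duration_s = 60.0
print(requests_for_duration(13_769, slowest_test_duration_s))  # the Go run above -> ~826,000 requests
print(requests_for_duration(1_000, slowest_test_duration_s))   # a slower framework -> 60,000 requests
```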

bhauer commented 11 years ago

Now that we have discussed this a bit, here is my take. On one hand I feel this is an important matter we should aim to resolve as soon as possible. However, on the other hand, I feel that with this latency data and the concerns we now have about it, it is quite easy to lose sight of the forest by looking at the trees.

We know Go's latency is going to be dramatically lower than CakePHP's, so the data we get out of Wrk is not surprising. But is Go's latency actually better than Netty's and Servlet's? That's something we now realize we cannot be as sure about, given how the latency data we get from Wrk is sampled. All three are so quick that the random sampling may miss the specific outlier requests that would cause the "sniff check" (dividing total execution time by the number of requests completed to estimate average latency) to fail.

For our purposes, I feel in the short-term we can just include messaging that explains the latency data is a sampling of random requests and may be especially inaccurate for high-performance frameworks due to the small number of samples taken.

I don't feel that takes away tremendously from the value of the benchmarks as a whole, since I doubt many people will give a great deal of decision-making weight to 0.1 to 0.5 ms latency variances. However, they should put value in 25 ms, 100 ms, or 500 ms latency variances, and about those we are more confident.

All of this loops back to an underlying curiosity I had when readers asked for latency numbers: how much value does latency truly give you versus requests per second? Yes, in some frameworks' cases, the maximum may be a little more disproportionate to the average than in nearby frameworks. Or perhaps the standard deviation is a little greater than that of surrounding frameworks. This may be relevant if your system absolutely must consistently respond in, say, 5 ms. But few of us are building those systems. Most of us just want to build "apparently fast" web applications.

In other words, I'd rather use a framework that has an average of 5ms with a rare maximum of 100ms than one that has a consistent 50ms latency. The web already has too many other variables adding a few milliseconds here and there to worry about rare maximums that are still quite low in the big picture.

(That said, I realize that the particular case we are discussing, Go 1.0.3 at 256 concurrency, may in fact have some requests at greater than 2,000ms that are missed by the Wrk latency sampler, suggesting perhaps a garbage collection hiccup.)

I would like to eventually understand why Wrk cannot, as it completes each request, simply add that request's elapsed time to a per-thread running total (which would eventually be divided by that thread's number of completed requests to compute average latency) and do a set-if-greater to store a per-thread maximum latency. From my naive point of view, only the standard deviation would require a full sample space to compute, and therefore might require concessions like the sampling approach to maximize performance.
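
For illustration only (and saying nothing about Wrk's actual internals), here is a sketch of that kind of per-request accumulation. As it happens, a running variance can also be maintained online via Welford's method, without retaining the full sample space:

```python
class LatencyAccumulator:
    """Per-thread running latency stats: no per-request storage required."""

    def __init__(self):
        self.count = 0
        self.total = 0.0    # aggregate elapsed request time
        self.maximum = 0.0
        self._mean = 0.0    # Welford running mean
        self._m2 = 0.0      # Welford sum of squared deviations

    def record(self, latency_ms: float) -> None:
        self.count += 1
        self.total += latency_ms
        if latency_ms > self.maximum:   # set-if-greater
            self.maximum = latency_ms
        delta = latency_ms - self._mean
        self._mean += delta / self.count
        self._m2 += delta * (latency_ms - self._mean)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0

    @property
    def stdev(self) -> float:
        return (self._m2 / self.count) ** 0.5 if self.count else 0.0
```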

jameslyo commented 11 years ago

Why latency matters: generally speaking, when performance-tuning a system, you're looking at outlier performance. The average is not that meaningful. Example:

1 out of 100k requests takes 100k seconds. The rest take 5ms.

The reported average latency is now (99,999 × 5 ms + 100,000,000 ms) / 100,000 ≈ 1.005 seconds.

That looks slow. Call in the engineers, we have to optimize! (or maybe not....)

If you looked at a distribution, you'd see: 25th: 5 ms, 50th: 5 ms, 75th: 5 ms, 90th: 5 ms, 99th: 5 ms, 99.9th: 5 ms, max: 100k sec, average: ~1 s.

And you would know without a moment's hesitation that the system is performing fine. The average is easy to ignore when you can see that there are clear outliers skewing it.
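
The same example in a few lines of code, as a toy illustration of how far the mean and the percentiles diverge:

```python
# 99,999 requests at 5 ms, one pathological request at 100,000 s.
latencies_ms = [5.0] * 99_999 + [100_000_000.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)            # ~1005 ms: looks alarming
latencies_ms.sort()
p99_ms = latencies_ms[int(0.99 * len(latencies_ms)) - 1]   # 5 ms: the system is fine
print(f"mean={mean_ms:.1f} ms, 99th percentile={p99_ms} ms")
```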

The same is true in reverse: the average could be skewed low by fast requests, hiding a substantial number of users suffering crappy performance.

If you have good average latency, you could still have crappy 90th-percentile latency, and that means 10% of your customers are seeing something bad. That's not good for your reputation.

Without looking at the distribution, it's really hard to draw any conclusions at all from a mean. So while I find latency interesting in general, I think presenting it in this context is tricky to interpret. But now that I know how it's calculated, it may have more value than I first thought.

It's also true that users notice latency variance more than latency itself. When something is sometimes fast and sometimes slow, they will notice the slowness more than if it were always a little slow.

As for the tests, I agree that as throughput rises the latency comparison becomes increasingly suspect, due to reduced sampling, so rank-ordering frameworks by latency is... perhaps unfair. Maybe you could present it as you suggest, using buckets: latency below 10ms, below 50ms, below 100ms, below 500ms, below 1s, etc., and declare everything in the same bucket "tied". That might be fairer than declaring Go superior to Netty when the throughput numbers suggest that "on average" that isn't the case.
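
A sketch of that bucketing idea, using just the thresholds named above (the labels and boundaries are illustrative, not a proposal for the actual charts):

```python
import bisect

# Upper bounds of each latency tier, in milliseconds.
BUCKET_BOUNDS_MS = [10, 50, 100, 500, 1_000]
BUCKET_LABELS = ["<10ms", "<50ms", "<100ms", "<500ms", "<1s", ">=1s"]

def latency_bucket(avg_latency_ms: float) -> str:
    """Frameworks that land in the same bucket are reported as 'tied'."""
    return BUCKET_LABELS[bisect.bisect_right(BUCKET_BOUNDS_MS, avg_latency_ms)]

print(latency_bucket(7.81))    # '<10ms'  (e.g. the Go run above)
print(latency_bucket(9_500))   # '>=1s'   (e.g. Django with 20 queries)
```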

That said, the sampling pulls out the outliers and tells you something about what "typical" latency might be. While it's less rigorous than a distribution, I'm coming to the conclusion that it's probably indicative of the latency value you're "likely" to see.

In the case of Go, it points to a few requests taking a very long time (skewing the averages) but most completing fast. In the case of Netty, the user experience is the same, and given its throughput superiority, the tail of the distribution is no doubt smaller.

It's all a bit like tea-leaf reading without the distribution. So actually, now that I know it's sampled, I almost prefer this, because it gives an independent axis for looking at the distribution. If it were every connection and an average, it would be no more informative than the throughput numbers.

bhauer commented 11 years ago

I don't mean to imply latency information has no utility. However, the value of latency information was, in my opinion, exaggerated. I am glad to have it now, with the caveats we have established with respect to sample resolution.

I feel no results in the latency chart came as a surprise given what we had observed from first-round request-per-second data. Sure, there are a few positions exchanged when comparing the RPS and latency rank orders, but nothing significant enough to catch my eye.

Even the highest standard deviation observed (7.5 seconds on a 9.5 second average for Django with 20 queries per request) isn't particularly notable in my opinion. Although, again, one exception is the one this particular discussion has unearthed concerning Go 1.0.3 at 256 concurrency.

As a note of related interest, the Django results in question not only include thousands of "timeouts" but also many read errors. E.g., from 20 queries on EC2:

Socket errors: connect 0, read 738, write 0, timeout 97589

Pat and I are discussing capturing read, connect, and write errors for rendering in the charts. These errors should be displayed since they represent a server that is breaking under load, not just processing requests slowly.

jameslyo commented 11 years ago

For the purposes of this benchmark, I agree with you. Latency should be consistent with such a homogeneous request load, so fretting about it is probably not that important. Standard deviation isn't the same as looking at the 90th percentile, but it's still a variance measure, and so better than nothing. In practice, I think I have trained myself (and others are no doubt in the same boat) to dismiss averages as close to meaningless, because in most practical scenarios they are not what one should be looking at.

I think we're on the same page at this point. I appreciate you looking into this issue and being so open with all the work you are putting into this.

Capturing error output is a good idea too. It might be worth disqualifying an achieved throughput if the system was producing errors. In practice, you would react to such errors in production by adding nodes to bring the system back to a happy place. So the practical boundary is however much concurrency (and throughput) you can achieve before throughput starts to drop (latency rises) and/or error rates jump.
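
A sketch of what such a disqualification rule could look like; the 1% threshold is arbitrary and purely for illustration:

```python
def disqualified(completed_requests: int, socket_errors: dict,
                 max_error_rate: float = 0.01) -> bool:
    """Flag a result whose connect/read/write/timeout errors exceed the threshold."""
    total_errors = sum(socket_errors.values())
    # Simplification: treat completed requests plus failed connects as the attempts.
    attempted = completed_requests + socket_errors.get("connect", 0)
    return total_errors / attempted > max_error_rate

# Django, 20 queries, EC2: 100,000 requests with 738 read errors and 97,589 timeouts.
print(disqualified(100_000, {"connect": 0, "read": 738, "write": 0, "timeout": 97_589}))  # True
```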

bhauer commented 11 years ago

Agreed. If we can figure out a nice way to include the count of errors for each test in our tables, I think the reader will be able to comprehend, for example, "with 20 trivial queries per request at X client concurrency, Django without a connection pool starts falling over."

Thanks for your feedback, James! If you have other questions, thoughts, criticisms, etc. please feel welcome to submit them.