TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

Questions about the request timeouts used for benchmarking #1052

Closed: hamiltont closed this issue 6 years ago

hamiltont commented 10 years ago

According to #855, Wrk associates each request with a time limit, such as 5 seconds. If a request returns after the time limit, it is counted as an error. This was apparently a problem when benchmarking frameworks under heavy load: once the request queue filled up, wrk counted every request as a failure, because each one spent more than 5 seconds waiting in the queue while earlier requests were being processed.

This would be hugely frustrating, because the longer this heavy-load benchmark runs, the worse your server appears to perform. That is true even if it is doing as much work as possible with the available CPU: once your queue is in a very full state, you are done. Apparently there's no way to saturate, back off, and repeat; the only option is to hit the server as hard and fast as you can.
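To make the queueing arithmetic concrete, here is a rough back-of-the-envelope sketch in Python (the throughput figure is an assumed number for illustration, not a measurement from the benchmark):

# Rough illustration of the queueing effect described above; the
# throughput value is an assumption chosen for illustration.
outstanding = 256        # concurrent requests queued ahead of a new arrival
throughput = 50.0        # requests/second the server can actually drain
expected_wait = outstanding / throughput
print(expected_wait)     # 5.12 seconds -- already past a 5-second timeout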

The solution to this problem was to make the timeout wrk uses for each request equal to the concurrency. That means that at concurrency level 256, each request that completes within 256 seconds(!) is counted as a successful request. That's over 4 minutes allowed for a single request! The code looks like this:

{wrk} {headers} -d {duration} -c {max_concurrency} --timeout {max_concurrency} <snip>
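For reference, this is roughly how that template expands at concurrency 256 (the header and duration values below are illustrative assumptions, not the toolset's actual settings):

# Hypothetical expansion of the command template above; only the
# concurrency/timeout pairing is the point, the other values are assumed.
command = "{wrk} {headers} -d {duration} -c {max_concurrency} --timeout {max_concurrency}".format(
    wrk="wrk",
    headers='-H "Connection: keep-alive"',
    duration=15,
    max_concurrency=256,
)
print(command)
# wrk -H "Connection: keep-alive" -d 15 -c 256 --timeout 256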

I wanted to open an issue to discuss problems and potential different approaches.

Problems:

Solution thoughts:

msmith-techempower commented 10 years ago

From my point of view, the current implementation is fine.

Essentially, if a site using FrameworkX is under high load (getting /.'d, i.e. slashdotted, for example) and I am a user trying to visit that site, then after anything more than a few seconds of waiting I am going to cancel my request, retry immediately a few times, and then give up for a good long while.

In no conceivable universe could I see myself waiting 120 seconds for a site to load.

As a developer of web applications, I think that 5 seconds is probably still a little long when considering production-grade deployments. In my experience, if a user has to wait more than 1 second, it is considered a poor experience.

To be fair, we are doing a stress test with these benchmarks. We expect highly optimized stacks to perform better under these extreme load conditions, while less performant ones will suffer. To that end, I think this discussion should shift away from arguing magnitude (how do you get every request to return without timing out) and toward the most appropriate way of measuring what we believe we are measuring.

hamiltont commented 10 years ago

the current implementation is fine...no conceivable universe could I see myself waiting 120 seconds

The current implementation waits this long, so how is that fine?

msmith-techempower commented 10 years ago

The current implementation waits this long, so how is that fine?

Clearly, I have misread this post. Chalking it up to "it's early" and tagging @bhauer in.

hamiltont commented 10 years ago

Ah ok. That explains why I was really confused by your reply!!

EDIT: I looked back and realized the "problem" was mired in a wall of text, so I bolded it

bhauer commented 10 years ago

My opinion is that we should set a "reasonable" timeout such as 5, 10, 15, or 30 seconds.

However, I don't believe we're doing anything with the timeout information presently. As I understand it from @wg, requests that exceed the timeout duration are simply indicated as such by Wrk, but these requests are not terminated. Ultimately, when they finally do complete, Wrk will include them as a completed (successful) request.

As such, we're not treating timeouts as if they were errors.

hamiltont commented 10 years ago

but these requests are not terminated

Huh, that seems to be true. It appears from a quick look that wrk just examines the open connections once every --timeout seconds and increments the error count for any connections open longer than this. I was not expecting that :-P
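For concreteness, here is a minimal Python sketch of that accounting behavior as I understand it (an illustration only, not wrk's actual C code):

import time

TIMEOUT = 5  # seconds, illustrative

class Connection:
    def __init__(self, started_at):
        self.started_at = started_at
        self.flagged = False

def sweep(open_connections, stats, now=None):
    # Periodic pass over open connections: anything open longer than
    # TIMEOUT gets flagged and the timeout counter is bumped, but the
    # request itself is left running.
    now = time.monotonic() if now is None else now
    for conn in open_connections:
        if not conn.flagged and now - conn.started_at > TIMEOUT:
            conn.flagged = True
            stats["timeouts"] += 1

def on_complete(conn, stats):
    # When the response finally arrives it still counts as completed,
    # whether or not it was flagged earlier.
    stats["completed"] += 1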

Anyways, there are two thoughts from this. We either want to always set the timeout very high, so that we don't lose any performance to wrk iterating over the open connections frequently, or we want to set the timeout to a reasonable value so that it at least displays some useful information in the detailed log output. At the moment the variable-timeout approach does not make much sense to me.

I'm for changing this to a reasonable value as well, e.g. 5, 15, or 30 seconds.

methane commented 10 years ago

This is a load-testing vs. benchmarking (throughput) issue.

If it's a load test, requests should be sent at a constant rate. When the load is too high, the server can return a 503 error quickly, and the client does not immediately send the next request.

But the current wrk sends the next request immediately after receiving a 503 error, so server resources are consumed producing tons of errors. That's because wrk is designed for benchmarking throughput.

If you want a small (realistic) timeout, how about adding a sleep after receiving a 50x response? The sleep time should be the same as the timeout, I think.
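A tiny sketch of the proposed back-off, just to pin down the idea (an illustrative pseudo-client, not a change to wrk itself; the names are made up):

import time

TIMEOUT = 5  # seconds; the proposal is to back off for the same duration as the timeout

def handle_response(status, send_next_request):
    if 500 <= status < 600:
        # Server is overloaded: pause instead of retrying immediately,
        # so resources are not burned generating more errors.
        time.sleep(TIMEOUT)
    send_next_request()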

methane commented 10 years ago

but these requests are not terminated

Huh, that seems to be true. It appears from a quick look that wrk just examines the open connections once every --timeout seconds and increments the error count for any connections open longer than this. I was not expecting that :-P

Uh, I was wrong, sorry. Given this behavior, I agree with using a shorter timeout (3 or 5 seconds).

msmith-techempower commented 8 years ago

Still not sure where we stand on this one - kicking down the road.

msmith-techempower commented 6 years ago

Here we are two years later, and I'm still not sure where we stand on this one.

It LOOKS like our timeout is now set to 8 seconds across the board, so I am going to close this; if anyone disagrees, feel free to reopen.