TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

Octane/Round 14 should be considered stripped #2783

Closed fredrikwidlund closed 7 years ago

fredrikwidlund commented 7 years ago

Please note that "Octane" (currently #1 in plaintext) does not even attempt to implement the HTTP protocol. It works as a TCP daemon that counts X occurrences of "\n\r" (supposedly unique as part of the double newline "\r\n\r\n") to decide that X requests have arrived, assumes a specific request without parsing the input, and dumps X copies of a raw prebuilt reply.
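
For illustration, a minimal sketch of the delimiter-counting approach described above. This is hypothetical code, not Octane's actual source, and count_terminators is an invented name:

  #include <stddef.h>

  /* Hypothetical sketch of the approach described above: count "\n\r"
     occurrences in the incoming bytes and send one prebuilt reply per
     occurrence, without ever parsing the request itself. */
  static size_t count_terminators(const char *buf, size_t len)
  {
    size_t count = 0;
    for (size_t i = 0; i + 1 < len; i++)
      if (buf[i] == '\n' && buf[i + 1] == '\r')
        count++; /* assumed to be the middle of "\r\n\r\n", which need not hold */
    return count;
  }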

methane commented 7 years ago

How about just removing it?

I feel "stripped" approach means disable some option or middleware in the framework which usually enabled in production.

NateBrady23 commented 7 years ago

@simongui Can you speak to this please?

stevehu commented 7 years ago

There are several frameworks that only implement plaintext, so they are not using any routing rules to find the handler. That is unfair to the other frameworks that implement all of the tests, as routing takes a lot of CPU time. I think we should remove them or add an indicator for partial implementations. What do you think?

methane commented 7 years ago

I dislike that personal projects with no (or only one) production usage are listed and maintained. Octane seems to be a personal, experimental, one-month project.

stevehu commented 7 years ago

@methane I totally agree with you. It is not a framework, as there are only two files in the project.

fredrikwidlund commented 7 years ago

@stevehu libreactor implements json and plaintext with a request routing layer. What do you mean by this?

  reactor_http_server_open(&server, NULL, &server, "0.0.0.0", "8080");      /* listen on 0.0.0.0:8080 */
  reactor_http_server_route(&server, plaintext, NULL, "GET", "/plaintext"); /* route GET /plaintext */
  reactor_http_server_route(&server, json, NULL, "GET", "/json");           /* route GET /json */
  reactor_core_run();                                                       /* run the event loop */

fredrikwidlund commented 7 years ago

My main issue with Octane is that it does not parse HTTP. I believe that even with the current rules it should be clear that the different frameworks are expected to actually implement HTTP to some degree.

stevehu commented 7 years ago

@fredrikwidlund Thanks for the clarification. Most frameworks will have a routing layer to handle HTTP requests. The simplest is a hard-coded if/else, but a real router should support pattern matching in order to parse URI parameters. In the latter case, the order and the number of routes will impact performance, especially in plaintext, since the handler is so small. That is the reason I recommend having a flag to indicate partial or full implementation.
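
To make the hard-coded variant concrete, here is a minimal sketch; the request struct and the plaintext/json/not_found handlers are invented for illustration and are not any real framework's API:

  #include <stdio.h>
  #include <string.h>

  struct request { const char *method; const char *path; };

  static void plaintext(struct request *r) { (void)r; puts("Hello, World!"); }
  static void json(struct request *r)      { (void)r; puts("{\"message\":\"Hello, World!\"}"); }
  static void not_found(struct request *r) { (void)r; puts("404 Not Found"); }

  /* The simplest possible router: hard-coded if/else on method and path.
     Route order and route count affect per-request cost; a pattern-matching
     router that extracts URI parameters costs more still. */
  static void route(struct request *r)
  {
    if (strcmp(r->method, "GET") == 0 && strcmp(r->path, "/plaintext") == 0)
      plaintext(r);
    else if (strcmp(r->method, "GET") == 0 && strcmp(r->path, "/json") == 0)
      json(r);
    else
      not_found(r);
  }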

fredrikwidlund commented 7 years ago

@stevehu I understand the reasoning and agree that request routing should be included. However, I do not understand why you include libreactor (my framework) in this context.

stevehu commented 7 years ago

@fredrikwidlund I didn't read the code; I just looked at the test results to check whether the framework is a partial implementation or not. My main point is about partial implementations, not about whether the framework has a routing layer. libreactor only implements json and plaintext, so it is included in my list.

fredrikwidlund commented 7 years ago

@stevehu So you think that frameworks that by default only take part in the json and plaintext tests should be considered "partial implementations" and therefore be removed? Personally, I find this somewhat drastic, but as long as it is communicated clearly and well in advance, the rules are what they are.

stevehu commented 7 years ago

@fredrikwidlund No. I don't think they need to be removed, but there should be a star (*) before or after the name to indicate they are partial implementations, plus a footnote explaining what that means. Another approach is to add an extra column, but I think that is too much work.

simongui commented 7 years ago

Hey all, thanks for all the great comments! I consider this a suite of tests that benchmark different functionality; not all entries will participate in all tests, and each test has its own set of requirements that exercise specific functionality.

Let's review some of the requirements that I've extracted from Test type 6: Plaintext

This test is not intended to exercise the allocation of memory or instantiation of objects. Therefore it is acceptable but not required to re-use a single buffer for the response text

I'll point out that no fully featured HTTP library or server behaves this way, so this requirement essentially invites network-level libraries and experiments that are not fully featured HTTP libraries. These are still valid use cases for some HTTP implementations. For example, Octane is used to implement a high-performance in-memory cache that responds to HTTP requests.

There's also no mention that the HTTP library needs to have a routing component. It's also pointless for this benchmark to require a routing component with a single route. That's not benchmarking anything. I could optimize a routing component that works great for one route but is terrible for two. Should I build that just to satisfy a benchmark requirement, in a way that's not actually usable for routing? This is why this simple plaintext benchmark shouldn't require routing at all; instead, routing benchmarks should be added to exercise that kind of functionality, and libraries like Octane would not participate in them.

It's really cherry-picking to say that in this benchmark you can avoid all mallocs and buffer management (a totally unrealistic use case), yet a routing component is required. These requirements dictate that this benchmark mostly measures the networking efficiency of the library.

The fact that this benchmark states you don't need buffer management puts the entire plaintext test in a really experimental category, because at these speeds buffer management is likely the main source of contention, and the benchmark removes that requirement completely.

This benchmark is about responding with a static plaintext body but with dynamic headers like the date/time.
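
A hedged sketch of that pattern (hypothetical code, not Octane's actual implementation): the response is built once, and only the Date header is refreshed periodically, so serving a request is just a write of prebuilt bytes.

  #include <stdio.h>
  #include <time.h>

  static char response[256];
  static size_t response_len;

  /* Rebuild the canned response; if called once per second, the Date header
     stays current while the body and the other headers remain static. */
  static void refresh_response(void)
  {
    char date[40];
    time_t now = time(NULL);
    struct tm tm;
    gmtime_r(&now, &tm);
    strftime(date, sizeof date, "%a, %d %b %Y %H:%M:%S GMT", &tm);
    response_len = (size_t)snprintf(response, sizeof response,
      "HTTP/1.1 200 OK\r\n"
      "Server: example\r\n"
      "Date: %s\r\n"
      "Content-Type: text/plain\r\n"
      "Content-Length: 13\r\n"
      "\r\n"
      "Hello, World!", date);
  }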

This entry satisfies the requirements specified, and I believe it satisfies what this test is benchmarking. I discussed with the TechEmpower folks on IRC before submission what this entry did and didn't include, and it was cleared. They asked me to pull the code I had in my repo and put it into the FrameworkBenchmarks repo, so that is what I did.

If fuller-featured libraries want a benchmark that measures plaintext responses and includes routing functionality and buffer management, then another test should be created for that, one that makes requests using some standard set of URLs and arguments and includes dynamic variable extraction. There's no point in a one-URL, no-buffer-management test including routing components that aren't being exercised. That proves nothing.

Everyone really needs to ask themselves what this test is actually exercising. That is all the requirements should cover. If the test doesn't exercise some piece of functionality, then that functionality doesn't belong in the requirements.

I know people are upset because this entry scored at the top and jumped over others, but the bottom line is that it's an extremely minimal test that excludes a lot of uses of malloc. Because of this, its contestants will be minimal implementations that serve as foundations on which people can build some limited software. More fully featured libraries won't, and shouldn't, win this specific test. It's very narrow in scope, and libraries that do less will do better.

NateBrady23 commented 7 years ago

@simongui Thanks for the quick response!

fredrikwidlund commented 7 years ago

@simongui As I understand it, this is a benchmark of HTTP software. Octane is simply and only a TCP server; it does not implement HTTP in any way. Beyond that there is basically no implementation except for libuv, it is severely broken, and the results are error-prone.

I am personally not upset in any way. It is just that allowing entries like this in the benchmark makes the results misleading and hollow. People spend time trying to make entries perform well, and allowing shortcuts like not parsing requests at all corrupts both the results and the spirit of the benchmark. And then what is the point?

bhauer commented 7 years ago

@fredrikwidlund Thank you for spending time to review implementations and bring your concern about Octane's implementation to our attention.

We'll take a look as well. @methane brings up a very good point when he asks, "How about just remove it?"

I feel there are two questions:

  1. How much of the HTTP protocol needs to be implemented for us to feel satisfied that the implementation is an HTTP server? What is the threshold?

  2. What do we do if an implementation doesn't meet that threshold?

The answer to the first is likely to be a bit fuzzy unless we get into gory detail, which I would prefer to avoid. Inevitably the question is "What is an HTTP server?" I believe that any honest observer would judge that an implementation that is receiving HTTP dialog and looking at only the request URI and then sending canned HTTP dialog as a byte buffer is not an actual HTTP server, whatever that is. Put another way, it should be clear at a glance that the HTTP-processing capabilities are considerably broader than the bits we are exercising here.

The Plaintext test type's requirements were expressed as requiring the Server and Date response headers not because we care about those headers per se but because their presence is indicative of an actual HTTP library. It's not surprising, however, that we have reached the eventuality where we need to contend with implementations that test the boundaries of the requirements we created.

This project was intended to exercise frameworks and libraries that were credible HTTP servers. However, as the scope of the project has grown, we've accepted implementations that stretch that definition, and in some cases are basically fine-tuned to implement just the pieces necessary to satisfy our test requirements. That does ring of the concept we classified as "Stripped." But Stripped was intended to indicate implementations that are disabling features that would normally be left enabled in production deployments; it was not originally intended to cover implementations that are not actual HTTP servers. Although in practice, we may have used it to cover those as well in the past.

I agree with the general sentiment that the results should give some assistance to the reader to understand the capabilities of the software being tested. We often say that a reader should look at the source code to fully interpret the results, but we provide a bunch of filterable attributes as a convenience. The full spectrum of implementation choices cannot be expressed visually and by filterable attributes, but it may be the case that this scenario can be addressed by one of our attributes.

My tentative thinking is that we should either use the existing "Stripped" classification for implementations that are not actual HTTP servers (again, whatever that ends up meaning) or we create a new classification to indicate whether the HTTP processing capabilities meet whatever threshold we specify.

That said, I am also comfortable with outright dropping implementations that do not meet the threshold of implementing HTTP, whatever we end up defining that as. In that case, Stripped would remain for covering the separate case of implementations that strip down the configuration/settings of an otherwise normal web stack.

There is a separate dialog in this thread about indicating tests that are not implementing all of the test types. It is considered acceptable to implement only a subset of the test types. That said, we do expect that all implementations use some form of request routing (hence the initial sentence of the Plaintext requirements: "This test is an exercise of the request-routing fundamentals only..."). We would expect that implementations look at the requested URI and "route" to some handler code. Even if the implementation in question only handles one test type, routing of some flavor should be included.

However, some frameworks do not include a canonical request router, and in some cases it is considered "production grade" to use minimalist routing functionality that boils down to nothing more than if/else if/else conditionals. That is acceptable, although I consider such an implementation particularly telling about the functionality (or lack of functionality) of the framework when routing is done by hand.

fredrikwidlund commented 7 years ago

@bhauer Since Octane does not even parse out a method or a request URI, no request routing is possible to begin with. In addition, the assumption that '\n\r' marks the end of a request is invalid. My proposal is that it be disqualified from Round 14 for this reason and the result removed.

simongui commented 7 years ago

@bhauer This quote from the requirements, "This test is an exercise of the request-routing fundamentals only," is really incorrect, because the benchmark doesn't test routing fundamentals at all. This test is really an exercise in response writing. That's what the test does and that's what the requirements have dictated. If you want the test to be an exercise of request routing, that would require actually routing to more than one place and getting different results. Stating that a component must exist for test results to be valid, while not exercising that component in any way, makes the inclusion of that requirement invalid, and people can just make massive cheats that are useless outside of this benchmark environment.

It's a really blurry definition to say that some components must exist but don't really get used in the test. You've created explicit requirements stating that people can REALLY cheat on memory allocation and buffer management, but they must contain other, more sophisticated functionality?

You can configure some fully featured HTTP servers to route and deliver a static response, or a static page on disk, with dynamic Server and Date headers; they'll completely bypass parsing much of the HTTP request and route only on the URL. You can even configure them to bypass parsing the URL and respond only on a host/port. How is that much different from Octane? It's not much different, and it's a similar setup to your plaintext requirements.

bhauer commented 7 years ago

@fredrikwidlund Understood. We will meet over here to discuss. I'd like to hear additional opinions as well if anyone has thoughts.

If we do remove Octane based on the dialog above, I would like us to do two additional things:

  1. Clarify the requirements to better inform potential contributors of our project's intent and expectations. I will want us to clarify what meets the threshold, i.e., what satisfies our definition of an "HTTP server." I would prefer that definition to remain fairly fuzzy, while conceding that fuzziness is what leads to situations like this. But I don't want the requirements to be so particular that we have the opposite problem of otherwise-credible systems being incompatible with our requirements.

  2. Spend some time reviewing other implementations to (attempt to) be consistent. @simongui is correct that some similar implementations may fly under the radar simply because (for whatever reason) they don't reach a top position. But on the other hand, it seems reasonable that implementations that land in top positions should expect to receive some outside review and questioning.

bhauer commented 7 years ago

@simongui There are many aspects of the requirements that our toolset does not test for. In fact, there is no way for our toolset to automatically know whether some form of request routing is being done on the server. While the existence of only one test type implementation indicates that request routing may be avoided, it does not confirm that indication. Inspection of the source code is required to determine whether some routing is occurring.

This thread is an example of the process of identifying implementation tactics that violate the spirit of the project or particular test type ("cheat" is a strong word that I don't think applies in most cases) and clarifying the requirements to remove ambiguity.

It's a really blurry definition to say that some components must exist but don't really get used in the test. You've created explicit requirements stating that people can REALLY cheat on memory allocation and buffer management, but they must contain other, more sophisticated functionality?

The requirements allow for the response payload to be a singular byte buffer that is not re-allocated for each request. The rest of the HTTP dialog was intended to be provisioned by a credible HTTP server, although as I said above, the definition of "credible HTTP server" is not evident today.

We have iterated the requirements to their current state based on past similar discussions about boundaries. And we will continue to iterate them based on discussions such as this one.

You seem to be evaluating the requirements from the point of view of starting from scratch with no HTTP server and building one expressly to meet the stated requirements. However, the intent of the project, and even of this test type, was to measure the performance of real-world web application frameworks and platforms (e.g., Netty). Most implementations consume the requirements as guidance on how to configure their existing HTTP server to do the things we want (e.g., some "real-world" servers don't include the Server header by default). Aside: we also required the Date and Server headers across the board because, with such a small response payload, removing a header impacts the total response size significantly.

But to be clear, a requirement not being specifically tested by the toolset is not an indication that the requirement is considered optional. Whether or not implementations that ignore requirements are detected is a matter of manual review such as the impetus for this Github issue.

You can configure some fully featured HTTP servers to route and deliver a static response, or a static page on disk, with dynamic Server and Date headers; they'll completely bypass parsing much of the HTTP request and route only on the URL. How is that much different from Octane? It's not much different, and it's a similar setup to your plaintext requirements.

I think that is the point, however. If a fully-featured HTTP server were configured to do this, I think people like @fredrikwidlund would be (partially) satisfied. That would be a "Stripped" implementation because it had been configured to bypass a lot of its standard functionality. But it's still based on a credible HTTP server.

The question is whether something that is coded specifically for our test type requirements is merely Stripped or if it's a different classification altogether.

kellabyte commented 7 years ago

I'm with @simongui on this one. This plaintext benchmark is a test of how fast a server can respond to incoming requests with a static response, no dynamic memory allocation or buffer management, and valid dynamic Server and Date headers. His code satisfies that. This is a valid test, too, in my opinion, because my own framework (Haywire), which has competed in TechEmpower for about two years, isn't much more sophisticated than that.

If you want HTTP requests parsed completely before responding, then the server's response must contain something parsed out of the request. This test does not do that or verify that.

As @simongui says, you can configure an HTTP server to deliver a static response from an IP:PORT and skip parsing altogether, and it behaves much like your benchmark and his code. An HTTP web browser can talk to it just fine. That's a real-world use case. Is his code useful for more fully featured implementations? No, but it's a foundation for building them.

I think what @simongui is trying to say is that you're now redefining the test to require a bunch of functionality the test doesn't even execute. It must exist, it can be super hacky, your requests don't use it, but it must be there for you to accept the entry. That's really odd. Your requirements should be defined by what the test executes, causes, and expects.

You're really not doing that by specifying that internal implementations can avoid buffer management. That's not how benchmark parameters work. You set inputs and expected outputs, and it doesn't matter how a feature is implemented as long as it passes. A server is allowed to optimize, cheat, and completely disable components as long as it satisfies the expected results. But when you test other features, it may have to re-enable those components and perform more poorly in order to satisfy that other test's expected results.

Your tests execute specific things, and that exposes a set of capabilities at a measured level of performance. What you're proposing is creating requirements for hidden, unused components in tests that do not use them.

fredrikwidlund commented 7 years ago

@kellabyte It would be interesting to hear about a fully featured HTTP server that does not understand where a request starts or ends, especially combined with keep-alive. There is none, because that simply does not work.
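
To illustrate why boundary detection matters under keep-alive: a server must find the end of the headers and then honor Content-Length before it can know where the next pipelined request begins. A minimal hypothetical sketch (case-sensitive header match, no validation, for brevity):

  #define _GNU_SOURCE /* for memmem on glibc */
  #include <stdlib.h>
  #include <string.h>

  /* Return the length of one complete request at the start of buf, or 0 if
     more data is needed. Headers end at "\r\n\r\n"; the body length comes
     from Content-Length (0 if absent). Counting "\n\r" alone cannot do this. */
  static size_t request_length(const char *buf, size_t len)
  {
    const char *end = memmem(buf, len, "\r\n\r\n", 4);
    if (end == NULL)
      return 0; /* headers incomplete */
    size_t header_len = (size_t) (end - buf) + 4;

    size_t body_len = 0;
    const char *cl = memmem(buf, header_len, "Content-Length:", 15);
    if (cl != NULL)
      body_len = strtoul(cl + 15, NULL, 10);

    if (header_len + body_len > len)
      return 0; /* body incomplete */
    return header_len + body_len;
  }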

volyrique commented 7 years ago

My main concern is that the application responds in exactly the same way to the following requests:

GET /plaintext HTTP/1.1\r\nHost: 127.0.0.1:8000\r\n\r\n
GET /fortunes HTTP/1.1\r\nHost: 127.0.0.1:8000\r\n\r\n
Hello, world!\r\n\r\n

Note that the last one is not a valid HTTP request. In fact, sending Hello, world!\r\n crashes the program.

While I am not against having a bare-bones TCP server as a reference for the limit on the performance that can be achieved (and I did in fact write a similar application for testing purposes; it was even simpler), it should be marked as such, and the stripped classification seems to be the closest option.

kellabyte commented 7 years ago

I'm not opposed to the stripped idea. Debating whether it follows proper HTTP parsing is useful as well. I imagine many entries here have faults in their HTTP parsing.

But the real danger in making requirements that the test does not exercise is that someone could take http-parser, which is used in many of the top results in this benchmark, strip a bunch of functionality out of it, keep just enough to detect the end of requests correctly, and win.

I stress again, proper testing is supposed to be a form of:

  1. Define requirements.
  2. Test those requirements.
  3. Measure the results.

That's not what is being discussed here.

fredrikwidlund commented 7 years ago

@kellabyte There can hardly be any debate as to whether it follows proper HTTP parsing. There is no parsing going on.

Try sending

POST / HTTP/1.0\r\n
Content-length: 16\r\n
\r\n
\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n

If it doesn't explode, you should get a big bunch of replies back, since the body is nothing but "\r\n" sequences that Octane counts as request terminators. In fact, a suggested "improvement" for Octane would be to just do a read(), guess the number of requests without looking at the data, add a few for the fun of it, and then pump a load of responses back. It would work great with the benchmark in place; wrk would have the replies waiting straight away, without even having to worry about the round-trip time of the request.
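
Feeding that exact request through the kind of "\n\r" counting sketched earlier in the thread shows the miscount directly; a self-contained, hypothetical demonstration:

  #include <stdio.h>

  int main(void)
  {
    const char request[] =
      "POST / HTTP/1.0\r\n"
      "Content-length: 16\r\n"
      "\r\n"
      "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n"; /* one single, valid request */

    size_t count = 0;
    for (size_t i = 0; i + 1 < sizeof request - 1; i++)
      if (request[i] == '\n' && request[i + 1] == '\r')
        count++;

    /* One request went in, but the delimiter count suggests many. */
    printf("\"\\n\\r\" occurrences: %zu\n", count);
    return 0;
  }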

dantti commented 7 years ago

My 2c here is that IMO the "Plain Text" name of the tests is already a problem.

The user opens the website, looks at Plaintext, then at JSON, and the results are fairly different; if you disabled pipelining on Plaintext, the results should be only a little better than JSON.

No client will pipeline 4 million requests, so it is completely unrealistic from the start.

However, I do agree that if your "web framework" can only produce a single reply, it's a bit useless and hardly qualifies as a "framework". Still, adding a requirement to parse the HTTP method and send a 404 for a different string won't change the results much.

So IMHO the best option is to change the test name to something like "Pipelined RAW Throughput".

bhauer commented 7 years ago

Having discussed this verbally with some of the people here and having reviewed the latest comments, here is my current thinking.

First, I'd like to modify the test requirements to clarify our intent with this project, including the Plaintext test type. Namely, we intend that this project measure the performance of production-class deployments of realistic software. That is already stated as the first General Requirement, but it stands to be repeated since it needs to also segue into a discussion of what that means with respect to implementing HTTP.

Rather than elaborate in detail a set of specific HTTP-processing capabilities that we expect to be present, I will attempt to communicate the notion that the software should pass muster as a realistic HTTP server to reasonably unbiased outsiders familiar with HTTP. We do not (yet?) have a suite of validation tests to attempt to confirm that HTTP is being processed in a realistic way. However, this seems like something that most of us know when we see it.

I feel measuring the performance of "academic" (for lack of a better term) servers that fulfill only the expressly-tested requirements of our benchmark is of relatively low value to the audience at large and possibly high value to a very small portion of the audience.

This project is attempting to provide a high-water measurement of the performance of real-world frameworks and platforms, such that a reader can factor the results into a decision-making process. Including results from frameworks/platforms that are small academic projects contributed by our community is not particularly useful to the intended audience. On the flip side, we are happy to get contributions from people who are perhaps working toward a future production-grade platform and just want to get their initial version into our tests to see how it compares. Or whatever.

The Stripped classification strikes a good balance. Marking such implementations as Stripped allows us to continue measuring their performance, which is potentially interesting to the specific contributors and more academic audience members, while keeping them hidden from view by default. (Stripped tests are not visible by default and the reader must elect to make them visible using the filters panel.)

This partially dilutes the definition of Stripped, but I suspect we've already done so, and I don't feel concerned about it. We can re-evaluate separating the two notions into their own classifications in the future if that feels warranted.

There is also a thread of discussion about including requirements that we do not test for in our toolset. I won't enumerate each case, but to quickly reiterate, there are several requirements that we currently do not have automated means of enforcing. We certainly would gladly accept contributions to increase our ability to automatically validate test implementations, but I suspect it will just be a continuous effort. I also suspect there are several requirements that are infeasible to enforce with automation.

For example, another recent issue, #2782, identified an implementation that queued its database writes, and this was detected manually by another contributor. The automated tools don't presently have a way of knowing that the writes in the Updates test are committed during the scope of the request or if they were queued and committed some time later.

This issue with respect to Octane, for what it's worth, is limited to the Plaintext test type. Looking at this project from the lens of the Plaintext test exclusively distorts things somewhat. This project is a web application framework benchmark that happens to accept HTTP platforms as additional data points, treating them as if they were frameworks in their own right.

To be clear, while the Plaintext test type is a fun and very competitive space, we (TechEmpower) see it as the least-informative to the intended audience. It was added to satisfy a portion of the contributor community that wanted to showcase their platforms' supremely fast HTTP request routing and HTTP processing capabilities. HTTP Pipelining was used to maximize the rate at which that HTTP processing code was exercised, by removing a lot of the network latency bottleneck. It is likely those original contributors who would be among the most disconcerted that some of their competitors aren't implementing much of HTTP.

There is a reason that the default test type view is now Fortunes. While all of our tests are very short on requirements and functionality by design, Fortunes is the most complex of them and therefore the most interesting to the majority of our audience. Of our test types, it covers the broadest set of functions one might expect in a web framework.

To summarize:

  1. We'll mark Octane as Stripped, and potentially others that are of a similar implementation style. Doing so does not remove the results altogether but it does make them invisible by default. Stripped results can be made visible by enabling Stripped in the filters control panel.

  2. The Stripped classification will be clarified to also cover this scenario.

  3. The requirements will be clarified to affirm our intent to measure production-grade real-world HTTP servers. A fuzzy definition will be included.

bhauer commented 7 years ago

Following-up:

  1. I've changed the classification of Octane as described. Please bring any other similar implementations to our attention.

  2. I've clarified the Stripped classification in the hover tooltip for the Implementation Approach attribute and also in questions 29 and 31 of the Questions page.

  3. I've elaborated on the HTTP and request-routing expectations in the General Requirements and also indicated in the Plaintext section that Plaintext, like all test types, is subject to the General Requirements.

simongui commented 7 years ago

As per this discussion, I have submitted a PR that adds proper HTTP parsing and supports both the /plaintext and /json endpoints.

https://github.com/TechEmpower/FrameworkBenchmarks/pull/2817

I believe this is satisfactory for Octane to return to normal status, and I expect Octane to show up normally in the results like the other entries, not categorized as stripped.