TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

Code review and Test validations #6101

Closed. pmlopes closed this issue 4 years ago

pmlopes commented 4 years ago

I've looked at the current CI runs and saw that a number of new frameworks are sitting at the top of the benchmark. I got very curious about the just-js entry, mostly because it uses JavaScript, so I'd like to do a small code review:

The wiki states that code should be production ready. Although this is quite subjective, I find it hard to believe that a user would write HTTP protocol responses by hand like this:

let time = (new Date()).toUTCString()
let rHTML = `HTTP/1.1 200 OK\r\nServer: j\r\nDate: ${time}\r\nContent-Type: text/html; charset=UTF-8\r\nContent-Length: `
let rTEXT = `HTTP/1.1 200 OK\r\nServer: j\r\nDate: ${time}\r\nContent-Type: text/plain\r\nContent-Length: `
let rJSON = `HTTP/1.1 200 OK\r\nServer: j\r\nDate: ${time}\r\nContent-Type: application/json\r\nContent-Length: `
let r404 = `HTTP/1.1 404 Not Found\r\nServer: j\r\nDate: ${time}\r\nContent-Type: text/plain\r\nContent-Length: 0\r\n\r\n`

Unless, that is, you're trying to game the benchmark by skipping HTTP protocol parsing entirely and using the socket directly.

Another thing that quickly jumps out is the updates test, where the update statement is hardcoded for the benchmark run. Given that the number of queries is stated to be a variable between 1 and 500, I wouldn't call this production code:

  sock.updateWorldById20 = await compile(sock, {
    formats: [{ format: 1, oid: INT4OID }],
    sql: `update world set randomnumber = CASE id 
when $1 then $2 
when $3 then $4 
when $5 then $6 
when $7 then $8 
when $9 then $10 
when $11 then $12 
when $13 then $14 
when $15 then $16 
when $17 then $18 
when $19 then $20 
when $21 then $22 
when $23 then $24 
when $25 then $26 
when $27 then $28 
when $29 then $30 
when $31 then $32 
when $33 then $34 
when $35 then $36 
when $37 then $38 
when $39 then $40
else randomnumber 
end where id in ($1,$3,$5,$7,$9,$11,$13,$15,$17,$19,$21,$23,$25,$27,$29,$31,$33,$35,$37,$39)
`,
    fields: [],
    name: 's4',
    portal: '',
    maxRows: 0,
    params: Array(40).fill(0)
  })
  sock.updateWorldById15 = await compile(sock, {
    formats: [{ format: 1, oid: INT4OID }],
    sql: `update world set randomnumber = CASE id 
when $1 then $2 
when $3 then $4 
when $5 then $6 
when $7 then $8 
when $9 then $10 
when $11 then $12 
when $13 then $14 
when $15 then $16 
when $17 then $18 
when $19 then $20 
when $21 then $22 
when $23 then $24 
when $25 then $26 
when $27 then $28 
when $29 then $30 
else randomnumber 
end where id in ($1,$3,$5,$7,$9,$11,$13,$15,$17,$19,$21,$23,$25,$27,$29)
`,
    fields: [],
    name: 's5',
    portal: '',
    maxRows: 0,
    params: Array(30).fill(0)
  })
  sock.updateWorldById10 = await compile(sock, {
    formats: [{ format: 1, oid: INT4OID }],
    sql: `update world set randomnumber = CASE id 
when $1 then $2 
when $3 then $4 
when $5 then $6 
when $7 then $8 
when $9 then $10 
when $11 then $12 
when $13 then $14 
when $15 then $16 
when $17 then $18 
when $19 then $20 
else randomnumber 
end where id in ($1,$3,$5,$7,$9,$11,$13,$15,$17,$19)
`,
    fields: [],
    name: 's6',
    portal: '',
    maxRows: 0,
    params: Array(20).fill(0)
  })
  sock.updateWorldById5 = await compile(sock, {
    formats: [{ format: 1, oid: INT4OID }],
    sql: `update world set randomnumber = CASE id 
when $1 then $2 
when $3 then $4 
when $5 then $6 
when $7 then $8 
when $9 then $10 
else randomnumber 
end where id in ($1,$3,$5,$7,$9)
`,
    fields: [],
    name: 's7',
    portal: '',
    maxRows: 0,
    params: Array(10).fill(0)
  })

It is clearly gaming the benchmark. The requirements also state that updates should be done individually or in a batch, and as you can see from the query, it's not a batch but a single query.

This example just stood out because I wanted to learn how to improve my own code, but IMHO this is obviously an attempt to game the benchmark, and it should be clearly marked as something you would never run in production but use only when you want to go as fast as possible.

What is your take on this kind of behavior @nbrady-techempower @bhauer ?

I'd propose calling for a full peer code review to prevent gaming and keep the benchmarks honest!

NateBrady23 commented 4 years ago

it's not a batch but a single query

This is allowed per the test rules. We allow a single update query that affects multiple rows for updating but not for fetching.

The rest is something to look at. Thanks!

billywhizz commented 4 years ago

hi @pmlopes. i am the maintainer of the just-js tests and the owner/maintainer of the project they are based on. i have just noticed this issue you raised and felt i should respond.

firstly, i have to say i find your post quite surprising and disheartening. if you had questions you wanted to ask you could have just asked them instead of jumping to conclusions and making accusations of gaming or cheating the system.

i will however address the points you made above.

Point 1: not "Production Ready"

i was very clear when i made my initial submission that the code was written specifically for the benchmark and that the platform it is based on is at an early stage of development. as far as i can see there are a number of other frameworks which could be considered experimental or pre-release so i find it odd you chose mine in particular to be critical of. there are also a large number of frameworks which have implemented specific micro-optimizations in order to achieve better results so i don't see mine as exceptional in this regard.

when i started this effort i spent quite some time looking at the sources of the best performing frameworks and tried to understand how they achieved their scores and followed their lead in some places. isn't that kind of the point of the techempower benchmark in the first place? it lets the community as a whole understand where there may be room for optimizations and, judging by the continually increasing scores, i think it has been quite effective in driving performance improvements across a whole range of languages and frameworks. this is a good thing.

all benchmarks are artificial by definition and this one, as it is currently configured, is singularly focused on peak performance. it doesn't measure robustness, security, or ease-of-use. it doesn't use latency or error rates in the calculation of its scores, or even rps/thread, which, in my opinion, would give a better picture of performance than the raw rps numbers. it doesn't take into account memory usage. the sole metric as it stands is throughput under saturation. ideally, i would like to see the numbers for the postgres server and the wrk client published too so we can get some idea of when the database or the client becomes the bottleneck.

my entry is definitely not "Production Ready", but not because the code in my entry is unrealistic; it's merely because the underlying platform has not yet matured to a level where anything written on top of it could be considered ready for production.

all of the code under lib/ is generic library code which could be re-used across projects and is not specific to the techempower test. the only test-specific code is in the main script: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/JavaScript/just/techempower.js. i have not yet got around to finishing work on a module system for the platform, so the lib/ files are included here directly, but at some point soon they will be replaced by imported modules in the main script. the main module code also has a lot of boilerplate that could be refactored into library-level code with no loss in performance. the v8 jit is really amazingly good if you are careful in how you craft your JS and cause as few heap allocations as possible.
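as a generic illustration of that allocation-conscious style (a sketch only, not code from my entry; the names here are made up), reusing a single preallocated buffer on a hot path avoids creating new objects on every call, which keeps gc pressure low:

const scratch = new Uint8Array(4096) // allocated once, reused on every call

// encode an ascii string into the shared buffer rather than allocating a
// fresh array per call; returns a view over the bytes written, not a copy
function writeAscii (str) {
  for (let i = 0; i < str.length; i++) scratch[i] = str.charCodeAt(i)
  return scratch.subarray(0, str.length)
}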

i am also pretty sure most folks who keep abreast of the results don't make decisions about which framework or language to use in their projects based solely on how well they score on the techempower benchmarks. it is a useful tool that gives some insight into where some frameworks may excel or struggle, and some idea of peak theoretical performance given ideal conditions, but there are a whole host of other factors to consider when choosing a framework.

in many ways i have implemented my platform in reverse. very often, folks implement a working version of the concept initially and only later look at ways to optimize. in my case, i decided to establish a performance baseline from the outset and then use any effect on performance as a criterion for deciding on additional functionality. scoring well in techempower became a driver for me in establishing this baseline and gave me some confidence that i could meet the goals i have set for the project in making it a viable option for building high performance system level and networked software in javascript.

i wrote a little more about my journey here if you care to read it, and was hopefully explicit there about all the things that should be taken into consideration when assessing these results. all benchmarks like this should be taken with a grain of salt, and it would be nice if we saw this as more of an enjoyable community pursuit, lifting all boats with a rising tide, rather than proving which project/framework is "the best".

personally, i would be happy to see a baseline framework that was highly optimized included. maybe something written in a lower level language and with heavy optimization and very little abstraction. this would give the community a good idea of how much overhead each framework adds to a practical maximum throughput for the various scenarios tested.

maybe it seems odd that a brand new, pre-release alpha framework would score so well, especially one written in javascript. the reality is that one of the primary focuses of the platform i am working on is performance and simplicity. it is a low-level platform with less than 5k lines of c++ and javascript in the core. it is linux only, not cross-platform. it doesn't come with a lot of tools out of the box, mostly just simple wrappers around linux system calls and some simple abstractions in JS.

Point 2: "trying to game the benchmark and skip the whole HTTP protocol parsing"

the rules are very clear that pre-calculating the http timestamps at regular intervals is allowed, and other frameworks are also pre-calculating headers that are common to all or a range of requests, such as Server and Content-Type. this seems like a sensible optimization to make, especially in a garbage collected language where doing multiple unnecessary string allocations/concatenations on every request would add a lot of overhead.

i don't see any issue here myself with how i do this. it's pretty clear i am running a timer that fires every 1/10th of a second and updating the common headers, including the timestamp, for each type of response so they don't have to be computed on every request. at some point, this can be abstracted away behind a nicer api which should not have any performance cost.
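a minimal sketch of that scheme (using the standard setInterval for illustration; the timer api in my platform differs, and rJSON is the precomputed prefix quoted at the top of this issue):

let rJSON = ''

// rebuild the shared header prefix ten times a second so no request has to
// call toUTCString() or re-concatenate the common headers itself
function updateHeaders () {
  const time = (new Date()).toUTCString()
  rJSON = `HTTP/1.1 200 OK\r\nServer: j\r\nDate: ${time}\r\nContent-Type: application/json\r\nContent-Length: `
}

updateHeaders()
setInterval(updateHeaders, 100)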

so, yes i am optimizing here. but, as i stated earlier, there are a large number of frameworks doing many such micro-optimizations to lessen overhead and as far as i can see this is perfectly within the rules and indeed should be encouraged. this, as i see it, is literally the whole point of the framework benchmarks as they are currently configured. i.e. to see "how fast can it go" under ideal conditions.

my test is certainly not skipping protocol parsing. it is using a http module here: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/JavaScript/just/lib/http.js which uses a c++ binding here, based on picohttpparser. it receives a callback on every http request (or batch of pipelined requests for the plaintext test) and when the result has been composed, the response is written out to the socket using javascript string literals to concatenate the pre-calculated generic headers and the specific headers for the request being handled, e.g.: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/JavaScript/just/techempower.js#L340.

Point 3: "It is clearly gaming the benchmark"

This has already been addressed in brief by @nbrady-techempower above. issuing a single update query per batch of queries in the update test is explicitly allowed and encouraged in the rules. indeed, i picked up this technique by looking at the source for the other frameworks near the top of the rankings. yes, the code in my script is currently not very pretty, but it would be a small change to create a generic function that compiles a prepared query on the fly based on the count being passed in on the query string, and then i wouldn't need the hard-coded ones for 20/15/10/5 as i have currently. this would have no impact on the results as the on-the-fly compilation would only happen on the first request.
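a rough sketch of what that generic function might look like (compile and INT4OID are the helpers from the snippets quoted above; the per-socket cache and the statement naming scheme here are illustrative, not the actual code):

// build and compile the CASE/IN update statement for an arbitrary batch
// size n, caching the prepared query on the socket so compilation only
// happens on the first request for each n
async function getUpdateWorldById (sock, n) {
  sock.updates = sock.updates || {}
  if (sock.updates[n]) return sock.updates[n]
  const cases = []
  const ids = []
  for (let i = 1; i < n * 2; i += 2) {
    cases.push(`when $${i} then $${i + 1}`)
    ids.push(`$${i}`)
  }
  sock.updates[n] = await compile(sock, {
    formats: [{ format: 1, oid: INT4OID }],
    sql: `update world set randomnumber = CASE id ${cases.join(' ')} else randomnumber end where id in (${ids.join(',')})`,
    fields: [],
    name: `su${n}`,
    portal: '',
    maxRows: 0,
    params: Array(n * 2).fill(0)
  })
  return sock.updates[n]
}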

Summing Up

having said all of the above, i would like to ask you again: what exactly are your objections and what action would you consider appropriate? nothing i have done, as far as i can see, is outside the spirit or rules of the project, but i am happy to be corrected on that by the project owners. if i am guilty of anything, i am guilty of exceeding my own expectations when i set out on this journey. most of the credit for that should go to the incredible team working on the v8 engine at google. the performance improvements in recent years have been enormous and the results you see here have far exceeded what i had hoped for when i started.

i find it particularly unfair that i, as a single developer working on my own on what has been to date a hobby/aspirational project, am being accused of gaming/cheating by someone who is an active maintainer on at least two other frameworks based on a platform that is part of the eclipse foundation and has been heavily backed by a couple of very big industry players, one of which you are employed by if i am not mistaken. to me, this smacks of gatekeeping and is against the spirit of the techempower project as i see it.

maybe what you are looking for is an extra category/scale for how high-level or low-level a framework is. it would be useful to be able to filter on something like that but i still think the overall results should be ranked on raw performance and observers can make their judgements based on their needs.

i'm happy to make any changes suggested by @nbrady-techempower and @bhauer and am hugely thankful for the work they do bringing this service to the community and i certainly don't wish to be seen as "gaming the system" and disrespecting their efforts.

pmlopes commented 4 years ago

Hi @billywhizz I'm sorry if my comments came out as accusations. That was not the intention. Indeed, my question to the benchmark owners was how to clarify what production ready means and what is/isn't allowed.

The point about being production ready and the optimization of headers is tricky (and again, it's just my own opinion). I understand that I can optimize by caching a header for 1 second, but if I read it correctly, you're just composing the whole HTTP response in one string. This means the optimization suits the test (because the test is trivial) but isn't suitable for common situations such as multipart form data, chunked responses, etc...

So again, no hard feelings, sorry if I offended you. I just wanted to be sure that we all play by the same rules and that the rules are clear to everyone (as an example, I was mistaken about the updates).

msmith-techempower commented 4 years ago

The only thing that stood out to me is the way the SQL is formed. I agree that I wouldn't personally use it in production, but that's just preference, I think. Based on the blog post I read about Just(js), it seemed like this was just the price paid for having the database driver in a lower-than-JS layer of the framework.

I understand that I can optimize by caching a header for 1 second, but if I read it correctly, you're just composing the whole HTTP response in one string. This means the optimization suits the test (because the test is trivial) but isn't suitable for common situations such as multipart form data, chunked responses, etc...

There are plenty of top performers who make the same optimizations to try and score better on the benchmarks. The argument could be made that we should mark such unrealistic approaches as stripped, but it feels unfair to apply such limitations to only one framework in the name of fairness.

To be clear, we do want test implementations to at least attempt to be production grade, but we also want to encourage contributors to implement cool stuff and compete on performance. Personally, I love JavaScript and would be willing to write more of it server-side if it were more performant; so, reading about Just and seeing it perform well on the benchmarks is awesome!

NateBrady23 commented 4 years ago

I don't think there's anything more to do here at the moment. I've already asked @billywhizz if he can make moving the /lib stuff out a priority. That way, it will be a lot clearer to everyone how a user of the framework might instantiate some of these tests. In that regard, we often don't see the inner workings of a lot of the frameworks here, and that's precisely what the test validations are for. We're continuing to improve on those and I think you guys will see some cool stuff in that area in the coming months.

Thanks for keeping the discourse civil! Going to close this for now.

billywhizz commented 4 years ago

Hi @billywhizz I'm sorry if my comments came out as accusations. That was not the intention. Indeed, my question to the benchmark owners was how to clarify what production ready means and what is/isn't allowed.

The point about being production ready and the optimization of headers is tricky (and again, it's just my own opinion). I understand that I can optimize by caching a header for 1 second, but if I read it correctly, you're just composing the whole HTTP response in one string. This means the optimization suits the test (because the test is trivial) but isn't suitable for common situations such as multipart form data, chunked responses, etc...

So again, no hard feelings, sorry if I offended you. I just wanted to be sure that we all play by the same rules and that the rules are clear to everyone (as an example, I was mistaken about the updates).

@pmlopes no worries. thanks for responding. i'm sorry if i over-reacted a little! =)

just to be clear, i don't pre-compose the complete headers. i only pre-compose the common parts of the headers that will be the same on each request (Server, Content-Type, Date (updated every 100msec), and the beginning of the Content-Length header). then, on each request, i populate the rest of the headers (as well as the actual size in Content-Length). i think that is fair, and the headers as a whole are still generated dynamically based on the particular response being served.
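so the per-request step is roughly this (a sketch; sock.writeString here is a stand-in, not necessarily the actual socket api):

// rJSON is the shared prefix ending in "Content-Length: "; only the length,
// the blank line and the body vary per response. body.length equals the
// byte length here only because the payloads in these tests are ascii.
function sendJSON (sock, body) {
  sock.writeString(`${rJSON}${body.length}\r\n\r\n${body}`)
}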

@nbrady-techempower I should be able to pull this together and "modularize" it over the next couple of weeks, and hopefully that will make it look a bit less "suspicious". thanks to all. i've really enjoyed the challenge of pushing the boundaries of JS here and can now focus on more important things like functionality and ease of use. ;)

billywhizz commented 4 years ago

I've updated my response above to remove the "Accusation" bit. sorry again if that was OTT.

billywhizz commented 4 years ago

@pmlopes btw - i really like the vert.x project - have been following it for a long time. i hope to see it continue to thrive!