TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

Database commands coalescing #7019

Closed sebastienros closed 2 years ago

sebastienros commented 2 years ago

The Npgsql (PostgreSQL driver for .NET) contributors have been looking into other drivers' implementations to understand where improvements can be made.

Here is a current picture for the multi-queries scenario:

[image: chart of multi-query benchmark results]

There are several frameworks around 30-34K (not all displayed here), and then a jump to ~60K.

After analyzing the network traffic between the Drogon framework and the Postgres server (shown in the following screenshot and validated with other specific payloads), it appears that queries issued from multiple HTTP requests are coalesced into the same network packets and executed with a single "SYNC" command, which tells Postgres to execute them as a batch.

[image: packet capture showing queries from multiple HTTP requests coalesced before a single SYNC]

Bundling together multiple commands in a single Sync batch has important consequences: if any command errors (e.g. unique constraint violation), all later commands are skipped until the Sync. Even more importantly, all commands are implicitly part of the same transaction.
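
To make those semantics concrete, here is a minimal sketch against a hypothetical low-level driver interface (sendBind/sendExec/sendSync are illustrative names, not any real driver's API):

// hypothetical message-level interface, for illustration only
interface LowLevelConn {
  sendBind(sql: string, params?: unknown[]): void;
  sendExec(): void;
  sendSync(): void;
}

function demoSyncBatch(conn: LowLevelConn): void {
  conn.sendBind("INSERT INTO t VALUES (1)"); conn.sendExec(); // executes
  conn.sendBind("INSERT INTO t VALUES (1)"); conn.sendExec(); // fails (assuming a unique index on t)
  conn.sendBind("INSERT INTO t VALUES (2)"); conn.sendExec(); // skipped: the batch has already failed
  conn.sendSync(); // server answers ErrorResponse ... ReadyForQuery
  // because all three Execs share one Sync, they share one implicit
  // transaction, so the first INSERT is rolled back as well
}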

Here we have not only implicit batching for the same request (20 queries), but more importantly batching across requests.

Our interpretation of the TechEmpower rule below leads us to believe that this technique violates it:

Except where noted, all database queries should be delivered to the database servers as-is and not coalesced or deduplicated at the database driver. In all cases where a database query is required, it is expected that the query will reach and execute on the database server. However, it is permissible to pipeline traffic between the application and database as a network-level optimization. That is, two or more separate queries may be delivered from the application to the database in a single transmission over the network, and likewise for query results from the database back to the application, as long as the queries themselves are executed separately on the database server and not combined into a single complex query.

We haven't checked whether the other frameworks showing the same high performance are using the same technique.

/cc @an-tao @roji

roji commented 2 years ago

To add to @vonzshik's answer, and to summarize my views (once again):

billywhizz commented 2 years ago

@vonzshik

...but saying that a driver isn't going to support calling functions with side effects to make queries run faster in extreme cases doesn't look that great.

i didn't say this. these are choices a developer can make and options the driver can offer. it's perfectly legitimate for the developer to decide to have a connection which has only queries with the properties above while having a different connection that has different properties (like sync on every statement).

And if we do allow this, where exactly we're going to draw the line between being acceptable or not? For example, having a select-only supporting driver might make sense for someone.

once again, i never proposed a "select only" driver. i said it is legitimate and safe to pipeline or batch statements under certain conditions and if the driver implements this pipelining correctly and handles errors appropriately. the only place in the techempower tests where a sync is absolutely required after an exec in order to avoid returning incorrect results is directly after the update statement(s) in the updates test. all other queries have no side effects and are legitimate candidates for this kind of pipelining.

That's not going to work if we want users to decide the platform/provider they're going to use based on the benchmark results, since every single driver on the top is going to take shortcuts (which usually are not publicized and are going to blow up in your face eventually).

again, you are making assertions that are not backed up by evidence about potential failure scenarios that nobody on this thread is able to demonstrate in the context of the techempower tests as they are currently configured. do you really believe the techempower results have that big an influence on people's choice of database driver? given the dominance of voices from the MSFT/npgsql community on here i think it's more about you being able to continue to promote your framework using the techempower scores - something you have done multiple times in the past and is in fact even mentioned on your github landing page.

Even not taking such an extreme example into account, I have no doubt in my mind that every single one of us here can make a 'raw' provider (I actually have one for testing potential optimizations and platform overhead) which satisfies every single benchmark requirement but forgoes some of the things you expect from a driver (for example, the driver I linked doesn't check whether the database returned an error; this has to be handled by the user). Again, this doesn't make it useful either to us or to users (due to the extreme requirements you have to meet to use these drivers).

i think anyone making these kinds of decisions should be looking at the code and asking questions about implementations for less popular frameworks/libraries here. i don't buy the argument that the raw RPS numbers have a big influence on these choices. if they actually do, well then i think that says a lot more about the people making the decisions than about any of the frameworks on here.

To summarize, having a general-purpose driver optimize some use cases with explicit user's approval is great.

i agree the decision should be one made explicitly by the developer but i keep seeing reference to "general purpose" database drivers. i didn't see anything in the rules that excludes database drivers based on how general purpose or fully featured they are.

Whether we should have separate benchmarks (or mark these contenders with a tag) is debatable. Normalizing such optimization by saying that there is "little to no loss of security or robustness" will only encourage people to create drivers to abuse benchmarks in a race to the top. While I do work on PG driver in my spare time, I'm definitely not an expert and as such, I would rather be a bit more conservative while dealing with a potential data loss or even data corruption.

again, you are making assertions without any strong evidence that failures and data loss will happen if this pipelining technique is used. while that may be true if you used it without understanding the implications or explicitly opting into it that's fair but as you seem to say above this is a valid thing to do if the choice is explicit.

it is pretty obvious to me this is a legitimate thing to do and is fully supported by the postgres wire protocol. it's mentioned clearly here. [screenshot: excerpt from the PostgreSQL wire protocol documentation]

also, in version 14 of postgres, this pipelining is now integrated and fully supported in libpq. [screenshot: excerpt from the libpq pipeline mode documentation]

billywhizz commented 2 years ago

@roji

To add to @vonzshik's answer, and to summarize my views (once again):

re. "once again". i understand your position. i just disagree with it.

  • A general purpose driver cannot reliably know whether user-provided SQL has side effects or not. Like @vonzshik, I don't believe TFB is about crafting drivers whose sole purpose is to achieve high scores in benchmarks without being usable in the real world. I don't see what the point of that would be, nor how it's helpful to compare such a driver with a real-world driver.

this is your view of what techempower is and seems to be heavily informed by your specific area of interest which is postgres drivers. there are many different implementations here in various stages of development and community adoption (for example, literally nobody is using mine). what you seem to be calling for is much stricter rules on many different aspects of the implementations. maybe you should propose all these new rules in one place and see what the community and the maintainers think? personally, i like the fact that techempower is open to new and unproven frameworks and i've learned a lot from investigating the different implementations and the optimizations they have made. isn't that the whole point? it's an opportunity to learn? it seems to me you have a very different view that it should be purely a marketing tool for more mature and well-backed frameworks and libraries like the ones you work on.

  • If you're assuming read-only queries, then I don't think it matters whether prepared queries or binary numeric data is used. Anything can be safely rolled back without consequences, since there are no side effects. The problem is that the read-only assumption can't be made in the real world.

this read only assumption absolutely can be made in the real world. if you know the requirements up front then you can implement for this, as long as your database driver allows it and handles it correctly. i don't get how you can make the assertion that this approach is not applicable in the real world. also see my point above about this technique being fully supported in postgres wire protocol and latest release of client libraries. did they do all that work just so nobody could use this technique in the "real world"?

  • As an aside, even in read-only scenarios, errors can occur. An unoptimized query or a missing index can cause a timeout, for example.

a timeout in receiving results back on the database connection? sure, and that would be something that should be handled appropriately on the client side. if the previously returned results in the "batch" have been sent back to the client that is ok if the queries in the batch have no side effects. again, you are making claims that simply aren't true and have yet to provide a concrete example of how this approach will fail within the parameters of the techempower tests. if you can do that, then i'd be prepared to take your arguments more seriously.

  • "little to no risk involved" is a problematic bar of quality for a database driver... Ask real-world users if they'd accept "little risk of data changes silently disappearing", in order to get slightly better perf, and I don't think you'll find many takers.

"once again", this is just innuendo. you are making assertions about potential failure modes in a general technique being used here that you are unable to demonstrate. this is all on top of the opening post claiming the implementations cited violated the current rules when that is patently untrue.

Brar commented 2 years ago

@billywhizz I have the impression that we have reached a point where further discussion is pretty much pointless. You have stated your point of view, others have stated theirs, and there is some fundamental disagreement and little learning opportunity left for any of us.

this is your view of what techempower is

That is also my view of what TechEmpower is, but that's not important. What is important is TechEmpower's view. They've stated some of it at https://www.techempower.com/benchmarks/#section=intro and at https://www.techempower.com/benchmarks/#section=motivation and I read things like "Each framework is operating in a realistic production configuration." and "Choosing a web application framework involves evaluation of many factors." But these are random picks that I chose to support my point of view that some realistic usage is part of the goals, and you might as well find parts that support your point of view.

The longer this thread becomes and the more we're discussing opinions instead of technology the harder it will become for people caring about technology to find their way through it.

I'm out now, waiting for TechEmpower's decision on this.

NateBrady23 commented 2 years ago

Sorry, everyone, we've been swamped with client work. There's a lot to catch up on here. Something I will point out after reading the last few posts: we do expect that a framework permutation will eventually implement all of the tests. That's to say, if you're relying on a driver optimization that only reliably works for the multiple queries test and shouldn't be used for the updates test, that wouldn't belong here.

Though, I do agree with @billywhizz that we don't want to stifle innovation by making the test rules too strict. However, it's important to note that the whole idea here is seeing how certain configuration changes affect all aspects of the framework. i.e. if you tweak this configuration it helps multiple query reads but slows down updates, etc.

roji commented 2 years ago

Here are some answers to the stuff I find most important...

First, just to be clear, I agree that for read-only workloads (no side-effects), this technique isn't risky per se. Nobody is/was claiming otherwise.

i didn't say this. these are choices a developer can make and options the driver can offer. it's perfectly legitimate for the developer to decide to have a connection which has only queries with the properties above while having a different connection that has different properties (like sync on every statement).

The problem with this - looking at it from a production/safety point of view - is that if someone gets this wrong and does accidentally do something with a side-effect, then you're in a very dangerous situation where the potentially massive error (again, lost data) will go unnoticed. I'm totally for low-level perf opt-ins, but as responsible library developers, we have to also take into account what happens when someone accidentally gets it wrong.

In other words, the question here is whether any behavior can be considered OK/safe as long as it's hidden behind an opt-in - no matter how dangerous. I don't believe so.

do you really believe the techempower results have that big an influence on people's choice of database driver?

Yeah, I do (or on programming language/whatever; see @Brar's quotes from above which seem to support this as a goal of TE).

But even if we ignore that for a moment, at the end of the day, TE shows different drivers/frameworks, sorted by RPS; what exactly is a user supposed to understand by looking at this list? There's obviously no way to know that driver X is significantly faster because they're using a special opt-in technique, that both has dangerous consequences when used wrong and only works for read-only queries. So the results become very hard to actually understand in a useful way.

So yes, I do agree with you that at the end of the day it's a question of what TE is about. If it's a playground for unsafe experimentation that isn't supposed to be production-ready, that's one thing; but as @Brar showed above, it seems that the idea is to show "realistic production configurations", which I don't believe the above is compatible with.

given the dominance of voices from the MSFT/npgsql community on here i think it's more about you being able to continue to promote your framework using the techempower scores - something you have done multiple times in the past and is in fact even mentioned on your github landing page.

To be clear, we (the Npgsql team) could save ourselves the trouble of this discussion and just implement this feature, just like you have, and make a huge jump in the scores. It wouldn't even be particularly difficult for us to do so. It's just that we've discussed this quite a lot and believe this is a dangerous feature to have in a real-world database driver.

also see my point above about this technique being fully supported in postgres wire protocol and latest release of client libraries. did they do all that work just so nobody could use this technique in the "real world"?

The reason Sync-batching is supported at the wire level is IMHO so that explicit batching/pipelining can be implemented, which is perfectly fine and has nothing to do with this. The wire protocol doesn't tell you whether you can/should use this capability for mixing unrelated commands together in a single batch.

[...] this is all on top of the opening post claiming the implementations cited violated the current rules when that is patently untrue.

Regardless of all of the above, the 7th general requirement rule states: "However, it is permissible to pipeline traffic between the application and database as a network-level optimization". I agree that there's some vagueness there which could be interpreted (that's why we're asking for clarification), but IMHO Sync-batching is definitely not a network-level optimization, since it affects protocol and transactionality/error semantics.

billywhizz commented 2 years ago

@roji

In other words, the question here is whether any behavior can be considered OK/safe as long as it's hidden behind an opt-in - no matter how dangerous. I don't believe so.

is this the question? this is why i am a little exasperated. you keep changing the question you are asking and moving the goalposts. i thought we were talking about something very specific - namely whether batching/pipelining of statements on the wire for postgres database requests is ok or not. then it became whether cross request batching of the sync messages is ok. it would be useful for everyone if someone from your group stated again clearly what change they are proposing and provided some clear evidence (note: not opinion) as to why this change would benefit the community and the project. i've already asserted that the behaviour here is safe for the techempower tests and i don't see any clear refutation of that.

you are the ones proposing changes to the rules after years of them existing as they are today, so i think the bar should be pretty high for making that case. maybe it would be better if some resources were dedicated to adding better/more comprehensive/less "gamable" tests to the existing suite? i'm all for that as i think the existing tests are not very realistic and tend to stress RPS over all else. not meaning to be critical of the TE team here at all - it's great to have this service provided for free and i wish it were better funded and could grow some more.

But even if we ignore that for a moment, at the end of the day, TE shows different drivers/frameworks, sorted by RPS; what exactly is a user supposed to understand by looking at this list? There's obviously no way to know that driver X is significantly faster because they're using a special opt-in technique, that both has dangerous consequences when used wrong and only works for read-only queries. So the results become very hard to actually understand in a useful way.

but, surely it is up to library/framework developers and team members to do this analysis and discuss/explain the results and the benefits/tradeoffs of the different approaches? again, i think it comes back to very different views of what TE is for. i see it mostly as a community learning tool and think any rankings should be taken with a large pinch of salt, like any microbenchmarks.

If it's a playground for unsafe experimentation that isn't supposed to be production-ready, that's one thing;

have you looked at many of the frameworks here? it's pretty clear many of them are new and not production ready and many are experimental or are using various "unusual" techniques in order to improve perf. again, what specific proposal do you have here and what kind of effect do you think that would have on the submissions if it were implemented?

maybe a good proposal you could make here would be a new flag for "production ready" or "experimental". i'd be fine with that and i'd certainly classify my own as not production ready and experimental at this stage. but it is "realistic" given all the libraries it depends on are having to be built from the ground up and it's not even near a 1.0 release yet. if it was easy to filter on this flag then folks could easily narrow their choices down to "production ready" frameworks and TE folks could decide if they want to only include those ones in the regularly (or not so =)) published roundups?

given the dominance of voices from the MSFT/npgsql community on here i think it's more about you being able to continue to promote your framework using the techempower scores - something you have done multiple times in the past and is in fact even mentioned on your github landing page.

To be clear, we (the Npgsql team) could save ourselves the trouble of this discussion and just implement this feature, just like you have, and make a huge jump in the scores. It wouldn't even be particularly difficult for us to do so. It's just that we've discussed this quite a lot and believe this is a dangerous feature to have in a real-world database driver.

ok, so basically you have decided you don't want to expose postgres pipelining to the developer so everyone has to abide by your decision?

also see my point above about this technique being fully supported in postgres wire protocol and latest release of client libraries. did they do all that work just so nobody could use this technique in the "real world"?

The reason Sync-batching is supported at the wire level is IMHO so that explicit batching/pipelining can be implemented, which is perfectly fine and has nothing to do with this. The wire protocol doesn't tell you whether you can/should use this capability for mixing unrelated commands together in a single batch.

no, it doesn't, so we're just going on your opinion/preferences then? as i said above i agree with you that this isn't a reasonable approach for all scenarios but as i have also said before my opinion is that it can be done safely for all of the techempower tests as they currently stand, even the update test as long as an explicit sync is allowed on the actual update statement.

[...] this is all on top of the opening post claiming the implementations cited violated the current rules when that is patently untrue.

Regardless of all of the above, the 7th general requirement rule states: "However, it is permissible to pipeline traffic between the application and database as a network-level optimization". I agree that there's some vagueness there which could be interpreted (that's why we're asking for clarification), but IMHO Sync-batching is definitely not a network-level optimization, since it affects protocol and transactionality/error semantics.

the problem with your argument here is the tests as they currently stand allow this kind of sync/cross-request batching to be done and for the solution still to be safe. imho, this is perfectly safe to do for all the tests where the queries are readonly given the preconditions discussed in various places above. so, the only question is how do we also handle the update test. here is some pseudocode/work in progress that might make that clearer:

const pg = await postgres.createSocket(config)
// enable pipeline mode for this connection, meaning syncs will automatically be
// sent for each batch of commands put on the wire for postgres
pg.pipeline = true

// this command will not have a sync automatically appended by default, as the
// connection has pipeline mode enabled. if pipeline mode were disabled, it would
const getWorldById = await pg.compile(selectWorldSql)

const updateWorlds = await pg.compile(updateWorldsSql)
// because we set sync = true on this command, it will always have a sync appended
// to the exec when it is called
updateWorlds.sync = true

// this will put 20 Bind/Exec pairs in the buffer, which is flushed onto the wire
// for postgres if there are no pending responses or the buffer is full
const worlds = await Promise.all(spray(20, () => getWorldById(getRandom())))
worlds.forEach(world => world.randomnumber = getRandom())

// this will put a Bind/Exec/Sync in the buffer, so it and all preceding queries
// since the last sync will be part of one transaction/syncpoint
await updateWorlds(worlds)

again. this is perfectly safe. each batch is explicitly tied to a http request and if any query in a batch fails, then the driver can hand an error back which can be handled safely?

we will end up with this on the wire

BIND
EXEC
... 19 other queries
BIND
EXEC // update
SYNC
BIND
EXEC
... 19 other queries
BIND
EXEC // update
SYNC
...

is that wrong/broken/unsafe?

for the multi-query test, we could decide whether to do a sync on every http request or not. for the individual query tests (db and fortunes), if we did pipelining, then it would look like this: [screenshot: pipelined wire traffic for the single-query tests]

this is, in effect, using the sync mechanism, where it is known the queries have no side effects, to optimise the network interaction with the postgres wire protocol. are you saying, at scale, it would be acceptable to justify paying for 30-50% more resources to service the load just because we don't want to implement some logic client side to retry a failure when the client receives a transaction rollback error, which is highly unlikely to ever happen anyway?
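
for illustration, the client-side retry logic alluded to above could look like this sketch, assuming a hypothetical runBatch helper that sends a set of read-only queries as one pipelined Sync batch and rejects if the server rolled the batch back:

// hypothetical: runBatch pipelines the queries with a single trailing Sync
// and rejects if Postgres aborted/rolled back the batch
async function queryWithRetry<T>(
  runBatch: (sqls: string[]) => Promise<T[]>,
  sqls: string[],
  maxAttempts = 3,
): Promise<T[]> {
  for (let attempt = 1; ; attempt++) {
    try {
      // the queries are side-effect-free, so re-running them is harmless
      return await runBatch(sqls);
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
    }
  }
}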

not sure if i can contribute much more to the discussion. i think it is really your move to either provide a more compelling argument for why this kind of optimisation should be disallowed or to modify your proposal to be less disruptive and overbearing for the existing community of implementers and maintainers.

ajcvickers commented 2 years ago

What about if a network error occurs? I would expect a driver to potentially:

However, unless I've opted into a fire-and-forget mode, which isn't the case here, I would consider it a driver bug if it:

This last case is what we are talking about. Regardless of the chance/conditions required to generate the error, I don't believe a driver should provide acks when data has not been finally committed to the database.

roji commented 2 years ago

is this the question? this is why i am a little exasperated. you keep changing the question you are asking and moving the goalposts. i thought we were talking about something very specific - namely whether batching/pipelining of statements on the wire for postgres database requests is ok or not.

That seems... disingenuous... I've repeatedly written above that IMO batching in itself isn't unsafe, as long as it's explicit, but that implicit, cross-request batching is not. In your response, you wrote:

it's perfectly legitimate for the developer to decide to have a connection which has only queries with the properties above while having a different connection that has different properties (like sync on every statement).

... which I understood to mean that you consider this behavior safe, as long as the developer opts into implicit, cross-request batching. That is what raises the question of whether a dangerous feature is OK as long as it's behind an opt-in.

it would be useful for everyone if someone from your group stated again clearly what change they are proposing and provided some clear evidence (note: not opinion) as to why this change would benefit the community and the project.

I think this has been done (numerous times) above. You may not be convinced - which is OK - but that isn't to say it hasn't been clearly stated. To attempt yet again in tight form: implicit cross-command batching is unsafe because it can cause the effects of earlier commands to disappear silently on failure. This is especially bad without clear user opt-in, and from a cursory check, some/most of the top frameworks don't have this - drivers implicitly batch by default! But even with a clear opt-in I view this as quite dangerous, because of what happens when users get it wrong. I don't know how I can make the above clearer. In addition, beyond safety, I think that such a fundamental optimization that's only reliable for a specific workload (i.e. read-only) is problematic in these benchmarks (see @nbrady-techempower's comment above).

To drive the point home, here's a thought experiment from @vonzshik: let's consider a feature where the driver recognizes the same query SQL, and caches query results - a form of memoization. Now, such an optimization would obviously make things extremely fast, and a driver implementing it would shoot to the very top in the Fortunes benchmark. This is specifically prohibited in rule 7 ("In all cases where a database query is required, it is expected that the query will reach and execute on the database server."). Now, someone could claim this is a legitimate driver optimization which could help in the real world, but is this something TE should allow? I'd argue against it, since the point of the scenario isn't to show the best perf for some very narrow scenario where some specific optimization is possible (same query over and over), but rather to give an idea of general driver perf, where interacting with the database is unavoidable.
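
To make the thought experiment concrete, the prohibited memoization would amount to something like this sketch (illustrative only - nobody in this thread is proposing it):

// cache results keyed by SQL text; repeated queries never reach the server,
// which is exactly what rule 7 forbids
const resultCache = new Map<string, unknown[]>();

async function memoizedQuery(
  realQuery: (sql: string) => Promise<unknown[]>, // hypothetical driver call
  sql: string,
): Promise<unknown[]> {
  const hit = resultCache.get(sql);
  if (hit !== undefined) return hit; // served from memory, no database access
  const rows = await realQuery(sql);
  resultCache.set(sql, rows);
  return rows;
}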

i think it comes back to very different views of what TE is for. i see it mostly as a community learning tool and think any rankings should be taken with a large pinch of salt, like any microbenchmarks.

That's not how I read the Motivation and questions page - note especially the "realistic" vs. "stripped" category. There's definitely a place for "non-realistic/potentially unsafe" experiments, and I have nothing against them, as long as they're clearly marked as such. But there's a problem when we start to mix experiments and learning tools with drivers that hold themselves to production-grade reliability/safety standards - the comparison becomes meaningless, and IMHO indeed, potentially dangerous.

maybe a good proposal you could make here would be a new flag for "production ready" or "experimental".

I agree this is a good idea - and does exist to a certain extent via the realistic/stripped distinction. Though there still needs to be a common set of rules which specify what's allowed and what isn't, and those same rules would need to apply to both production-ready and experimental, right? Otherwise, experimental just becomes a place where you don't have to comply with any rules. In other words, I'm not sure it would make much sense to allow something in experimental that wouldn't ever be allowed in production-ready.

ok, so basically you have decided you don't want to expose postgres pipelining to the developer so everyone has to abide by your decision?

You seem to be consistently misrepresenting my position here (or misunderstanding it). I've already written several times that IMHO PG pipelining is totally fine as long as it's explicit for a given batch, i.e. I as the developer hand the driver a set of SQL commands and tell it to batch them (see this comment). That's very different from the driver batching unrelated SQL commands from different origins/HTTP requests - even if the user opts into that.
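
For contrast, a sketch of the explicit batching described here, with hypothetical API names: the developer hands one call a known set of statements, and the batch boundary can never cross that call:

// hypothetical explicit-batch API, for illustration only
interface Batch {
  add(sql: string, params?: unknown[]): void;
  run(): Promise<unknown[][]>; // one Bind/Exec per statement, one trailing Sync
}
interface BatchingConn {
  createBatch(): Batch;
}

// every statement in the batch comes from this one call site; nothing
// originating from another HTTP request can be appended to it
async function loadWorlds(conn: BatchingConn, ids: number[]): Promise<unknown[][]> {
  const batch = conn.createBatch();
  for (const id of ids) {
    batch.add("SELECT id, randomnumber FROM world WHERE id = $1", [id]);
  }
  return batch.run();
}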

Regardless of all of the above, the 7th general requirement rule states: "However, it is permissible to pipeline traffic between the application and database as a network-level optimization". I agree that there's some vagueness there which could be interpreted (that's why we're asking for clarification), but IMHO Sync-batching is definitely not a network-level optimization, since it affects protocol and transactionality/error semantics.

the problem with your argument here is the tests as they currently stand allow this kind of sync/cross-request batching to be done and for the solution still to be safe. imho, this is perfectly safe to do for all the tests where the queries are readonly given the preconditions discussed in various places above.

You misunderstood my point. Rule 7 (and other rules) prohibit certain things regardless of whether they're safe or not - that's another discussion. TE is setting some rules on what it is they'd like to see benchmarked, to make sure things stay apples-to-apples, and so that we benchmark what we actually want to benchmark (i.e. actual database access). If drivers start to do memoization (as above), or to coalesce multiple SQL commands into one, we're leaving that common territory and comparing apples and oranges.

imho, this is perfectly safe to do for all the tests where the queries are readonly given the preconditions discussed in various places above. so, the only question is how do we also handle the update test.

Here again, I don't think that's the discussion - I know (and have known from the beginning...) that specifically for these benchmarks (including the updates one), this technique won't cause issues. The problem, as I've already written, is that these techniques are tailored specifically for these TE scenarios, and would be dangerous in the general case in the real world. And as such, the question is whether it makes sense for TE to allow it.

billywhizz commented 2 years ago

@ajcvickers

This last case is what we are talking about. Regardless of the chance/conditions required to generate the error, I don't believe a driver should provide acks when data has not been finally committed to the database.

nobody here is suggesting acknowledging uncommitted data to the web client is a good idea or should be allowed so i don't really understand what this adds to the conversation.

billywhizz commented 2 years ago

@roji

... which I understood to mean that you consider this behavior safe, as long as the developer opts into implicit, cross-request batching. That is what raises the question of whether a dangerous feature is OK as long as it's behind an opt-in.

yes. after investigating, i consider it safe to send multiple execs from different http requests to the database as a pipeline with a single sync, as long as the developer has opted into this behaviour and is aware of its implications, namely:

i.e. this is safe and perfectly compliant with the postgres wire protocol and a valid optimization

sequenceDiagram
    autonumber
    client->>web: GET /db
    client->>web: GET /db
    web->>db: BIND<br>EXEC (1)<br>BIND<br>EXEC (2)<br>SYNC
    db->>db: process
    db->>web: ROW (1)<br>COMPLETE<br>ROW (2)<br>COMPLETE<br>READY
    web->>client: ROW (1)
    web->>client: ROW (2)

i can accept you could argue this is a specific optimization for the TE tests but the reality is this user-definable (as discussed/agreed above) behaviour is safe across all the TE tests as they are today and have been for a long time. so it's general enough to cover all the scenarios in the TE tests as well as being applicable more widely in the field. it doesn't require any trickery to work across all the tests and it can be safely multiplexed onto one postgres connection per web server thread/process. i.e. @matt-42's implementation is safe at a wire protocol level, although he has accepted, as we all seem to, the need for explicit opt-in from the developer.

i think it's important to stress that, given postgres' process-per-connection model, it's only possible to drive the postgres process on the other end of the connection to capacity by doing sync batching in this way for small/fast queries. if you force a sync/flush on every sql command then you effectively throttle the postgres process. in this scenario, your only option for scaling is using multiple connections per web server process, which causes many more postgres processes to be created (no. of web servers times no. of connections per web server) and results in much higher contention for all resources on the database server. surely uncovering these kinds of safe, compliant optimizations that allow better utilization of resources is a large part of why the TE benchmarks exist?

But even with a clear opt-in I view this as quite dangerous, because of what happens when users get it wrong.

i'm not sure how this is relevant. there are different frameworks with different levels of abstraction and different levels of hand-holding and protection for the developer. you seem to be saying we should only allow one kind of framework and exclude any frameworks that allow lower level access or more control/options for the developer.

I think that such a fundamental optimization that's only reliable for a specific workload (i.e. read-only) is problematic in these benchmarks

it seems you haven't actually read what i have written (the pseudocode above is i think pretty clear?). you can still incorporate commands with side effects (e.g. in the /update test) if they are followed by a sync and this is something that can be controlled by the developer. you even admit later on in this reply:

I know (and have known from the beginning...) that specifically for these benchmarks (including the updates one), this technique won't cause issues

To drive the point home, here's a thought experiment from @vonzshik:

this is just a completely random scenario you have interjected into the conversation here. this kind of caching has nothing to do with what we are discussing, which is using the postgres wire protocol as designed.

That's not how I read the Motivation and questions page - not especially the "realistic" vs. "stripped" category.

realistic v stripped is, in my understanding, more about an idiomatic/common implementation of the web framework code versus a hand-crafted/low-level one that would not be a common pattern for developers using that framework or language. to me, this is not the same as saying in general this framework is experimental or production ready. the reality here is that anybody deciding to adopt any particular framework is going to do that research themselves before coming to a decision. at least i hope they are.

the comparison becomes meaningless, and IMHO indeed, potentially dangerous.

there it is again. "dangerous". more innuendo without evidence. i don't think the comparison is meaningless. i think it's all really interesting and educational.

I'm not sure it would make much sense to allow something in experimental that wouldn't ever be allowed in production-ready.

well yes. this is why i believe we should be flexible/lenient in what we allow. the goal of the exercise imo is to learn about performance and its change over time and the various tradeoffs between different platforms, frameworks and approaches. for something like this, openness is important and is i think a large part of the continued interest in these benchmarks. if we introduce too many rules and make it too strict, then where are the boundaries being pushed and what do we learn over time? do we have to wait for MSFT or someone else to tell us what's allowed and what isn't? doesn't this just kill the innovation and experimentation happening here?

TE is setting some rules on what it is they'd like to see benchmarked, to make sure things stay apples-to-apples, and so that we benchmark what we actually want to benchmark (i.e. actual database access).

but, we are benchmarking actual database access. you seem to be implying that the technique we are discussing here is somehow skipping sql commands when it's patently not. this is explicitly tested for by TE toolset.

If drivers start to do memoization (as above)

again, you introduce memoization when literally nobody is talking about doing this. this seems disingenuous to me.

or to coalesce multiple SQL commands into one

there is no coalescing of multiple SQL commands into one happening here. all the commands are individual exec messages and are run individually in the server process and results are built/buffered as they are run and flushed to the wire when the sync is encountered or the buffers are full. this is all perfectly acceptable usage of the postgres wire protocol as far as i can see.

comparing apples and oranges.

you've said yourself that you could submit an entry with these optimizations. why not do so and give it a different name as @matt-42 has done? that seems fair to me and tbh i would like to do the same because i would like to see the perf difference in TE environment given my framework does not currently do this kind of cross request batching. then you can explain to your users the benefits, tradeoffs and dangers of the different approaches and you can see where there may be room for further optimizations.

Here again, I don't think that's the discussion

no, because you keep changing the discussion.

I know (and have known from the beginning...) that specifically for these benchmarks (including the updates one), this technique won't cause issues

you've known from the beginning but you continue to assert this technique is "not safe".

these techniques are tailored specifically for these TE scenarios

i think this whole thread really comes down to this argument alone. you favour a much stricter interpretation of this rule by TE which has clearly not been in effect to date. this is also a completely different argument from the initial, specific objection raised by @sebastienros, which asserts the cross request batching is a violation of the rule that all sql commands must be executed individually on the server and "not combined into a single complex query".

so, it's up to folks at TE if they want to tighten the rules and explicitly disallow this kind of optimization. imo that would be restrictive and it would be better if we encouraged the community to adopt a convention, as @matt-42 has done, of marking the solutions that use this technique with a "-batch" suffix or something similar.

the key, hopefully final, point i want to make here is that in all the scenarios in TE using this technique, you are never acknowledging data back to the web client that has not been committed to the database. i'm happy to be proved wrong on this but i'm pretty sure i'm not.

roji commented 2 years ago

Stepping back and trying to be crystal clear: there's the question of whether implicit batching of unrelated commands is safe, and there's the question of whether it should be allowed in the benchmarks, according to the rules. The two questions are related, though they are separate.

Yes, I agree (and always have) that there are specific scenarios where implicit batching is safe - read-only SQL chains, and also read-only chains followed by a single update, after which comes the Sync (the latter scenario isn't something we've been discussing recently, so I haven't been mentioning it). You are arguing that users should be allowed to explicitly opt into this when they know their workload happens to match the above; I'd avoid this in a production, real-world driver, since the consequences of getting it wrong are disastrous (silent data loss). Even if this discussion only results in some drivers putting this behind an opt-in (which seems to be the case), then I feel something good has already come out of this.

But in any case, every driver writer does and will do what they think is right, which is where we come to the 2nd question - should TE allow implicit batching.

My view of the TE benchmarks is that they're supposed to show you typical or standard performance for a given scenario; Fortunes is supposed to tell you e.g. how many RPS a framework does when executing a single database query. Allowing the above optimization would instead show RPS numbers for queries which are known to be side-effect-free; that's quite a restriction: we're no longer discussing the perf of any single query, but of a very specific subset of queries. Now, it's true that Fortunes happens to be compatible with this optimization from a safety point of view, but IMHO the idea is to use Fortunes as a proxy for general query performance; otherwise the result becomes rather meaningless.

This is what I meant with my "ad absurdum" memoization example - it's just another example of a possible driver optimization. Although it could be a valid thing to do in the real world (no potential safety issues in any case), this isn't something that we want used in the benchmarks, since it doesn't measure what we want (general database query performance): it would provide us with results that are only relevant when the same queries are repeated over and over again (and lagging results are OK). In the same way, implicit batching provides us with results that are only relevant when queries are known to be 100% side-effect-free.

To summarize, not everything that can speed up perf is something we necessarily want allowed in these benchmarks, and just like TE already disallows memoization and coalescing of multiple SQL commands into one, I believe it should disallow this optimization, which is only safe for use in very specific scenarios as we've already agreed. I don't think this should necessarily limit innovation or experimentation - but it would make sure that we're showing apples-to-apples numbers for the same generalized scenarios.

I hope the above accurately (and respectfully!) represents the state of the discussion and our respective positions.

michaelhixson commented 2 years ago

I'd propose something along the lines of:

If any pipelining or network optimization is performed, then the execution behavior and semantics must match what they would be if the multiple SQL statements were executed without the optimization, i.e. in separate roundtrips. For example, it is forbidden to perform any optimization that could change the transactionality and/or error handling of the SQL statements. To avoid any doubt, such optimizations are forbidden even if they wouldn't produce the behavioral change with the specific SQL statements of the TechEmpower benchmarks, but would produce it with other statements a user might supply.

I agree with this proposed requirement.

I tried to articulate this position in a Google Group thread from 2018 -- read from "I bet a compliant pipelining implementation would" onwards. These clarifications never made it into the official requirements, and I think the proposed addition does a good job of capturing them.

It might be worth clarifying with an example for the Postgres protocol specifically, as it sounds like the only way to implement this is to put a sync message between each query.

NateBrady23 commented 2 years ago

We're going to update the requirements with the proposed text above. Thank you everybody for your contributions to this discussion. We won't have any automated way to test for violating this requirement right away, so we'll continue to rely on the community to politely point out the frameworks that do. Thanks again!

roji commented 2 years ago

@nbrady-techempower thanks for your patience with this long thread!

If you do decide to implement verification for this, I think @billywhizz's suggestion above is a good way to do that:

it should be possible to tighten the current checks by verifying the number of txns for each run is same as number of http responses received - so, at least one txn per http response received, no? this would disallow cross request batching for single and fortunes and would allow per request batching for the multi and update queries and make cross-request batching pretty much impossible on those too.

billywhizz commented 2 years ago

sorry to continue this.... =)

this new requirement still seems a bit unclear to me, especially with @roji's response re. my proposed verification which would allow pipelining, but only within each http request. just to try to clarify, is this compliant with the new rule? i think it should be.

for each http request to /db we put at least

BIND
EXEC
SYNC

on the wire

for each http request to /fortune we put at least

BIND
EXEC
SYNC

on the wire

for each http request to /query?q=N we put at least

BIND
EXEC
... repeat N - 1 times
SYNC

on the wire

for each request /update?q=N we put at least

BIND
EXEC
... repeat N - 1 times
BIND
EXEC
SYNC

on the wire

so, essentially, each http request must have an enclosing transaction and transactions/syncpoints cannot be shared across http requests.

OR

are we saying every single sql command must be wrapped in a transaction, so the /query and /update http requests where N = 20 will result in 20 and 21 syncpoints instead of 1?

michaelhixson commented 2 years ago

OR

are we saying every single sql command must be wrapped in a transaction, so the /query and /update http requests where N = 20 will result in 20 and 21 syncpoints instead of 1?

This one.

I'm wondering if there is a way we can make the requirement clearer.

From item 7 under General Test Requirements:

That is, two or more separate queries may be delivered from the application to the database in a single transmission over the network, and likewise for query results from the database back to the application, as long as the queries themselves are executed separately on the database server and not combined into a single complex query.

Would you have asked the same question if we'd added "or transaction" to the end of that?

... as long as the queries themselves are executed separately on the database server and not combined into a single complex query or transaction.

We could also get really specific about Postgres in there. We could add a sentence like this:

If using PostgreSQL's extended query protocol, each query must be separated by a Sync message.
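
For what it's worth, a driver complying with that sentence would do something like the following sketch, even when it writes many queries before reading any results (sendBind/sendExec/sendSync are illustrative names as in the earlier sketch, not a real API):

// still pipelined (no waiting between queries), but each Exec is followed
// by its own Sync, so every query commits as its own transaction
function sendCompliantPipeline(
  conn: { sendBind(sql: string): void; sendExec(): void; sendSync(): void },
  sqls: string[],
): void {
  for (const sql of sqls) {
    conn.sendBind(sql);
    conn.sendExec();
    conn.sendSync(); // one syncpoint per query, as the clarified rule requires
  }
}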

Brar commented 2 years ago

I'm wondering if there is a way we can make the requirement clearer.

After seeing the different opinions around this topic I'd say that it cannot be too clear.

Would you have asked the same question if we'd added "or transaction" to the end of that?

I think that would help to specify the intent you stated above in a rather database-agnostic way, even if it may not completely resolve misunderstandings regarding the behavior around PostgreSQL's Sync message, since its effect on transactions is not documented very clearly and is not totally obvious.

We could also get really specific about Postgres in there. We could add a sentence like this:

If using PostgreSQL's extended query protocol, each query must be separated by a Sync message.

In my opinion that would make it crystal clear, resolve any further misunderstandings about your intent and hit the major issue about comparability of driver implementations that has been discussed here. If you specify it this way, people may challenge your intent and ask the question whether this is a useless limitation you force upon driver implementations but most certainly wouldn't misunderstand it. Depending on the depth of their insight you could point them to this thread to better understand the issues around sync or just tell them that the results of optimizations beyond this point are hard to generalize and are not a goal you are pursuing with the current tests. The compliance of a driver implementation with this rule would also be measurable/provable on the network level (e.g. via Wireshark).

billywhizz commented 2 years ago

@Brar @michaelhixson @roji @nbrady-techempower

If you specify it this way, people may challenge your intent and ask the question whether this is a useless limitation you force upon driver implementations but most certainly wouldn't misunderstand it.

i agree this would be much clearer but i think it will need a verification implemented in the tfb toolset to enforce this clarified rule, otherwise nobody can have any real confidence in which frameworks are compliant or not without going and testing themselves. i might try to do some work on this over next couple weeks if nobody at techempower is available right now and you agree it would be worth doing.

PS - this response from @michaelhixson on that google group thread is i think the clearest statement of what is actually required: https://groups.google.com/g/framework-benchmarks/c/A78HrYsz4AQ/m/OfFev6s_AwAJ. i wish it had been posted here earlier. :roll_eyes:

michaelhixson commented 2 years ago

i agree this would be much clearer but i think it will need a verification implemented in the tfb toolset to enforce this clarified rule, otherwise nobody can have any real confidence in which frameworks are compliant or not without going and testing themselves. i might try to do some work on this over next couple weeks if nobody at techempower is available right now and you agree it would be worth doing.

Right now we're trying to avoid creating more code in TFB's existing Python toolset that we'd have to port over to our new work-in-progress toolset. I think this verification is worth adding, but now is not a good time.

If we were going to add this to the existing toolset, my hope is that it would fit in to the "run siege, then query the database for stats" verifications that we have already.

Related code:

I can't tell if Postgres (or any other database) exposes the stats we need here, though. Nothing is jumping out at me from the pg_stat_statements table that we're currently using or from any of the other pg_stat tables.

If anyone can find out where Postgres stores the information we'd need here, that would be useful to us. If the change to the Python toolset would be small enough, we'd consider making it.

roji commented 2 years ago

If anyone can find out where Postgres stores the information we'd need here, that would be useful to us. If the change to the Python toolset would be small enough, we'd consider making it.

So if we're following @billywhizz's suggestion of checking the number of transactions against the number of HTTP requests (and the calculated number of DB requests made inside them), then that information should be available in pg_stat_database.xact_commit (docs) (suggested above). You'd sample this value before starting the benchmark, and then ensure that after the benchmark it has increased by the total number of statements that were supposed to be issued.
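
A sketch of that check, assuming a hypothetical scalar helper that runs a query and returns a single number, and assuming the benchmark database is named 'hello_world':

// sample xact_commit before and after the run, then compare the delta to
// the number of transactions the test was supposed to produce
async function verifyTransactionCount(
  scalar: (sql: string) => Promise<number>,
  runBenchmark: () => Promise<number>, // returns the expected transaction count
): Promise<void> {
  const sql =
    "SELECT xact_commit FROM pg_stat_database WHERE datname = 'hello_world'";
  const before = await scalar(sql);
  const expected = await runBenchmark();
  const after = await scalar(sql);
  if (after - before < expected) {
    throw new Error(`expected at least ${expected} commits, saw ${after - before}`);
  }
}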

michaelhixson commented 2 years ago

I made those two small edits to the requirements: "or transaction" and "If using PostgreSQL...".

I tried implementing the verification changes using pg_stat_database.xact_commit, but I couldn't get it working correctly. The xact_commit number seems to lag behind what I'd expect. In manual testing, sending a second request makes the number "catch up" to what it should have been after the first request. In automatic verification, the number is much more than 1 request short and it's very inconsistent. I'm not sure what's going on there.

I pushed my (failing) work in progress to a branch: https://github.com/TechEmpower/FrameworkBenchmarks/compare/master...michaelhixson:batch-query-verification

billywhizz commented 2 years ago

I tried implementing the verification changes using pg_stat_database.xact_commit, but I couldn't get it working correctly. The xact_commit number seems to lag behind what I'd expect. In manual testing, sending a second request makes the number "catch up" to what it should have been after the first request. In automatic verification, the number is much more than 1 request short and it's very inconsistent. I'm not sure what's going on there.

yep. i saw this same behaviour with xact_commit when i tried it a few weeks ago. didn't have time to investigate why. maybe something that should be raised with the pg devs?

Brar commented 2 years ago

Maybe the following information about the statistics collector will help:

When using the statistics to monitor collected data, it is important to realize that the information does not update instantaneously. Each individual server process transmits new statistical counts to the collector just before going idle; so a query or transaction still in progress does not affect the displayed totals. Also, the collector itself emits a new report at most once per PGSTAT_STAT_INTERVAL milliseconds (500 ms unless altered while building the server). So the displayed information lags behind actual activity.

Source: https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS-VIEWS
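
Given that lag, a verifier would have to let the collector settle before reading the counter; a sketch, reusing the hypothetical scalar helper from above (600 ms just exceeds the default PGSTAT_STAT_INTERVAL of 500 ms):

// poll xact_commit until two consecutive reads agree, so that pending
// per-backend stats have been transmitted and reported
async function settledXactCommit(
  scalar: (sql: string) => Promise<number>,
): Promise<number> {
  const sql =
    "SELECT xact_commit FROM pg_stat_database WHERE datname = 'hello_world'";
  let prev = await scalar(sql);
  for (;;) {
    await new Promise((resolve) => setTimeout(resolve, 600));
    const next = await scalar(sql);
    if (next === prev) return next;
    prev = next;
  }
}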

stephenh commented 1 year ago

I am wary of adding "yet another opinion", but FWIW: I am not a low-level framework optimizer (like @billywhizz :-D and others on this thread); I am an application-/ORM-builder, and I could see @billywhizz's outline of:

for each request /update?q=N -->

BIND
EXEC
... repeat N - 1 times
BIND
EXEC
SYNC

Being done safely & automatically for users of ORMs that use a Unit of Work pattern:

// called for each request to /update
async function handleUpdate(n: number) {
  await uow.transaction(async (uow) => {
    await Promise.all(zeroTo(n).map(async () => {
      const w = await uow.load(World, random());
      w.randomNumber = random();
    }));
    // The ORM diffs/sends UPDATE(s) automatically when the lambda finishes
  });
}

There are:

And so, AFAIU in this model, the per-request pipelining that @billywhizz proposed would actually be safe & kosher?

Even if it's not allowed by TechEmpower, I want to get my ORM to do that someday. :-D

(I get the push back that cross-request pipelining is even more esoteric, although I wonder if an ORM like ^ could still safely take advantage of it (when outside of transactions), again by knowing/controlling the queries, which given the ORM-focus of the TechEmpower benchmark, again seems kosher.)

(...also I wonder if just saying updates?q=N endpoint must use a transaction would more succinctly highlight that each request must use its own connection and not multiplex a single connection; surely that was discussed, but I didn't see it go by.)