TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

Most of the best-performing frameworks don't survive temporary db connectivity loss #8790

Closed itrofimow closed 5 months ago

itrofimow commented 7 months ago

TLDR: in the current top 20, only 7 implementations survive a temporary database connectivity loss; the others either hang, crash, or get stuck returning HTTP 500.


I decided to become a QA engineer for a day and give the frameworks a taste of a real-world scenario, to see how they would react. I did this:

  1. Launch the framework
  2. Poke it a bit to warm it up
  3. Restart the database
  4. Poke the framework some more

for every framework in the top 20 of the most recent continuous run.

The results I've got:

| Framework | Outcome |
| --- | --- |
| may-minihttp | hangs |
| xitca-web | :heavy_check_mark: (https://github.com/TechEmpower/FrameworkBenchmarks/pull/8781) |
| just | hangs |
| vertx | stuck with 500/502[^1] |
| h2o | :heavy_check_mark: |
| redkale | :heavy_check_mark: |
| officefloor | hangs |
| drogon | :heavy_check_mark: |
| viz | stuck with 500/502[^1] |
| axum | stuck with 500/502[^1] |
| jooby | hangs |
| actix | stuck with 500/502[^1] |
| mormot | :heavy_check_mark: |
| ntex | hangs |
| vertx-web | stuck with 500/502[^1] |
| wizzardo-http | :heavy_check_mark: |
| salvo | crashes |
| aspnetcore | :heavy_check_mark: |
| lithium | :heavy_check_mark: |
| inverno | hangs |

Some of the rules in the spec are a bit vague (intentionally, I believe), and Rule 1 of the General Test Requirements is definitely one such rule, but I would be very surprised to learn that the observed behavior qualifies as "production-grade" in anyone's book.

Having a reliable way to recover from potential errors is likely not free performance-wise. Some of the frameworks above come with configurations that behave reasonably in the scenario given, but their reliability cost is not just "not free", it's staggering: axum, for example, which hits a very impressive 550K RPS and is ranked 9th in the "Single Query" test, has a configuration that survives a database restart. That configuration drops its performance to 90K RPS (a 6x degradation[^2]) and rank 262, below even the MongoDB-based implementations for that test.


Considering the above, from the perspective of a web developer trying to choose the right tool for the job, I personally find the performance figures shown misleading.

Curious to hear the community thoughts on the matter.

[^1]: By 502 I mean "closes the client connection right away".
[^2]: I think it's not configured optimally, but still.

itrofimow commented 6 months ago

I assume you've got your browser tabs mixed up, because otherwise I haven't the slightest idea what ORM/Rust/Redis question you @Zheby are talking about.

itrofimow commented 6 months ago

cc @Xudong-Huang @fakeshadow @billywhizz @vietj @sagenschneider @fundon @jplatte @jknack @ecruz-te @fafhrd91 @ShreckYe @chrislearn @jkuhn1 as the most recent committers into the corresponding implementations.

Do you find my concerns justified?

jplatte commented 6 months ago

Hey, I do think that if things don't recover from temporary DB issues, that's pretty weird / probably shouldn't fly for the benchmarks.

Re. axum, the one I contributed to, I can say that the code has some issues (though none that I know of would certainly degrade performance), but I haven't had much motivation to take the time required to fix them. Which configurations are you talking about as the slow one and the fast one?

itrofimow commented 6 months ago

Hi!

This is actually a common theme for every Rust implementation (as far as I can tell): employing tokio_postgres directly is fast but doesn't recover (https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Rust/axum/src/database_pg.rs#L41), while deadpool-based implementations seem to be reliable yet slow (https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Rust/axum/src/database_pg_pool.rs#L43)

jplatte commented 6 months ago

I wouldn't mind at all having the pg impls removed, personally x) Interesting that sqlx is significantly slower than deadpool again, wouldn't have expected that. Though a lot might have happened since the 2023 results, anyways 🤷🏼

fafhrd91 commented 6 months ago

if the techempower team says recovery is required, it should be easy to add. doubt it would change results by a measurable amount

itrofimow commented 6 months ago

@NateBrady23 could you please clarify the rules here?

NateBrady23 commented 6 months ago

Thanks for doing that QA @itrofimow

I agree with everything said here. I would expect a production-grade framework to be able to recover from a temporary db connectivity loss. However, I also agree that it probably won't alter the results.

Still, I'd like to see the frameworks here updated, though we probably wouldn't be adding this into the test suite anytime soon.

I'll leave this issue open for now. If anyone wanders here, it probably doesn't look good if your framework implementation doesn't recover in such a way.

fakeshadow commented 6 months ago

Mandatory reconnectivity is a good rule to add. Though it's unlikely to affect perf unless it's a hot path of your system.

bhauer commented 6 months ago

> Some of the rules in the spec are a bit vague (intentionally, I believe), and Rule 1 of the General Test Requirements is definitely one such rule, but I would be very surprised to learn that the observed behavior qualifies as "production-grade" in anyone's book.

I am the original author of rule 1 of the general test requirements. You are right that it is intentionally vague on the specifics of production grade. My thinking at the time was that if there were specific matters, like the one you raise, the community could discuss them in precisely this manner. So thank you for raising the concern!

Our deference to each framework's definition of "production grade" grants us some luxury to avoid dictating the right or wrong ways to build web applications. For example, I would not want to reject a framework or platform whose developers stated something akin to, "It's normal to restart the application server in production if the database server restarts." While I personally would disagree with such a concession, I meanwhile would continue not using foreign keys in databases, and another developer might consider that reckless. I suspect we can all think of weird stuff we've seen considered production-grade by other developers.

My inclination would be to allow for the current situation but re-emphasize a longer-term goal of the project to offer more detailed implementation notes for visitors on the results web site. Being able to understand that a given framework provides database connection resilience (or does not) with relative ease would be worthwhile.

In other words: don't forbid this arguably risky optimization (or lack of functionality), but call attention to it, and by so doing, nudge the developers to either explain how they recommend handling database outages in a production environment, present an argument about why it's not relevant in their ecosystem, or accept that consumers of the results data will be made aware of the concession.

itrofimow commented 6 months ago

> I would expect a production-grade framework to be able to recover from a temporary db connectivity loss.

Thanks for the clarification.

> However, I also agree that it probably won't alter the results.

About that: pipelining unrelated queries over a relatively small number of connections seems to be instrumental in squeezing as much performance from PostgreSQL as possible in these benchmarks; however, the most widely used tokio_postgres simply doesn't have a "reconnect" interface, and its author himself recommends pooling the connections.

Pooling connections takes away the possibility of pipelining unrelated queries, so one needs to increase the number of connections to achieve the same throughput. Increasing the number of connections affects the database badly, so one might want to implement some kind of complicated synchronization in user code to keep the connection count small yet fault-tolerant. Implementing this synchronization in user code is, IMO, questionable from a "showcase what the framework can do" point of view.

Curious to learn how folks will overcome this limitation; we will see.

P.S. This issue (the github issue I mean, https://github.com/TechEmpower/FrameworkBenchmarks/issues/8790) is not specifically about Rust implementations, it just so happens that they all fail the QA-test for the exact same reason.

edit: typos edit2: wording

itrofimow commented 6 months ago

Given your interpretation, should I rename the issue to "Most of the best-performing frameworks don't survive temporary db connectivity loss", so it doesn't come across as strongly worded w.r.t. the rules? @bhauer

bhauer commented 6 months ago

@itrofimow Up to you! That title revision might be more clear about the specific concern, but I don't think the title matters very much.

ShreckYe commented 6 months ago

Hi @itrofimow, I mainly adapted the Vert.x benchmarks into the Kotlin language. Let me help you cc @vietj, @franz1981, and @pmlopes who are the core developers for this.

jkuhn1 commented 6 months ago

Hi @itrofimow

I do agree that production-grade applications should be resilient to a DB failure or restart. Now, looking at rule 1 (https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview), as you mentioned it is not really obvious that this is required for the performance test ("should" is not "must"). If we want to go down that path, I think it would be good to clarify this: we could keep "should" to be inclusive and yet clearly state that implementations must be resilient to DB connection loss. But this raises other concerns: TFB must then include a test for it, and if we test this, what else should we test to make sure an application is resilient? (Client connection loss? Clients sending bad HTTP requests?) This could easily lead to testing frameworks' resiliency and no longer only pure performance. Don't get me wrong, that would be a good thing, but it would also widen the scope.

Inverno (and I guess many other Java-based frameworks) relies on the Vert.x SQL client (which is really good). The framework is still under heavy development and I wouldn't use it yet for mission-critical applications (the above issue just proved it), but my goal is mission-critical production grade, so I'll take some time to test this scenario and see if something can be done on my side; otherwise I guess @vietj would need to provide some fix or advice (please let me know if I can help).

jkuhn1 commented 6 months ago

Hi,

I've run some tests on Inverno and refreshed my memory of the application used for the benchmark test. I'm just posting my findings here because they might help explain why other frameworks based on the Vert.x SQL client (or not) are also failing (including Vert.x).

I decided to create a single DB connection bound to each I/O thread (2*n, where n is the number of cores) rather than relying on a connection pool, which would be the standard setup for a production application. Using a single connection allows pipelining DB requests, which, as you know, has a significant impact on performance (I guess this was already noticed in previous comments).

Considering resiliency was not in scope for such a performance test, I didn't find it useful to include something that would recreate the connection in case of errors; this would be doable but not really important for assessing performance.

I tested the above scenario in an application using the standard pooled SQL client and I can confirm Inverno is resilient to a DB restart, with the Vert.x SQL client doing its job perfectly by recreating the dead connections.

In a production application, people would most likely use a pooled SQL client, but that doesn't mean it is not possible to create specific connections like in the TFB app when pipelining is of interest; the only drawback is that you have to handle connection errors on your own.

From what I saw, the issue seems to be the same for the Vert.x TFB application, tbc with @vietj.

So there's no issue with the Vert.x SQL client. For the rest, I think it depends on what you decide for the TFB rules: do we include resiliency to DB failure as part of the performance test or not? If so, I'll provide a fix to make this happen.

Regards

volyrique commented 6 months ago

> Pooling connections takes away the possibility of pipelining unrelated queries, so one needs to increase the number of connections to achieve the same throughput. [...] This issue is not specifically about Rust implementations...

Well, given the latter remark, this is simply not true: h2o does both pipelining and pooling, and according to your own results it is the fastest framework on average that is also resilient (in the context of this discussion). While the framework is currently configured to use 1 database connection per pool (and there is 1 pool per worker thread, and 1 worker thread per logical processor), there isn't a separate code path for any particular configuration, and the setting can simply be changed by tinkering with the value here. In fact, a configuration with at most 2 connections and up to 16 pipelined queries was tested for a while after commit 1dc68e936a36e6bedfcf7ad734cfa8de85d2eecc, for example. The load balancing between connections could use some work, though; currently it follows a simple greedy strategy.

You are correct that the optimal case for PostgreSQL seems to be 1 database connection per logical processor core, so the optimal configuration is the one from the previous paragraph, given that the benchmarking environment that the TechEmpower team has set up at the moment uses homogeneous hardware. However, back in the ServerCentral days (yeah, it has been a while) that was not the case; as a side note, it was also a NUMA environment. Furthermore, when I am testing my code in the cloud, I often bring up powerful client/load generator and database (virtual) machines and a relatively weak application server, in order to alleviate any potential bottlenecks that are outside the framework. In such asymmetric cases it is useful to be able to use more than 1 database connection per pool, so I have implemented it in a way that does not require any code changes to switch.

itrofimow commented 6 months ago

> Well, given the latter remark, this is simply not true

You are right, I should've worded it better: by "this issue is not specifically about Rust" I meant the https://github.com/TechEmpower/FrameworkBenchmarks/issues/8790 issue itself, whereas the "pipelining vs pooling" issue is mostly Rust-related.

billywhizz commented 6 months ago

> > Pooling connections takes away the possibility of pipelining unrelated queries, so one needs to increase the number of connections to achieve the same throughput. [...] This issue is not specifically about Rust implementations...
>
> Well, given the latter remark, this is simply not true - h2o does both pipelining and pooling, and according to your own results it is the fastest framework on average that is also resilient (in the context of this discussion).

it's my understanding that postgres pipelining should be disabled for all frameworks in the tests: on every connection to postgres, the client should wait to receive a response before putting another command on the wire. there has already been a big discussion about this, and a number of frameworks have had to turn it off, which drastically alters the results.

if there are frameworks currently using postgres pipelining then it should probably be flagged up here and authors notified.

re. resilience, there is such a broad range of frameworks on here in various states of maintenance that i am not sure adding more rules for the existing tests is useful, and it will also likely have no impact on results.

imo, TE exists to test performance/throughput and not much else. it's a yardstick for folks to help uncover performance issues and get a sense of performance differences between different languages, frameworks and approaches. if we want to test for correctness and spec compliance/etc. then i think that is really a whole different ballgame.

volyrique commented 6 months ago

On the contrary, pipelining is explicitly allowed:

> ... it is permissible to pipeline traffic between the application and database as a network-level optimization. That is, two or more separate queries may be delivered from the application to the database in a single transmission over the network, and likewise for query results from the database back to the application, as long as the queries themselves are executed separately on the database server and not combined into a single complex query or transaction. If any pipelining or network optimization is performed, then the execution behavior and semantics must match what they would be if the multiple SQL statements were executed without the optimization, i.e. in separate roundtrips. For example, it is forbidden to perform any optimization that could change the transactionality and/or error handling of the SQL statements. To avoid any doubt, such optimizations are forbidden even if they wouldn't produce the behavioral change with the specific SQL statements of the benchmarks, but would produce it with other statements a user might supply. If using PostgreSQL's extended query protocol, each query must be separated by a Sync message.

All frameworks that had to be fixed were violating the transactionality requirement, and skipping Sync messages between queries in particular.

billywhizz commented 6 months ago

> On the contrary, pipelining is explicitly allowed:
>
> > ... it is permissible to pipeline traffic between the application and database as a network-level optimization. That is, two or more separate queries may be delivered from the application to the database in a single transmission over the network, and likewise for query results from the database back to the application, as long as the queries themselves are executed separately on the database server and not combined into a single complex query or transaction. If any pipelining or network optimization is performed, then the execution behavior and semantics must match what they would be if the multiple SQL statements were executed without the optimization, i.e. in separate roundtrips. For example, it is forbidden to perform any optimization that could change the transactionality and/or error handling of the SQL statements. To avoid any doubt, such optimizations are forbidden even if they wouldn't produce the behavioral change with the specific SQL statements of the benchmarks, but would produce it with other statements a user might supply. If using PostgreSQL's extended query protocol, each query must be separated by a Sync message.
>
> All frameworks that had to be fixed were violating the transactionality requirement, and skipping Sync messages between queries in particular.

yes, you are right. apologies - it's been some time since i looked at this. :pray: but there is still no verification that this is implemented "correctly" with a Sync for every query.

fakeshadow commented 6 months ago

I have added a persistent db session with auto recovery of the TCP connection and cached prepared statements for xitca-web in #8781

@itrofimow Do you mind trying it and possibly updating the issue if it passes your test?

itrofimow commented 6 months ago

Hi @fakeshadow

I've tested your PR locally and can confirm that it indeed makes xitca-web reconnect after a DB restart; I've updated the issue accordingly. I like your RwLock approach: seems like the best of both worlds to me, even though it still comes with some minor (probably negligible) overhead.

Thanks for giving fixes a start!

edit: typo

fakeshadow commented 6 months ago

> Hi @fakeshadow
>
> I've tested your PR locally and can confirm that it indeed makes xitca-web reconnect after a DB restart; I've updated the issue accordingly. I like your RwLock approach: seems like the best of both worlds to me, even though it still comes with some minor (probably negligible) overhead.
>
> Thanks for giving fixes a start!

Thanks for the test and update. The lock has zero contention on the happy path when no reconnecting is happening, so it's just a couple of single-threaded atomic reads and a single CAS exchange, and I would not worry about the perf impact. There is also room for optimization to remove the atomic operations completely, as xitca-postgres has a single-thread feature where it can replace atomics and locks with their single-threaded counterparts.

jkuhn1 commented 5 months ago

Hi,

I've just created PR #8865 for Inverno. This should make the test application resilient to DB connection loss. The solution is pretty elegant and should have no impact on performance (connections cached per I/O thread are simply recreated on connection loss).

Rgds

elibroftw commented 1 month ago

Closed accidentally?

klaussilveira commented 1 month ago

It seems like it. The problem still persists, with plenty of those frameworks engineered just for the appearance of performance.

@NateBrady23 ?

fakeshadow commented 1 month ago

I think the issue is mostly concluded. The mentioned frameworks would have already responded and submitted patches if they were interested in fixing this issue in the bench code. Besides, as I and others mentioned, there will not be a major perf impact caused by a cold-path reconnection of the db session.

From what I see, unless there is a rule change to the db benchmarks, there is no way to motivate everyone to make the change. (I don't advocate for a rule change.)

klaussilveira commented 1 month ago

Well, the current rules seem to be enough to disqualify such frameworks

> All test implementations should be production-grade. The particulars of this will vary by framework and platform, but the general sentiment is that the code and configuration should be suitable for a production deployment.

https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#general-test-requirements

How can a web framework that primarily uses a SQL database be considered production-grade if it fails to simply reconnect after a connection failure?

fakeshadow commented 1 month ago

> Well, the current rules seem to be enough to disqualify such frameworks
>
> > All test implementations should be production-grade. The particulars of this will vary by framework and platform, but the general sentiment is that the code and configuration should be suitable for a production deployment.
>
> https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#general-test-requirements
>
> How can a web framework that primarily uses a SQL database be considered production-grade if it fails to simply reconnect after a connection failure?

Yeah, personally I agree with you on this issue, and I would be happy to see every framework implement db session persistence.

That said, from what I see here, people have drastically different opinions on what "production grade" means, and it's really hard, sometimes downright impossible, to come to a consensus. There are a bunch of similar unresolved issues related to how to define words. It's kind of sad, but also normal for a big community.

fakeshadow commented 2 weeks ago

> Hi @fakeshadow
>
> I've tested your PR locally and can confirm that it indeed makes xitca-web reconnect after a DB restart; I've updated the issue accordingly. I like your RwLock approach: seems like the best of both worlds to me, even though it still comes with some minor (probably negligible) overhead.
>
> Thanks for giving fixes a start!
>
> edit: typo

Sorry for another ping. Recently I found a flaw in this approach: user-defined types can change during db downtime, causing a conflict between the client's cached types and prepared statements.

Therefore xitca-web has moved to a traditional connection pool in #9252 for eventual recovery from db connection loss (there can be some 500 errors until the reconnecting kicks in). The whole session cache is rebuilt after reconnecting, so no artifact is left to cause a conflict.

I gotta say there is a measurable perf regression with the change, around 1~2% in db-related scores. And I'm confident that almost all Rust frameworks will suffer a perf regression if the connection pool approach is required. The severity of the regression will differ from framework to framework, but the majority of the top scorers will see a huge regression, as they use popular crates like the tokio-postgres + deadpool combo.