aws / aws-sdk-js

AWS SDK for JavaScript in the browser and Node.js (In Maintenance Mode, End-of-Life on 09/08/2025). The AWS SDK for JavaScript v3 in the browser and Node.js is available here: https://github.com/aws/aws-sdk-js-v3
https://aws.amazon.com/developer/language/javascript/
Apache License 2.0
7.6k stars 1.55k forks source link

RDS Data API does not encapsulate internal postgres-serverless service errors #3387

Closed IhostVlad closed 3 years ago

IhostVlad commented 4 years ago

Confirm by changing [ ] to [x] below to ensure that it's a bug:

When starting using Aurora PostgreSQL serverless in late 2019, everything works well with small spikes with low-load traffic. But when traffic load had been increased, weird things had begun. Parallel launched RDS Data API queries had begun to mix results and errors between each other. Request with SQL Select statement can crash with error from concurrent Update statement. Database often had been crashed with internal errors like "Too many connections", however, it's serverless service and should not discover any internal implementations, and much for should not throw low-level errors to consumers.

Working with Aurora PostgreSQL serverless had been paralyzed until this post had been published on Reddit https://www.reddit.com/r/aws/comments/f6ag4c/rds_data_api_inconsistency_and_weird_behavior/. RDS Data API had fixed most major issues, like query result mixing and hanging internal connections. Service becomes usable, but not everything is simple. Unfortunately just using ExecuteStatement and issuing SQL queries to a database via appropriate API https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/RDSDataService.html#executeStatement-property is not enough.

Since the database is serverless, tons of temporary errors can appear while executing SQL statements. Errors do not correlate with corresponding queries and are just abstraction lack from low-lever layers of AWS's serverless implementation. Although RDS Data API developers can easily perform handling these errors in that API and rethrow only user-land errors to a consumer, Data API throws on service errors too, and developer should segregate that errors and performs custom retries and similar logic.

Some service errors are documented, but many and many errors can be discovered only by manual running queries with high-load. For example, RDS Data API can crash with following errors "Remaining connection slots are reserved", "I/O error occurred", "too many clients already", "Cannot write in a read-only transaction", "canceling statement due to user request" and so on. More than that, some kind of service errors are retriable, and some are not retriable. For example, SQL query crashed with "I/O error occurred" can be retried with some jitter, but retrying "Request timed out" is dangerous - however current SQL result is a failure, SQL query can successfully be executed and committed into a database but has not been transferred into RDS Data API result set, and there is no way to determine it. Of course, nothing of the above is documented.

So, is RDS Data API usable after this all? Major issues with mixing and hanging statements had been fixed, most temporary errors can be wrapped into a jitter loop, which retries queries until success. The most complicated part is handling Statement Timeout errors because there is no information about the successful completion of the SQL query. That's why to Execute Statement wrapper is the four-state function, which returns result row set on success, throw an exception on custom error, perform jitter auto-retry on a service error, and return high-impedance state on Statement Timeout errors. If the application code receives a high-impedance state, it should manually retrieve database state, and repeat or do not repeat the subsequent query.

Unfortunately all above is not enough. Besides that mentioned tiny inconveniences, RDS Data API provides two additional complex behaviors. Firstly, there is a 3-minute limit for an external transaction, and of course, there is no support for internal database SQL transactions. When 3 minutes interval had been expired, the transaction is automatically had been rolled back. What is a trick? If Commit Transaction statement had been hanged up, the situation is like Statement Timeout errors, because there is no way to discover whether has a transaction been committed or auto rolled back via timeout.

Secondly, the external transaction Commit Statement does never fails. Even if the current transaction had been already aborted via incorrect SQL statement, RDS Data API reports about success commit, however, the transaction is automatically rolled back in that case. Since this moment the RDS Data API team had stopped answering or Reddit, and corresponding posts with issue reports are ignored https://www.reddit.com/r/aws/comments/i3gnpz/rds_data_api_weird_behavior_on_statement_timeout/ and https://www.reddit.com/r/aws/comments/g0h4jq/rds_data_api_and_transaction_management/.

Is it enough to make RDS Postgres serverless service unusable? Of course, no. The transaction log can be stored in special tables in special schemas, so when the transaction had been disappeared via timeout, there is a possibility to inspect the transaction log and determine whether the external transaction had been committed or rolled back. Also, the Commit Statement issue can be solved by sending a synthetic query with a dummy query like "SELECT 0" and determining whether the current transaction is a committable state.

Although RDS Postgres serverless provides supports for potentially infinite scale and high-load traffic, RDS Data API does not provide useful methods to work with this good serverless database. All issues mentioned above in the article can be easily encapsulated under the hood of RDS Data API, so application developers may write just program code and do not think about service internal parts.

More discussion at https://www.reddit.com/r/serverless/comments/i4qcz1/aws_postgresql_serverless_tricks_and_caveats/

ajredniwja commented 4 years ago

I believe this not related to the JavaScript SDK but are very specific to the RDS DATA API.

Will be reaching out to the team to discuss these further.

IhostVlad commented 4 years ago

CC @nitesmeh

github-actions[bot] commented 3 years ago

Greetings! We’re closing this issue because it has been open a long time and hasn’t been updated in a while and may not be getting the attention it deserves. We encourage you to check if this is still an issue in the latest release and if you find that this is still a problem, please feel free to comment or open a new issue.