customink / activerecord-aurora-serverless-adapter

ActiveRecord Adapter for Amazon Aurora Serverless
https://technology.customink.com/blog/2020/01/03/migrate-your-rails-app-from-heroku-to-aws-lambda/
MIT License
60 second Timeout Issue w/ Hang #20

Open davidplappert opened 3 years ago

davidplappert commented 3 years ago

It appears as if continueAfterTimeout needs to be set.

I keep getting this error: ActiveRecord::StatementInvalid: Request timed out

And the runs including this error are 60 seconds + normal execution time.

This error happens at what appear to be random places in the code (no single query is throwing it). However, it normally appears when we are under load, as happened this morning when we spiked from 22 to 75 average concurrent executions and then to 99 within 60 seconds. The queries it errors out on are what I would consider simple selects, inserts, and updates. Right after the spike started and the errors started flowing, I was able to log in via SQL 3309 and run commands like I normally would without issue. This error only happened to about 1% of the requests during this time period.

Please review the following links:
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/data-api.html
https://forums.aws.amazon.com/message.jspa?messageID=946929

Please review the below concurrent executions graph. Right when the spike starts, the errors start to flow.

Screen Shot 2021-01-10 at 9 37 06 AM

Also, please review these Serverless RDS metrics. You can see that we are not close to maxing out our cluster and that CPU stays at under 20% the entire time.

Screen Shot 2021-01-10 at 9 38 28 AM

metaskills commented 3 years ago

Interesting. When writing the adapter, I remember thinking about continue_after_timeout (https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/RDSDataService/Client.html#execute_statement-instance_method) because it made perfect sense to add it for long-running migrations, since the docs explicitly call it out for DDL statements. I never added it because I was fairly certain Lambda functions should not be running DDL statements, especially long-running ones.

That said, I never considered that it would make sense to have this option set to true all the time. Are you suggesting that we add it here (https://github.com/customink/activerecord-aurora-serverless-adapter/blob/master/lib/active_record/connection_adapters/aurora_serverless/client.rb#L37), set to true, and that this would fix your issues?
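For readers following along, here is a minimal sketch of what passing the option could look like. The helper name execute_statement_options and the ARN and database values are hypothetical and not part of the adapter; the keys are parameters of Aws::RDSDataService::Client#execute_statement from the AWS SDK for Ruby. Whether setting the flag actually resolves the timeouts reported here is an assumption this thread does not confirm.

```ruby
# Hypothetical helper (not in the adapter): builds the keyword arguments
# for Aws::RDSDataService::Client#execute_statement.
def execute_statement_options(sql, resource_arn:, secret_arn:, database:,
                              continue_after_timeout: true)
  {
    resource_arn: resource_arn,
    secret_arn: secret_arn,
    database: database,
    sql: sql,
    include_result_metadata: true,
    # Asks the Data API to keep running the statement after the call times out.
    continue_after_timeout: continue_after_timeout
  }
end

# Placeholder identifiers for illustration only.
options = execute_statement_options(
  "SELECT 1",
  resource_arn: "arn:aws:rds:us-east-1:123456789012:cluster:example",
  secret_arn: "arn:aws:secretsmanager:us-east-1:123456789012:secret:example",
  database: "example_db"
)

# With the aws-sdk-rdsdataservice gem loaded, these could be passed as:
#   Aws::RDSDataService::Client.new.execute_statement(**options)
```

Keeping the flag as a keyword argument with a default would let callers opt out per statement if always-on behavior turns out to be undesirable.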

davidplappert commented 3 years ago

That would help, yes.

I have also been on the phone with AWS today (I have a business support contract), and part of the issue is also a Data API Rate Limit Per Second.

Screen Shot 2021-01-10 at 10 14 45 PM

It sounds like they may be able to increase that for me, but it has not been approved yet. Funny thing, I only got one (or very few) Rate Limit Exceeded when I first hit the limit. As soon as it started hitting, I manually bumped my serverless to 256 ACU, but the issue continued. The symptoms/errors thrown were these:

ActiveRecord::StatementInvalid: The rate exceeds the limit (thrown once or very few times, right when the issue started)
ActiveRecord::StatementInvalid: Statement cancelled due to timeout or client request (thrown 5% of the time or less)
ActiveRecord::StatementInvalid: Request timed out (49%, alternating)
ActiveRecord::StatementInvalid: Concurrent connections limit exceeded (49%, alternating)

The CPU on my serverless cluster never crossed 33% and my connections never crossed 575. When this all started, my cluster was at 8 ACU, so I had 1,000 connections at my disposal. That helped prove the issue was with the Data API somewhere.

Also, a note for others who may see this: in the Amazon RDS log, each of the events that says autoscaling was actually me manually changing the config.

Screen Shot 2021-01-10 at 10 19 32 PM
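One client-side mitigation for the throttling errors listed in this comment is to retry with exponential backoff. The following is only a sketch under stated assumptions: with_data_api_retries and RETRYABLE_MESSAGES are hypothetical names, the adapter does not ship anything like this, and matching on message text is fragile. "Request timed out" is deliberately left out of the retry list, because re-running an insert or update that may already have been applied is not safe in general.

```ruby
# Messages from the errors reported in this thread that suggest throttling.
RETRYABLE_MESSAGES = [
  "The rate exceeds the limit",
  "Concurrent connections limit exceeded"
].freeze

# Hypothetical helper: re-runs the block when a throttling error is raised.
# `sleeper` is injectable so the backoff can be observed without waiting.
def with_data_api_retries(max_attempts: 3, base_delay: 0.1,
                          sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield
  rescue StandardError => e
    retryable = RETRYABLE_MESSAGES.any? { |message| e.message.include?(message) }
    # Re-raise anything that is not throttling, or once attempts run out.
    raise unless retryable && attempt < max_attempts

    # Exponential backoff with jitter: up to base_delay * 2^(attempt - 1) seconds.
    sleeper.call(rand * base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage, assuming `connection` is an ActiveRecord connection:
#   with_data_api_retries { connection.select_value("SELECT 1") }
```

Retries add load of their own, so under a hard per-second rate limit they only help if the attempt count stays small and the delays are spread out.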

metaskills commented 3 years ago

Thanks, if you think there is something we can do in the adapter, please let me know.