customink / activerecord-aurora-serverless-adapter

ActiveRecord Adapter for Amazon Aurora Serverless
https://technology.customink.com/blog/2020/01/03/migrate-your-rails-app-from-heroku-to-aws-lambda/
MIT License
66 stars 7 forks source link

ActiveRecord Phase #2.b - Draw The Owl βœοΈπŸ¦‰ #7

Closed metaskills closed 4 years ago

metaskills commented 4 years ago

Following up from #5 (draw two circles), this pull request gets us to 100% passing all ActiveRecord tests πŸ˜ŽπŸ€—πŸ€©πŸ€“πŸ€‘πŸ₯°πŸ˜… and means the adapter is ready for Rails v6.x business. Next steps:

High Level Fixes

Test Structure

The default behavior of bundle exec rake test is to run all internal & ActiveRecord tests. However, because of the "Flaky Tests & Foreign Key Checks" issues mentioned below, our tests are broken up into chunks. Currently these are:

  1. Our internal unit tests.
  2. Explicit ActiveRecord multi-db tests. 3 Cases.
  3. Highly flaky HABTM (has and belongs to many) test case.
  4. Similarly flaky connection handler test case.
  5. All remaining tests in 3 distinct batches.

Since our Aurora Serverless cluster is set to automatically pause itself, it is possible that the cluster cold be asleep when tests need to run. To address this, I have added a test/bin/wakeup command which uses the lambci/cdk docker image to wake up the cluster. GitHub actions uses this as a Wakeup step.

Coerced Tests

Sometimes some core ActiveRecord tests will not work or needs extra per-adapter logic. When I did the SQL Server adapter I never wanted to pollute adapter code/conditions into core rails. To avoid that coupling, I created a simple Coerceable module to load after ActiveRecord's tests are loaded. It works by using method_define? and if needed undef_method to remove any coerced test(s).

Optionally you can define a method to replace a coerced test. Our coerced_tests.rb file makes use of both techniques.

Flaky Tests & Foreign Key Checks

Every now and then a handful ~7 of the ~6,000 tests may fail with an error like this:

ActiveRecord::InvalidForeignKey:
Cannot add or update a child row:
a foreign key constraint fails
`activerecord_unittest`.`authors`,
CONSTRAINT `fk_rails_94423a17a3`
FOREIGN KEY (`author_address_id`)
REFERENCES `author_addresses` (`id`)

Many of ActiveRecord tests use yaml-based DB fixtures. A key part of that process executes a block of code where it disables referential integrity. For MySQL, it checks the @@FOREIGN_KEY_CHECKS variable and calls SET FOREIGN_KEY_CHECKS to return the state back to where it started.

During this work I discovered a key difference with Aurora Serverless - client session state is shared. For example, I was able to create a Aws::RDSDataService::Client, set foreign key checks to 0 and see that the session value was the same on another newly created client object. The initial fix was to set this in our database.yml file since the MySQL adapter will take any variables and use them to set state on newly created clients.

variables:
  foreign_key_checks: 1

Sadly, this did not help with the flaky test failures. I did some major debugging where I logged our client wrapper's object id and transaction state. Please see the gist for details. This showed me that a) we had not multi-client test issues where one client was messing with another and b) the foreign keys check where in the (off) state when the error happened. This let me to believe that Aurora Serverless itself has some small timing issues when disabling referential integrity.

https://gist.github.com/metaskills/3752c66ed2b50599e4b66b7664f1d856

My final tool to combat this is to use Minitest:Retry to both report on current state and force FK checks back to a known state. To add insult to humiliation... it appears Aurora Serverless has just started acting just fine today (2019-12-28) and not failing any tests in both the notoriously buggy AASA_ARHABTM suite or other places. Super. But for now I'm gonna leave the minitest retry in to see if it help if this issue comes back.

Should these worry us? I do not think so. My assumption is this is isolated to just foreign key check session state. If so, I doubt production applications are doing fixture inserts and/or disabling foreign keys for some reason.