Closed ankane closed 2 years ago
I'd be very interested in some helper method for renaming tables.
Gitlab has these migration helpers for their own internal use. They automagically set up triggers and such in order to do things like seamless column renaming.
Can we add a retry mechanism if migrations fail due to a timeout?
Hey @bheemreddy181, it's an interesting idea.
For statements run outside a transaction, retrying the statement should work:
class AddSomeIndexToUsers < ActiveRecord::Migration[6.0]
disable_ddl_transaction!
def change
add_index :users, :some_column, algorithm: :concurrently
end
end
For statements run inside a transaction, the entire transaction needs to be re-run. For migrations that only use Active Record's DDL transaction and have no side effects, retrying the entire migration should work:
class SimpleMigration < ActiveRecord::Migration[6.0]
def change
add_column :users, :some_column, :string
end
end
However, retrying more complex migrations can cause issues or errors (which may be an acceptable trade-off).
class ComplexMigration < ActiveRecord::Migration[6.0]
def change
add_column :users, :some_column, :string
commit_db_transaction # we can avoid retry if this method is called
begin_db_transaction
add_column :users, :other_column, :string
end
end
Or
class ComplexMigration2 < ActiveRecord::Migration[6.0]
disable_ddl_transaction!
def change
add_column :users, :some_column, :string
connection.transaction do
add_column :users, :other_column, :string
end
end
end
I've added the outside a transaction case for Postgres on the lock_timeout_retries branch if you want to try it. Enable retries with:
StrongMigrations.lock_timeout_retries = 2
@ankane I will give a try and will let you know thanks for this feature, appreciate your work. Your contribution matters a lot.
@ankane I started testing the retry feature started seeing the below error, am I missing something here
@ankane Can you take a quick look whenever you get some time on the above issue, let me know if I am missing anything here?
@ankane Can you take a quick look whenever you get some time on the above issue, let me know if I am missing anything here?
Can I get some help on this testing @ankane
I also get this undef error on strong_migrations_checker
... it was working and now it's not, feels intermittetnt (load order?)
StandardError: An error has occurred, this and all later migrations canceled:
undefined method `strong_migrations_checker' for #<ActiveRecord::MigrationProxy:0x00007fa24ed45038>
/Users/timothykoopmans/.asdf/installs/ruby/2.6.6/lib/ruby/gems/2.6.0/bundler/gems/strong_migrations-716b64e73ceb/lib/strong_migrations/migrator.rb:6:in `ddl_transaction'
/Users/timothykoopmans/.asdf/installs/ruby/2.6.6/lib/ruby/gems/2.6.0/gems/activerecord-5.2.4.4/lib/active_record/migration.rb:1291:in `execute_migration_in_transaction'
/Users/timothykoopmans/.asdf/installs/ruby/2.6.6/lib/ruby/gems/2.6.0/gems/activerecord-5.2.4.4/lib/active_record/migration.rb:1263:in `block in migrate_without_lock'
@bheemreddy181 @timkoopmans Pushed a fix for that error, but running into some weird issues when retrying entire migrations (DB timeouts aren't being applied and the Ruby process hangs), so I wouldn't use the branch in production right now.
@ankane i have seen the same thing while testing this feature , ruby process hangs until i remove my exclusive lock which i am holding from other terminal
Before removing lock
After removing lock
@ankane any thoughts on this retry feature?
Hi @ankane,
we are running into lock timeouts from time to time and would really appreciate the retry feature. I could switch to the lock_timeout_retries branch, but if I understand correctly, this still shouldn't be used in production? Is there anything I can help you with to solve this issue? I can spend some time to create a PR, if I'm able to debug it. Are there any more pointers/information that you came across, that could help? Also, should I open a separate issue for this?
I found the issue :)
StrongMigrations::Checker.set_timeouts
doesn't do anything if @timeouts_set
is already true. That's why no DB timeouts are set on the second try and the migration waits forever to acquire the lock. I can prepare a PR if you want, but basically, setting @timeouts_set = false
again in StrongMigrations::Checker.with_lock_timeout_retries
before actually retrying the block fixed it for me.
This also explains why it's not an issue if no DDL transaction is used, I think, since in that case the DB timeouts won't be rolled back by the transaction?
That makes complete sense - when no DDL transaction is used we don't need a timeout as concurrent index additions will let other transactions to flow through ( will let reads and writes on the table ) . Thanks for looking into it.
We are also hitting lock timeout errors, and are considering alternatives for retries. Thanks for your work, Andrew!
@ankane, why does your implementation only retry when !in_transaction?
? I tried rebasing your branch but get "expected 2 tries and got 1", and it stays out due to that condition.
TimeoutsTest#test_lock_timeout_retries [/Users/tute/Sites/opensource/strong_migrations/test/timeouts_test.rb:125]:
Expected: 2
Actual: 1
90 runs, 154 assertions, 1 failures, 0 errors, 0 skips```
Thank you.
Just added experimental support for lock timeout retries 🎉
@peret That was the issue - thanks!
@tute I don't fully follow the question, but check out the link above and latest test cases for a better idea of how it works.
Also, note that lock_timeout_delay
has been renamed to lock_timeout_retry_delay
from the branch.
If there's any feedback, feel free to create a new issue. Thanks all!
Thanks a ton @ankane for considering this feature.
For
Don't require safety_assured for remove_column if ignored_columns detected (ignored_columns will likely be removed after it runs in production, so this would cause errors for other devs)
I think with safety_assured
we should block migration if the column removed isn't in ignored_columns
. Even if it will still break (adding ignored column in the same deployment), if the deployment was not done before with the ignored_columns
, maybe this warning/raise could help do to not forget to do that.
Something like this:
We detect that :activate_at column is not in
User
'signored_columns
. You should deploy a version with the ignored columns before, and also keep having the ignored columns when running this migration.
At least if it's not blocking the migration. We should have a big warning. I am interested to work on that PR if you are ok @ankane
Hey @benoittgt I think the current approach provides a good balance of safety and simplicity, but feel free to share if you decide to fork and change it for your applications.
For future readers: please create a new issue to discuss any ideas.
I'm still on the fence about many of them. If anyone would like to discuss any of the ideas below, please create a new issue.
Likely ideas
lock_timeout_message
branchtarget_version
and other options -{primary: 10, other: 12}
Other ideas
lock_timeout_retries
branchadd_foreign_key
andchange_column_null
suggestions into a single migration (downside: failed migration can result in db state that needs manually reconciled) -add_foreign_key
may be safe to do since it can detect locksDecided against
safety_assured
forremove_column
ifignored_columns
detected (ignored_columns
will likely be removed after it runs in production, so this would cause errors for other devs)safety_assured
forremove_index!
,execute!
, etc (no good reason to change for now)rails g
commands to generate safe migrations (likegindex
gem)remove_column
withtype
,change_column_default
withto
/from
dump_null_constraints
branch (or may be better to useexecute
command) - this shouldn't be needed in Postgres 12+, as well as Rails 6.1+