cockroachdb / activerecord-cockroachdb-adapter

CockroachDB adapter for ActiveRecord.
Apache License 2.0

Rails transaction retry after rollback #258

Open Piioo opened 1 year ago

Piioo commented 1 year ago

Hello,

I don't have much information, but maybe you can see something I'm missing.

We have a transaction where we save 7 objects in different tables. On our staging/production systems the transaction commit sometimes fails with an error. The transaction is rolled back and Rails tries to retry it.

Our problem is that the 2nd (retried) commit saves only 3 of the 7 objects in the DB, without any error.

    def execute
      @admin = ResourceOwner::User.new(admin_params)
      @company = build_company
      ActiveRecord::Base.transaction do
        @company.save!
        @company.additional_info.save!
        @admin.save!
        ConfirmationToken::PasswordResetToken.prepare_token(@admin, expires_at: nil)
        @admin.add_role(:company_admin, company)
      end
    end

And here are the errors we get from the database:

2022-12-09T07:16:57.974550705+01:00 stdout F {"sql":"COMMIT","name":"TRANSACTION","binds":[],"type_casted_binds":[],"statement_name":null,"exception":["ActiveRecord::SerializationFailure","PG::TRSerializationFailure: ERROR:  restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh): \"sql txn\" meta={id=7caee3f1 key=/Table/72/1/820825936153870355 pri=0.00425340 epo=0 ts=1670566614.790254963,2 min=1670566602.063413353,0 seq=52} lock=true stat=PENDING rts=1670566602.063413353,0 wto=false gul=1670566602.563413353,0\nHINT:  See: https://www.cockroachlabs.com/docs/v21.1/transaction-retry-error-reference.html#retry_serializable\n"],"exception_object":"PG::TRSerializationFailure: ERROR:  restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh): \"sql txn\" meta={id=7caee3f1 key=/Table/72/1/820825936153870355 pri=0.00425340 epo=0 ts=1670566614.790254963,2 min=1670566602.063413353,0 seq=52} lock=true stat=PENDING rts=1670566602.063413353,0 wto=false gul=1670566602.563413353,0\nHINT:  See: https://www.cockroachlabs.com/docs/v21.1/transaction-retry-error-reference.html#retry_serializable\n","duration":45.32,"request_id":"25d5d2166a7737339f1bd0c0f73fd54e","source":"100.65.49.24","tags":["request","exception"],"@timestamp":"2022-12-09T06:16:57.973Z","@version":"1"}

2022-12-09T07:16:57.98031866+01:00 stdout F {"sql":"ROLLBACK","name":"TRANSACTION","binds":[],"type_casted_binds":[],"statement_name":null,"exception":["ActiveRecord::StatementInvalid","PG::InternalError: ERROR:  there is no transaction in progress\n"],"exception_object":"PG::InternalError: ERROR:  there is no transaction in progress\n","duration":5.12,"request_id":"25d5d2166a7737339f1bd0c0f73fd54e","source":"100.65.49.24","tags":["request","exception"],"@timestamp":"2022-12-09T06:16:57.979Z","@version":"1"}

2022-12-09T07:16:58.216634327+01:00 stdout F {"sql":"BEGIN","name":"TRANSACTION","binds":[],"type_casted_binds":[],"statement_name":null,"duration":4.69,"request_id":"25d5d2166a7737339f1bd0c0f73fd54e","source":"100.65.49.24","tags":["request"],"@timestamp":"2022-12-09T06:16:58.215Z","@version":"1"}

After this, the Rails retry begins a new transaction (last log message) and saves only the token and role. There are no INSERT queries for the admin or company.

Maybe the other objects already have an id (the token and role need the id from the admin), and for this reason they are not saved again. Or the company and admin are cached while the token and role are not; the token and role may only be created once admin.id is present.

I'm trying to write an RSpec test, but I fail to trigger a transaction retry.

Do you have any idea what we could do?

Piioo commented 1 year ago

The problem is that some of the objects keep their IDs and persisted status, so on the retry Rails thinks it does not have to save them because they are already persisted.
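
For illustration, this is what I suspect happens inside the retried block (assumed behaviour, not verified):

```ruby
# Assumed behaviour on the retried attempt (not verified):
@company.persisted?  # => true, state kept from the rolled-back first attempt
@company.save!       # no INSERT is issued, so the row never reaches the DB
```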

Piioo commented 1 year ago

I think the bug comes from here: https://github.com/cockroachdb/activerecord-cockroachdb-adapter/blob/master/lib/active_record/connection_adapters/cockroachdb/transaction_manager.rb#L13-L21

This code catches the ActiveRecord::StatementInvalid error before Rails can handle it in https://github.com/rails/rails/blob/v7.0.4.2/activerecord/lib/active_record/connection_adapters/abstract/transaction.rb#L322-L329

So transaction.state.invalidate! and rollback_transaction never run, and Rails does not try to save the objects again.

(Right now it's not possible for me to create a test, so I can't fix this :/ )
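
If someone with a working test setup wants to try, a starting point might be to force a serialization failure on the first commit. A rough, untested sketch (stubbing commit_db_transaction is an assumption about where the failure surfaces, and `create_company_service` is a placeholder for whatever object defines the `execute` method above):

```ruby
# Untested sketch: raise a serialization failure on the first COMMIT so the
# adapter's retry logic in transaction_manager.rb gets exercised.
it "re-inserts every record when the transaction is retried" do
  attempts = 0
  allow(ActiveRecord::Base.connection)
    .to receive(:commit_db_transaction).and_wrap_original do |original, *args|
      attempts += 1
      raise ActiveRecord::SerializationFailure, "restart transaction" if attempts == 1

      original.call(*args)
    end

  # Placeholder service object holding the `execute` method shown above.
  expect { create_company_service.execute }.to change(ResourceOwner::User, :count).by(1)
end
```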

shfung commented 3 months ago

I think we have encountered the same problem, where a DB transaction is retried but the record's in-memory status is not reset; hence the record is not saved, but the after_save callback still fires.

We think that this problem can be reproduced fairly reliably using the following steps:

1. Run the following setup script in a console, and don't exit the console after that:

        ActiveRecord::Base.logger = Logger.new(STDOUT)
        ActiveRecord::Base.configurations = {
          "cockroachdb" => {
            adapter: "cockroachdb",
            host: "localhost",
            port: 26256,
            user: "root",
            database: "crdb_rollback_test",
          }
        }
        ActiveRecord::Base.establish_connection(:cockroachdb)

        ActiveRecord::Schema.define do
          create_table :customers, force: true do |t|
            t.string :name
          end

          create_table :customer_data, force: true do |t|
            t.belongs_to :customer, null: false, foreign_key: true
            t.string :city
            t.datetime :deleted_at
          end
        end

        class CockroachDbApplicationRecord < ActiveRecord::Base
          self.abstract_class = true

          connects_to database: { writing: :cockroachdb, reading: :cockroachdb }
        end

        class Customer < CockroachDbApplicationRecord
        end

        class CustomerData < CockroachDbApplicationRecord
          belongs_to :customer
          after_save :mark_as_current

          # we want to set all other CustomerData to have deleted_at
          def mark_as_current
            if deleted_at.blank?
              ids = CustomerData.where(customer_id: customer_id, deleted_at: nil).where.not(id: id).pluck(:id)
              puts 'mark_as_current sleeping'
              sleep 5
              puts 'mark_as_current sleeping end'
              CustomerData.where(id: ids).update_all(deleted_at: Time.current)
            end
          end
        end

        customer = Customer.create!(name: 'Name')
        CustomerData.create!(customer_id: customer.id, city: 'SF')
        customer_data = CustomerData.new(customer_id: customer.id, city: 'NYC')

2. Open a separate console and run the following script to connect to the same CRDB database, and don't exit that console afterwards either:
```ruby
ActiveRecord::Base.logger = Logger.new(STDOUT)
ActiveRecord::Base.configurations = {
  "cockroachdb" => {
    adapter: "cockroachdb",
    host: "localhost",
    port: 26256,
    user: "root",
    database: "crdb_rollback_test",
  }
}
ActiveRecord::Base.establish_connection(:cockroachdb)

# The abstract base class must be defined in this console too,
# otherwise the model definitions below raise a NameError.
class CockroachDbApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: { writing: :cockroachdb, reading: :cockroachdb }
end

class Customer < CockroachDbApplicationRecord
end

class CustomerData < CockroachDbApplicationRecord
  belongs_to :customer
  after_save :mark_as_current

  def mark_as_current
    if deleted_at.blank?
      ids = CustomerData.where(customer_id: customer_id, deleted_at: nil).where.not(id: id).pluck(:id)
      CustomerData.where(id: ids).update_all(deleted_at: Time.current)
    end
  end
end

customer = Customer.first
customer_data = CustomerData.new(customer_id: customer.id, city: 'NYC')
```

3. Go back to the first console and continue by running:

        customer_data.save!

4. Once you see 'mark_as_current sleeping' printed in the first console, immediately run the following in the 2nd console:

        customer_data.save!

   In the 2nd console you should see the transaction committed without any problem. But in the 1st console (the one with the sleep) you should see the transaction being committed, then rolled back, and after that only the two SQL statements from the mark_as_current callback, without any additional "CustomerData Create". If you now run

        CustomerData.where(customer_id: customer.id, deleted_at: nil)

   in either console, it returns an empty array, which is the wrong behavior.

@BuonOmo Can you help look into this using the reproduction steps above? Thanks

BuonOmo commented 3 months ago

@shfung I can reproduce the behaviour you describe (the empty array). I'll make it simpler and put it in a single script to run in tests. However, it feels like the behaviour is working quite normally. You are not taking advantage of any transaction; I can see how this could fail with any kind of database.

So two things:

Async Design

Run an after_commit callback that triggers an async job (I recommend Sidekiq). The job takes a lock keyed by customer_id (Redis redlock, for instance) and runs something like:

    # Should be wrapped in a transaction.
    def perform(customer_id, cd_id)
      # Should raise to re-enqueue the job if the lock is already taken.
      Locker.lock("c#{customer_id}-cd#{cd_id}") do
        ids = CustomerData.where(customer_id: customer_id, deleted_at: nil).where.not(id: cd_id).pluck(:id)
        CustomerData.where(id: ids).update_all(deleted_at: Time.current)
        CustomerData.where(id: cd_id).update_all(deleted_at: nil)
      end
    end

Note that I don't fully understand your use case: as far as I can see you only ever soft-delete data, and my script is kind of un-deleting. Another solution could be something like:

    cd = CustomerData.find(cd_id)
    # Only delete older ones.
    CustomerData.where(customer_id: customer_id).where("created_at < ?", cd.created_at).update_all(deleted_at: Time.current)

Or just not have this column and use the latest customer data as the valid one...
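
For illustration, the column-free variant could be as small as this (sketch; it assumes ordering by created_at is what you want):

```ruby
# Sketch: treat the most recently created CustomerData as the current one,
# instead of maintaining a deleted_at flag.
current = CustomerData.where(customer_id: customer_id).order(created_at: :desc).first
```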

Simple design

Use an update lock around the customer (depending on your usual queries, this can lead to overhead):

    def mark_as_current
      if deleted_at.blank?
        Customer.find(customer_id).with_lock do
          ids = CustomerData.where(customer_id: customer_id, deleted_at: nil).where.not(id: id).pluck(:id)
          CustomerData.where(id: ids).update_all(deleted_at: Time.current)
        end
      end
    end

Note that the first data created gets priority in that case. And you could consider a FOR UPDATE NOWAIT lock to avoid the overhead and raise immediately instead.
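
Something along these lines (sketch; it assumes your CockroachDB version supports NOWAIT, and it raises instead of waiting if another transaction holds the row lock):

```ruby
# Sketch: take the customer row lock with FOR UPDATE NOWAIT so a concurrent
# holder of the lock makes this fail fast instead of blocking.
Customer.transaction do
  customer = Customer.lock("FOR UPDATE NOWAIT").find(customer_id)
  # ... run the mark_as_current work here ...
end
```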


@Piioo I had not seen your message; I think it is quite important, and related to the bug I am currently trying to solve in #333. Thank you for the links, I'll have a look at them while finishing my PR, and we'll see if this fixes everything. Basically, the client retry logic is messing with Rails' internal logic.

shfung commented 3 months ago

@BuonOmo Thanks for the suggestions. We noticed some places for improvement in our own implementation and made changes there, but we still believe there is an underlying retry issue caused by the gem.

Can you elaborate on what you meant by "You are not taking advantage of any transaction; I can see how this could fail with any kind of database"?

The after_save callback should be executed in the same DB transaction as the save!; that's why when the DB transaction is retried, the after_save callback is retried as well. We tried the same script against a MySQL setup with the SERIALIZABLE isolation level, and the race condition didn't happen.
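
For illustration, the statements we expect from a single successful customer_data.save! look roughly like this (schematic, not verbatim logs):

```ruby
customer_data.save!
# BEGIN
#   INSERT INTO "customer_data" ...                -- the save itself
#   SELECT "customer_data"."id" ...                -- pluck in mark_as_current
#   UPDATE "customer_data" SET "deleted_at" = ...  -- update_all in mark_as_current
# COMMIT
```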

Looking into the code, it seems that the DB transaction errors out when it is committed in the ensure block of the within_new_transaction method of the TransactionManager object: https://github.com/rails/rails/blob/6f0d1ad14b92b9f5906e44740fce8b4f1c7075dc/activerecord/lib/active_record/connection_adapters/abstract/transaction.rb#L565 When it fails, it raises an

ActiveRecord::SerializationFailure:"PG::TRSerializationFailure: ERROR:  restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE  - failed preemptive refresh due to conflicting locks on...

exception that is caught by the rescue Exception underneath it: https://github.com/rails/rails/blob/6f0d1ad14b92b9f5906e44740fce8b4f1c7075dc/activerecord/lib/active_record/connection_adapters/abstract/transaction.rb#L569-L572 and it then tries to roll back the transaction. But this PG::TRSerializationFailure error is a special error indicating that the DB transaction has failed and should not be interacted with anymore. So, when the code tries to roll back the transaction, it gets another error:

PG::NoActiveSqlTransaction: ERROR:  there is no transaction in progress

This exception gets raised up to the patch applied by this gem and triggers the retry: https://github.com/cockroachdb/activerecord-cockroachdb-adapter/blob/6a2a5931ddda20dc4799ca9e46bc1288929f7e19/lib/active_record/connection_adapters/cockroachdb/transaction_manager.rb#L25-L32 Unfortunately, because the rollback failed, ActiveRecord never rolled back the in-memory state of the record. So, when the DB transaction is retried, ActiveRecord thinks the CustomerData was already saved and doesn't insert it into the DB, but it still executes the after_save callback that deletes the other CustomerData rows.

BuonOmo commented 3 months ago

> There is an underlying retry issue caused by the gem.

That's for sure, and it is the main thing I'm investigating these days! But I am not sure resolving this issue will fully solve your design problems... These are tricky race conditions that you'd need to test precisely!

> You are not taking advantage of any transaction; I can see how this could fail with any kind of database.

Well, if there is a wrapping transaction my comment is useless. The reproduction code you sent wasn't using any, hence my comment. Maybe your codebase does wrap every HTTP call in a transaction (default behaviour for Rails, if I remember correctly).