elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html
306 stars 113 forks source link

(DBConnection.TransactionError) transaction is already started #272

Closed 0xAX closed 1 year ago

0xAX commented 1 year ago

Hello,

Can somebody explain the meaning of the error in the title. If I understand correctly (but please correct me if I'm wrong) the exception is triggered if somehow begin is executed within Repo.Transaction.

I can reproduce the error with a bit artificial example:

import Ecto.Query
for i <- 1..3 do
  spawn(fn -> 
    r = MyApp.Repo.transaction(fn ->
      MyApp.Repo.delete_all(from m in MyApp.Model)                              
      MyApp.Repo.insert(%MyApp.Model{id: "test_#{i}"})
    end)
  end)
end

with two mariadb instances with galera (the case is reproduced only with two instances). The queries go to the any possible database instance. From the database queries log I see that connection 67 was disconnected:

2022-12-15  9:17:49 67 [Warning] Aborted connection 67 to db: 'mydb' user: 'root' host: '<host>' (Got an error reading communication packets)

Looking at the query log, I see following queries related to connection number 67:

First mariadb instance:

221215  9:17:28     67 Query    BEGIN
            67 Execute  DELETE s0.* FROM `model` AS s0
            67 Query    ROLLBACK

Second mariadb instance:

221215  9:17:15     67 Query    BEGIN
            67 Execute  DELETE s0.* FROM `model` AS s0
            67 Execute  INSERT INTO `model` (`id`,`inserted_at`,`updated_at`) VALUES ('test_1',TIMESTAMP'2022-12-15 09:17:15',TIMESTAMP'2022-12-15 09:17:15')
            67 Query    COMMIT

Both were executed earlier than connection was aborted. Despite artificiality of the example and despite it may causes deadlocks which are kind of expected in this case do I understand correctly that somehow the connection 67 may stay in transaction status and trigger this exception when it will be selected next time or I am wrong? I'd appreciate any hints for debugging this case.

Thank you.

josevalim commented 1 year ago

I think your assessment is right, the next step is to determine who is at fault. Is the actual connection in transaction status, which means galera is the one leaking connection state, or DBConnection/MyXQL thinking the connection is in a transaction while it is not?

I’d start by going through the stack trace and seeing why the error is raised (ie based on which checks and flags).

0xAX commented 1 year ago

@josevalim thanks for the response. That's what I am going basically to do just thinking how to collect details properly. For this moment I see that at least connection within DBConnection has status :transaction on run_begin call:

#PID<0.30859.3> call DBConnection.run_begin(%{__struct__: DBConnection, conn_mode: nil, conn_ref: #Reference<0.1958103445.1858600963.159678>, pool_ref: {:pool_ref, #PID<0.2288.0>, #Reference<0.1958103445.1858732033.144716>, #Reference<0.1958103445.1858600963.159677>, #Reference<0.1958103445.1858994177.216179>, #Reference<0.1958103445.1858600963.159676>}}, {#Function<14.24086428/1 in Ecto.Adapters.SQL.with_log/3>, [checkout: -576450743071387127, checkin: -576450775385102893]}, [log: #Function<14.24086428/1 in Ecto.Adapters.SQL.with_log/3>, queue_target: 200, timeout: 30000, pool_size: 50, pool: DBConnection.ConnectionPool])
#PID<0.30859.3> call MyXQL.Connection.handle_begin([log: #Function<14.24086428/1 in Ecto.Adapters.SQL.with_log/3>, queue_target: 200, timeout: 30000, pool_size: 50, pool: DBConnection.ConnectionPool], %{__struct__: MyXQL.Connection, client: %{__struct__: MyXQL.Client, connection_id: 81, sock: {:gen_tcp, #Port<0.57>}}, cursors: %{}, disconnect_on_error_codes: [1461], last_ref: #Reference<0.1958103445.1858600963.148146>, ping_timeout: 15000, prepare: :named, queries: #Reference<0.1958103445.1858732035.142814>, transaction_status: :transaction})
#PID<0.30859.3> [4] returned MyXQL.Connection.handle_begin/2 -> {:transaction, %{__struct__: MyXQL.Connection, client: %{__struct__: MyXQL.Client, connection_id: 81, sock: {:gen_tcp, #Port<0.57>}}, cursors: %{}, disconnect_on_error_codes: [1461], last_ref: #Reference<0.1958103445.1858600963.148146>, ping_timeout: 15000, prepare: :named, queries: #Reference<0.1958103445.1858732035.142814>, transaction_status: :transaction}}
#PID<0.30859.3> call DBConnection.status_disconnect(%{__struct__: DBConnection, conn_mode: nil, conn_ref: #Reference<0.1958103445.1858600963.159678>, pool_ref: {:pool_ref, #PID<0.2288.0>, #Reference<0.1958103445.1858732033.144716>, #Reference<0.1958103445.1858600963.159677>, #Reference<0.1958103445.1858994177.216179>, #Reference<0.1958103445.1858600963.159676>}}, :transaction, {#Function<14.24086428/1 in Ecto.Adapters.SQL.with_log/3>, [begin: -576450743071351476, checkout: -576450743071387127, checkin: -576450775385102893]})
#PID<0.30859.3> [10] returned DBConnection.status_disconnect/3 -> {:error, %{__exception__: true, __struct__: DBConnection.TransactionError, message: "transaction is already started", status: :transaction}, {#Function<14.24086428/1 in Ecto.Adapters.SQL.with_log/3>, [begin: -576450743071351476, checkout: -576450743071387127, checkin: -576450775385102893]}}
#PID<0.30859.3> [33] returned DBConnection.run_begin/3 -> {:error, %{__exception__: true, __struct__: DBConnection.TransactionError, message: "transaction is already started", status: :transaction}, {#Function<14.24086428/1 in Ecto.Adapters.SQL.with_log/3>, [begin: -576450743071351476, checkout: -576450743071387127, checkin: -576450775385102893]}}

I am not sure how to to figure out how db itself considers connection but I'll try to find.

0xAX commented 1 year ago

@josevalim Seems after query within transaction is failed because of deadlock the transaction status is not changed.

For example successful commit handling:

#PID<0.14832.0> call MyXQL.Connection.handle_commit([log: #Function<14.24086428/1 in Ecto.Adapters.SQL.with_log/3>, queue_target: 200, timeout: 30000, pool_size: 50, pool: DBConnection.ConnectionPool], %{__struct__: MyXQL.Connection, client: %{__struct__: MyXQL.Client, connection_id: 1508, sock: {:gen_tcp, #Port<0.19>}}, cursors: %{}, disconnect_on_error_codes: [1461], last_ref: #Reference<0.2627703310.817364994.42680>, ping_timeout: 15000, prepare: :named, queries: #Reference<0.2627703310.817496065.31979>, transaction_status: :transaction})
#PID<0.14832.0> [449] returned MyXQL.Connection.handle_commit/2 -> {:ok, %{__struct__: MyXQL.Result, columns: nil, connection_id: 1508, last_insert_id: 0, num_rows: 0, num_warnings: 0, rows: nil}, %{__struct__: MyXQL.Connection, client: %{__struct__: MyXQL.Client, connection_id: 1508, sock: {:gen_tcp, #Port<0.19>}}, cursors: %{}, disconnect_on_error_codes: [1461], last_ref: #Reference<0.2627703310.817364994.42680>, ping_timeout: 15000, prepare: :named, queries: #Reference<0.2627703310.817496065.31979>, transaction_status: :idle}}

the transaction_status is set to :idle afterwards.

here is the failed one:

#PID<0.14941.0> call MyXQL.Connection.handle_commit([log: #Function<14.24086428/1 in Ecto.Adapters.SQL.with_log/3>, queue_target: 200, timeout: 30000, pool_size: 50, pool: DBConnection.ConnectionPool], %{__struct__: MyXQL.Connection, client: %{__struct__: MyXQL.Client, connection_id: 1642, sock: {:gen_tcp, #Port<0.165>}}, cursors: %{}, disconnect_on_error_codes: [1461], last_ref: #Reference<0.2627703310.817364993.43381>, ping_timeout: 15000, prepare: :named, queries: #Reference<0.2627703310.817496065.36463>, transaction_status: :transaction})

#PID<0.14941.0> [572] returned MyXQL.Connection.handle_commit/2 -> {:error, %{__exception__: true, __struct__: MyXQL.Error, connection_id: 1642, message: "Deadlock found when trying to get lock; try restarting transaction", mysql: %{code: 1213, name: nil}, statement: "COMMIT"}, %{__struct__: MyXQL.Connection, client: %{__struct__: MyXQL.Client, connection_id: 1642, sock: {:gen_tcp, #Port<0.165>}}, cursors: %{}, disconnect_on_error_codes: [1461], last_ref: #Reference<0.2627703310.817364993.43381>, ping_timeout: 15000, prepare: :named, queries: #Reference<0.2627703310.817496065.36463>, transaction_status: :transaction}}

the transaction_status is still transaction. We end up in the https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection.ex#L1645 with:

** (MyXQL.Error) (1213) Deadlock found when trying to get lock; try restarting transaction
    lib/db_connection.ex:1547: DBConnection.run_transaction/4
    (stdlib 3.17.2) erl_eval.erl:685: :erl_eval.do_apply/6
    (stdlib 3.17.2) erl_eval.erl:446: :erl_eval.expr/5

I am not sure about the behavior in this case.... shouldn't we trigger rollback if commit is failed because of error returned by underlying database driver? Or it should be done manually something like:

import Ecto.Query
for i <- 1..3 do
  spawn(fn -> 
    r = MyApp.Repo.transaction(fn ->
      MyApp.Repo.delete_all(from m in MyApp.Model)                              
      MyApp.Repo.insert(%MyApp.Model{id: "test_#{i}"})
    end)
    case r do
      {:error, error} ->
        Repo.rollback(r)
      _ ->
        :ok
    end
  end)
end

hmm but no ... that is MyXQL exception

josevalim commented 1 year ago

In this case I believe you should submit a PR to MyXQL. It should convert the error code 1213 on COMMIT into a DBConnection.TransactionError so DBConnection knows to roll it back. WDYT?

0xAX commented 1 year ago

yes, I also don't see how to handle it carefully without adjusting myxql, I'll try to prepare PR after adjustments and some testing

0xAX commented 1 year ago

@josevalim I've prepared small fix for this as you suggested in the https://github.com/elixir-ecto/myxql/compare/master...0xAX:myxql:return-transaction-error. Now I see in tracing that run_transaction returns {:error, :rollback} for the affected query. I'd expect to see rollback for this connection but instead I continue to see disconnect.

One more moment. For now handle_commit may return tuple with error, exception and connection state in the https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection.ex#L1734. Shouldn't it be handled there, something like:

case Holder.handle(pool_ref, :handle_commit, [], opts) do
  {:error, %DBConnection.TransactionError{}, _conn_state} ->
    {:rollback, run_rollback(conn, meter, opts)}
  ...
  ...
  ...

?

I see that handle_common_result may catch this clause {:error, exception, conn_state} but it will not lead to rollback