codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0
447 stars 176 forks source link

Snapshot Isolation may not be as isolated as one would like #336

Open aphyr opened 9 years ago

aphyr commented 9 years ago

I might be wrong about this, and I'd like some help checking my work--but it looks like MariaDB + Galera Cluster, version 7.4.7 from the MariaDB Debian Jessie repos, may not correctly enforce snapshot isolation between transactions.

This Jepsen test sets up a five node Galera cluster and runs a simple simulation of transfers between two bank accounts, both starting with balance = 10.

create table if not exists accounts
  (id      int not null primary key,
   balance bigint not null)
INSERT INTO accounts ( id, balance ) VALUES ( 0, 10 )
INSERT INTO accounts ( id, balance ) VALUES ( 1, 10 )

We then have several clients, connected to all 5 nodes, execute a mix of read transactions and balance-transfer transactions at SERIALIZABLE. Read transactions simply read all balances:

           67 Connect   jepsen@192.168.122.1 as anonymous on jepsen
           67 Query show variables like 'max_allowed_packet'
           67 Query SELECT @@tx_isolation
           67 Query SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
           67 Query set autocommit=0
           67 Query select * from accounts
           67 Query COMMIT
           67 Query set autocommit=1
           67 Query SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ
           67 Quit

Transfer transactions read the balance of both accounts, move some from one to the other, and, if both remain positive, write back the new balances. This one transfers 0, and writes 10 and 10 respectively. Note that the write set always covers the read set, and there are only two rows in the table to begin with--I don't think we should be subject to write skew anomalies here, and phantoms shouldn't be an issue.

           62 Connect   jepsen@192.168.122.1 as anonymous on jepsen
           62 Query show variables like 'max_allowed_packet'
           62 Query SELECT @@tx_isolation
           62 Query SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
           62 Query set autocommit=0
           62 Query select * from accounts where id = 0
           62 Query select * from accounts where id = 1
           62 Query UPDATE accounts SET balance = 10 WHERE id = 0
           62 Query UPDATE accounts SET balance = 10 WHERE id = 1
           62 Query COMMIT
           62 Query ROLLBACK
           62 Query set autocommit=1
           62 Query SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ
           62 Quit  

As I understand the isolation level documentation, Galera Cluster should provide Serializable isolation locally and Snapshot Isolation between nodes. Since SI is higher than RR, I assume you have to set ISOLATION LEVEL SERIALIZABLE to actually obtain SI.

As I understand SI, given fixed working sets and with write sets that cover reads, all balances should sum to the same value. As a test invariant, we measure that every read transaction sees a consistent snapshot--in this case, that the balances always sum to 20. This works for low levels of concurrency, but with 50 concurrent clients, we start seeing anomalous reads. For instance:

4   :ok :read   [3 17]
0   :invoke :read   nil
5   :invoke :transfer   {:from 0, :to 1, :amount 3}
23  :invoke :read   nil
9   :ok :transfer   {:from 1, :to 0, :amount 1}
6   :invoke :transfer   {:from 1, :to 0, :amount 0}
34  :ok :transfer   {:from 1, :to 0, :amount 3}
23  :ok :read   [4 16]
5   :fail   :transfer   [:negative 0 -1]
0   :ok :read   [2 17]
33  :invoke :read   nil
18  :invoke :transfer   {:from 1, :to 0, :amount 4}
29  :invoke :transfer   {:from 0, :to 1, :amount 2}
33  :ok :read   [7 13]

See it? Process 0 reads the balances 2 and 17, which only sum to 19. A correct value, read by process 4, might have been [3 17], or the subsequent [7 13]. There are a lot of transactions in flight at any given point in this log, so it's a little hard to reconstruct the order of events. :(

Transfer transactions see inconsistent reads as well. For instance, this transaction only sees a total balance of 17, not 20, and attempts to write [9 8].

           66 Connect   jepsen@192.168.122.1 as anonymous on jepsen
           66 Query show variables like 'max_allowed_packet'
           66 Query SELECT @@tx_isolation
           66 Query SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
           66 Query set autocommit=0
           66 Query select * from accounts where id = 1
           66 Query select * from accounts where id = 0
           66 Query UPDATE accounts SET balance = 8 WHERE id = 1
           66 Query UPDATE accounts SET balance = 9 WHERE id = 0
           66 Query COMMIT
           66 Query ROLLBACK
           66 Query set autocommit=1
           66 Query SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ
           66 Quit

This particular commit conflicts and hits a rollback--and in fact, every transfer transaction I've analyzed so far appears to roll back, which suggests write promotion is doing the right thing here, and only read transactions fail to serialize appropriately. I'm struggling to understand whether this qualifies as an instance of http://www.cs.umb.edu/~poneil/ROAnom.pdf, or if these kinds of reads should be prevented in this case.

The complete logfiles from each host, including query logs, and the history of operations from Jepsen, plus analysis showing all inconsistent reads, are available here.

   {:type :wrong-total,
    :expected 20,
    :found 16,
    :op
    {:value [2 14],
     :time 15245300529,
     :process 1,
     :type :ok,
     :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 16,
    :op
    {:value [2 14],
     :time 15245328813,
     :process 21,
     :type :ok,
     :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 16,
    :op
    {:value [2 14],
     :time 15245350309,
     :process 16,
     :type :ok,
     :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 19,
    :op
    {:value [5 14],
     :time 97109842242,
     :process 86,
     :type :ok,
     :f :read}}]}}

Do you have any suggestions here? Have I misunderstood Galera's guarantees? Or is there additional debugging information I can provide?

sjaakola commented 9 years ago

The first committer wins rule is not honored by galera certification, this has been optimized for performance reasons, at some point. However, it is possible to support SI and make it configurable per session, so we should leave this as a potential task for some future release.

Serializable transactions are supported only if all writes are targeted to one node. Any replication coming from other nodes will not honor read locks held by serializable transaction. Achieving true serializability would require populating also read sets in Galera replication. This would cause major impact for performance, but again it would be configurable per session, and a potential new task for future, as well.

Thanks for the test case, and detailed bug report!

aphyr commented 9 years ago

The first committer wins rule is not honored by galera certification

That's... interesting. So when the docs say

Transaction isolation also occurs at the cluster level. Between transactions processing on separate nodes, Galera Cluster implements a transaction level called SNAPSHOT-ISOLATION. The SNAPSHOT-ISOLATION level occurs between REPEATABLE-READ and SERIALIZABLE.

Does that mean Galera should provide snapshot isolation? Without first-committer-wins, you can't really have snapshot isolation, right? Does Galera actually provide some weaker isolation level--e.g. snapshot isolation plus read skew, and, if so, should the docs be changed?

aphyr commented 9 years ago

Adjusting the concurrency and duration of the test, I can now confirm that transfer transactions are also affected: not only can read-only transactions see intermediate results from transfers, but transfers can write back intermediate results successfully, causing the permanent creation or destruction of money in the system.

Here are two test cases; in the first, the total rises from $20 to $22, and in the second, it falls to $15.

I believe this kind of data corruption is a more serious issue than inconsistent pure-read transactions--it may be a good idea to revise your documentation to reflect the possibility of data corruption for users relying on Snapshot Isolation's invariants.

sjaakola commented 9 years ago

By trading the "first committer wins rule" for performance, Galera does not any more provide SI. We need to change the documentation accordingly. As said, it is possible to bring back SI, as session property, we will consider it for future road map.

temeo commented 9 years ago

@aphyr I did some experimenting around Jepsen Galera bank test (with latest Galera Cluster release binaries from http://releases.galeracluster.com/) and found out the reason the test fails might be in SERIALIZABLE transaction isolation mode, or rather in LOCK IN SHARE MODE. MySQL documentation https://dev.mysql.com/doc/refman/5.6/en/set-transaction.html states about SERIALIZABLE:

This level is like REPEATABLE READ, but InnoDB implicitly converts all plain SELECT statements to SELECT ... LOCK IN SHARE MODE if autocommitis disabled.

so running test in SERIALIZABLE mode is essentially the same as running the test in REPEATABLE READ mode with SELECT statements modified to LOCK IN SHARE MODE.

After some experimenting I found out that setting transaction isolation mode to REPEATABLE READ made bank test pass, but making any of the SELECT statements to LOCK IN SHARE MODE explicitly made the test fail.

Also, setting transaction isolation mode to SERIALIZABLE and doing SELECTs in FOR UPDATE mode (see https://dev.mysql.com/doc/refman/5.6/en/innodb-locking-reads.html for explanation) made the test pass.

So it seems that something in galera/mysql-wsrep implementation (probably in prioritized transaction processing) does not honor shared locks properly or the effect of brute force abort caused by prioritized transaction is not propagated properly for locally executing (so far read only) transaction.

Concurrency level in tests was 50 concurrent clients, so I'm quite confident that the success result is valid.

Edit: My changes to jepsen can be found from: https://github.com/codership/jepsen/tree/galeracluster

temeo commented 9 years ago

The following simple MTR test case demonstrates that read-only transaction with SERIALIZABLE isolation level will normally abort if conflicting high priority (certified) transaction commits before it.

--source include/galera_cluster.inc
--source include/have_innodb.inc

#
# Test a local read-only transaction with SERIALIZABLE transaction isolation
# mode in precense of BF abort
#

CREATE TABLE t1 (f1 INTEGER PRIMARY KEY, f2 INTEGER) ENGINE=InnoDB;

--connection node_2
--let $wsrep_local_bf_aborts_before = `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_local_bf_aborts'`
INSERT INTO t1 VALUES (1, 1);
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET AUTOCOMMIT=OFF;
START TRANSACTION;
SELECT * FROM t1 WHERE f1 = 1;

--connection node_1
UPDATE t1 SET f2 = 2 WHERE f1 = 1;

--connection node_2
--error ER_LOCK_DEADLOCK
SELECT * FROM t1 where f1 = 1;

--let $wsrep_local_bf_aborts_after = `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_local_bf_aborts'`

--disable_query_log
--eval SELECT $wsrep_local_bf_aborts_after - $wsrep_local_bf_aborts_before = 1 AS wsrep_local_aborts_increment;
--enable_query_log

DROP TABLE t1;
Marqin commented 8 years ago

How's the progress on this issue?

HendricksonJS commented 5 years ago

Any further updates? I am evaluating MariaDB TX and would love to know if this situation has changed.

LifeIsStrange commented 2 years ago

@sjaakola friendly ping as this seems like an important matter

ayurchen commented 2 years ago

@LifeIsStrange We have experimental SI support, but don't have enough demand to justify testing and maintaining it. It is not even clear if it will be of any practical use at all as it will drastically increase the certification interval and conflict probability. Essentially what you're asking for is keeping a distributed system in a permanent snapshot state. It COSTS. And mostly it costs concurrency.