Transactions on MariaDB-Galera violated Causal Consistency

20211202na commented 2 years ago

My transaction data generated on MariaDB-Galera-Cluster violated Causal Consistency.

I established the Galera Cluster on Emulab with 3 server nodes and 3 client nodes. Here is the configuration information:

Galera Version == 26.4.9
MariaDB Version == 10.4.22

server_node = 3
client_node = 3
session_per_client_node = 2
txn_per_session = 20
operation_per_txn = 25
key_number = 20
write_only_txn_rate = 0.2
read_only_txn_rate = 0.2
read_write_rate_per txn = 0.5

The violation appears in three transactions from two sessions, where r/w(A,B) denotes reading/writing value B on key A:

T1 (session 2 txn 16): w(1,2272)
T2 (session 2 txn 17): r(1,2272) r(0,6211)
T3 (session 4 txn 18): r(1,2272) w(0,6211) w(1,6212)

T1 and T3 have wr order on key 1, so T1 -> T3. But T3 and T1 also have commit order (co) T3 -> T1, because:
T1 and T2 have wr order on key 1, so T1 -> T2;
T3 and T2 have wr order on key 0, and T3 writes on key 1, so T3 -> T2.
Thus, there is a cycle between T1 and T3, which violates Causal Consistency.
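The argument above can be replayed mechanically. The following is only a minimal sketch, not the reporter's checker: the function names are made up for illustration, and the commit-order inference looks only at direct write-read edges rather than full causal reachability.

```python
from collections import defaultdict

# The three transactions from the report, as (op, key, value) triples.
txns = {
    "T1": [("w", 1, 2272)],
    "T2": [("r", 1, 2272), ("r", 0, 6211)],
    "T3": [("r", 1, 2272), ("w", 0, 6211), ("w", 1, 6212)],
}

def wr_edges(txns):
    """Return {(writer, reader, key)}; relies on written values being unique."""
    writer_of = {(k, v): t for t, ops in txns.items() for op, k, v in ops if op == "w"}
    edges = set()
    for t, ops in txns.items():
        for op, k, v in ops:
            w = writer_of.get((k, v))
            if op == "r" and w is not None and w != t:
                edges.add((w, t, k))
    return edges

def inferred_co_edges(txns, wr):
    """If w1 -wr(k)-> reader, w3 -wr-> reader, and w3 also writes k, then w3 -> w1."""
    writes = {t: {k for op, k, _ in ops if op == "w"} for t, ops in txns.items()}
    co = set()
    for w1, reader, k in wr:
        for w3, reader2, _ in wr:
            if reader2 == reader and w3 != w1 and k in writes[w3]:
                co.add((w3, w1))
    return co

def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as (a, b) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def visit(n):
        state[n] = 1
        for m in graph[n]:
            if state.get(m) == 1 or (m not in state and visit(m)):
                return True
        state[n] = 2
        return False

    return any(n not in state and visit(n) for n in list(graph))

wr = wr_edges(txns)
co = inferred_co_edges(txns, wr)
all_edges = {(a, b) for a, b, _ in wr} | co
print(sorted(wr))            # wr edges: T1->T2 and T1->T3 on key 1, T3->T2 on key 0
print(sorted(co))            # inferred commit order: T3->T1
print(has_cycle(all_edges))  # True: T1 -> T3 -> T1
```

The cycle only appears once the inferred T3 -> T1 edge is added; the write-read edges alone are acyclic.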

ayurchen commented 2 years ago

Hi, could you please provide exact database schema and transactions code. Also, could you please explain what is "wr order" and what does "->" mean?

What I can understand from your report is:
T1 writes (1,2272)
T3 reads (1,2272) and writes (0,6211)
T2 reads (1,2272) and then reads (0,6211)
which does not look like any violation...

20211202na commented 2 years ago

Hi, we use a simple table schema, just containing key-value pairs. Here is the initialized table:

KEY	VALUE
0	0
1	0
2	0
...	0
19	0

For each write on each key, the value will be unique. I have uploaded the workload generation code in this repo: https://github.com/20211202na/galera_data_generation

The causal consistency we checked is from the paper published on SOSP '11 (https://dl.acm.org/doi/10.1145/2043556.2043593). In this paper, they define a consistency model---causal+. Causal+ is stronger than causal consistency because it adds convergent conflict handling. Another paper from POPL'17 (https://dl.acm.org/doi/10.1145/3093333.3009888) also has the definition of this property as below: “There is a total order between non-causally dependent operations and each site can execute operations only in that order (when it sees them). Therefore, a site is not allowed to revise its ordering of non-causally dependent operations, and all sites execute in the same order the operations that are visible to them.” Also, causal+ consistency is strictly weaker than snapshot isolation level.

The "wr" order means that one transaction writes a value to a key and another transaction reads that same value from the same key, so the transaction with the write operation must happen before the transaction containing the read operation. In my previous example, T1 contains operation w(1,2272) and T2 has operation r(1,2272), so T1 must happen earlier than T2, denoted as T1 -> T2. According to the established read-write relationship, w(1,2272) of T1 should be ordered after w(1,6212) of T3 in "arbitration order" (see Section 4.5 of the POPL'17 paper), but T3 also reads w(1,2272) of T1, forming the bad pattern CyclicCF defined in the POPL'17 paper.

sciascid commented 2 years ago

@20211202na Are you using the READ COMMITTED isolation level in your test? With READ COMMITTED, the following execution would explain your findings:

-- Initially
CREATE TABLE t (k INTEGER PRIMARY KEY, v INTEGER);
INSERT INTO t VALUES (0, 0);
INSERT INTO t VALUES (1, 0);

-- T1
START TRANSACTION;
UPDATE t SET v = 2272 WHERE k = 1;
COMMIT;

-- T2
START TRANSACTION;
SELECT * FROM t WHERE k = 1; -- Returns 2272

-- T3
START TRANSACTION;
SELECT * FROM t WHERE k = 1; -- Returns 2272
UPDATE t SET v = 6211 WHERE k = 0;
UPDATE t SET v = 6212 WHERE k = 1;
COMMIT;

-- T2
SELECT * FROM t WHERE k = 0; -- Returns 6211 with READ COMMITTED; with REPEATABLE READ it returns 0 (the initial value)
COMMIT;

20211202na commented 2 years ago

@sciascid Thanks for your reply. I used the default isolation level -- REPEATABLE READ for MariaDB and SNAPSHOT ISOLATION for Galera Cluster. The above transactions violate Causal Consistency. Since Causal Consistency is weaker than Snapshot Isolation, these transactions also violate Snapshot Isolation. Besides that, according to your explanation, did the above transactions also violate REPEATABLE READ?

20211202na commented 2 years ago

@ayurchen @sciascid In fact, after further investigation, this counterexample even violates "read atomicity" (RA), coined by Bailis et al. at SIGMOD'14 https://dl.acm.org/doi/10.1145/2588555.2588562. RA is even weaker than transactional causal consistency, which is in turn weaker than Snapshot Isolation.

Fig. 2(b) (also attached) in https://dl.acm.org/doi/10.1145/3360591 gives a nice visualization of why our example violates RA, where t1, t2, and t3 in Fig. 2(b) are mapped to txn16, txn17, and txn18 in our case, respectively.

[attached image: 1643792533419]
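To make the read-atomicity claim concrete, here is a small sketch of a fractured-read check applied to the three transactions of the original report. It is an illustration only: `fractured_reads` is a hypothetical helper, and the per-key version order is supplied by hand (for key 1 it follows from T3 reading 2272 before overwriting it with 6212).

```python
# Final writes per transaction (key -> value) and the reads of T2.
final_writes = {
    "T1": {1: 2272},
    "T3": {0: 6211, 1: 6212},
}
t2_reads = {1: 2272, 0: 6211}

# Assumed version order per key, oldest first.
version_order = {1: [2272, 6212], 0: [0, 6211]}

def fractured_reads(reads, final_writes, version_order):
    """Return (key, seen_value, expected_value, writer) tuples that break read atomicity."""
    found = []
    for writer, writes in final_writes.items():
        # Only writers whose effects the reader has partly observed matter.
        if not any(reads.get(k) == v for k, v in writes.items()):
            continue
        for k, v in writes.items():
            if k in reads and reads[k] != v:
                order = version_order[k]
                # The reader saw a version older than the one this writer installed.
                if order.index(reads[k]) < order.index(v):
                    found.append((k, reads[k], v, writer))
    return found

print(fractured_reads(t2_reads, final_writes, version_order))
# [(1, 2272, 6212, 'T3')]
```

T2 observes T3's write to key 0 but an older version of key 1, which T3 also wrote; that is the fractured read.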

janlindstrom commented 2 years ago

We must remember here that InnoDB offers only REPEATABLE READ by default. And, as we know, the definition of this isolation level has issues, as it is defined in terms of phenomena. See https://www.cs.umb.edu/~poneil/iso.pdf for A Critique of ANSI SQL Isolation Levels.

siliunobi commented 2 years ago

@janlindstrom Thanks for the pointer! As far as I understand, the original issue was reported on the anomalies found against the claimed Snapshot Isolation by Galera. According to the official document, with REPEATABLE READ by MariaDB and Snapshot isolation by Galera, causal consistency anomalies (as well as read atomicity anomalies in the last comment by the reporter) must not occur, right?

Plus, I myself am curious about which REPEATABLE READ MariaDB refers to. As far as I know, there are two different definitions in the literature. See the journal paper by Bailis et al. https://dl.acm.org/doi/10.1145/2909870 where one definition is much stronger than the other.

janlindstrom commented 2 years ago

I think there is a misunderstanding, as https://galeracluster.com/library/training/tutorials/supporting-transaction-isolation-levels.html points out: "if you have configured the default REPEATABLE READ isolation, transactions issued on the same node will behave under REPEATABLE READ semantics. However, for transactions issued on separate cluster nodes, the ‘first committer wins’ rule of SNAPSHOT ISOLATION is provided [...] Therefore, it is not safe for the application to rely on SNAPSHOT ISOLATION semantics."

siliunobi commented 2 years ago

If I understand you correctly, Galera cannot provide FULL snapshot isolation as defined in the literature (for separate cluster nodes)? If only the ‘first committer wins’ rule is guaranteed (which is mainly about write-write conflicts), then Galera doesn't provide any causality guarantee at all, which is an important building block of any mechanism that supports SI.
Beyond not supporting full SI, it seems Galera cannot even provide "atomic visibility" https://github.com/codership/galera/issues/609#issuecomment-1027720220 This isolation level is far weaker than SI. Essentially, with Galera, a transaction's effects (or writes) cannot be guaranteed to be visible as a whole to another transaction.

sciascid commented 2 years ago

> @sciascid Thanks for your reply. I used the default isolation level -- REPEATABLE READ for MariaDB and SNAPSHOT ISOLATION for Galera Cluster. The above transactions violate Causal Consistency. Since Causal Consistency is weaker than Snapshot Isolation, these transactions also violate Snapshot Isolation. Besides that, according to your explanation, did the above transactions also violate REPEATABLE READ?

As explained in my previous comment, I asked whether you were using Read Committed, because that would have explained the execution posted in the ticket. Since you are not using Read Committed, we still have no explanation of how this happened. In order to better understand the issue, we should probably try to reproduce your execution.

Could you tell me if transactions T1, T2 and T3 were all submitted against the same node in the cluster, or did those run in different nodes? Also, notice that if these transactions were running in different nodes, one or more of them could have been aborted. Can you confirm there were no aborted transactions involved in the execution?

sciascid commented 2 years ago

@20211202na, after skimming through the client code you used for testing, I think that the error handling could potentially explain the results you are observing.

Basically, for each transaction you have a list of operations (either SELECT or UPDATE) which you execute in this loop: https://github.com/20211202na/galera_data_generation/blob/main/galera_data.py#L258 Should one of these operations fail due to an error, you simply keep looping. Notice that in MariaDB/Galera an UPDATE could, for example, be aborted due to multi-primary conflicts, which would result in the whole transaction being aborted. However, the loop keeps going, and the next operation will start a new transaction. Finally, at the end of the loop the transaction is committed, regardless of whether errors occurred: https://github.com/20211202na/galera_data_generation/blob/main/galera_data.py#L284 So it appears that your test driver is not respecting the intended transaction boundaries. In other words, it appears that you are committing partial transactions to the database.
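For illustration, a driver loop that respects transaction boundaries could look roughly like the sketch below. This is not the code from galera_data.py: `run_transaction` is a hypothetical helper, and sqlite3 is used only so that the example runs without a cluster. With a MariaDB driver the same structure would apply, catching that driver's error class instead.

```python
import sqlite3

def run_transaction(conn, operations):
    """Execute all operations as one transaction; return True only if it committed.

    On the first error the transaction is rolled back and the remaining
    operations are skipped, so no partial transaction reaches the database.
    """
    cur = conn.cursor()
    try:
        for sql, params in operations:
            cur.execute(sql, params)
        conn.commit()
        return True
    except sqlite3.Error:  # assumption: swap in the MariaDB driver's Error class
        conn.rollback()
        return False
    finally:
        cur.close()

# Stand-in database with the same key-value layout as in the thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER PRIMARY KEY, v INTEGER NOT NULL)")
conn.executemany("INSERT INTO t VALUES (?, 0)", [(0,), (1,)])
conn.commit()

ok = run_transaction(conn, [
    ("UPDATE t SET v = ? WHERE k = ?", (10, 0)),
    ("UPDATE t SET v = ? WHERE k = ?", (None, 1)),  # fails: NOT NULL constraint
    ("UPDATE t SET v = ? WHERE k = ?", (12, 1)),    # never executed
])
print(ok, conn.execute("SELECT k, v FROM t ORDER BY k").fetchall())
# False [(0, 0), (1, 0)]
```

The key difference from a loop that keeps going after an error is that nothing executed after the failure can be committed under the original transaction's name.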

janlindstrom commented 2 years ago

Galera cannot provide FULL snapshot isolation as defined in the literature because InnoDB does not provide it.

siliunobi commented 2 years ago

> Galera cannot provide FULL snapshot isolation as defined in the literature because InnoDB does not provide it.

Thanks for the clarification! Well, I think, ideally, Galera's consistency guarantee could strengthen, instead of being restricted to, that provided by the underlying DB.

20211202na commented 2 years ago

> @20211202na, after skimming through the client code you used for testing, I think that the error handling could potentially explain the results you are observing.
>
> Basically, for each transaction you have a list of operations (either SELECT or UPDATE) which you execute in this loop: https://github.com/20211202na/galera_data_generation/blob/main/galera_data.py#L258 Should one of these operations fail due to an error, you simply keep looping. Notice that in MariaDB/Galera an UPDATE could, for example, be aborted due to multi-primary conflicts, which would result in the whole transaction being aborted. However, the loop keeps going, and the next operation will start a new transaction. Finally, at the end of the loop the transaction is committed, regardless of whether errors occurred: https://github.com/20211202na/galera_data_generation/blob/main/galera_data.py#L284 So it appears that your test driver is not respecting the intended transaction boundaries. In other words, it appears that you are committing partial transactions to the database.

@sciascid Hi, thanks for your response. If I understand correctly, I turned off autocommit in Line 248, so the whole transaction started at Line 255 will not commit until Line 284. Also, each UPDATE or SELECT operation is wrapped in a try block with an exception handler. If any of these operations is aborted, the whole transaction will not be considered a committed transaction, which is guaranteed by the Boolean e_flag. So I believe all three transactions involved in this issue are fully committed.

20211202na commented 2 years ago

@sciascid It would be great if you could try our tester yourself. Note that, since we are currently doing random testing (e.g., workloads are generated probabilistically), it’s hard (or even impossible) to reproduce a specific anomaly (e.g., with the same key-value pairs read/written). But anomalies manifest with a sufficiently large number of runs (e.g., in our case, several hundred with the parameters in the original post). Plus, more concurrency is expected to give more anomalies, e.g., with more clients, fewer keys, more txns, more ops per txn, etc. Hope this helps!

Our tester considers committed txns only, though the actual execution includes aborted ones.

sciascid commented 2 years ago

> @sciascid Hi, thanks for your response. If I understand correctly, I turned off autocommit in Line 248, so the whole transaction started at Line 255 will not commit until Line 284. Also, each UPDATE or SELECT operation is wrapped in a try block with an exception handler. If any of these operations is aborted, the whole transaction will not be considered a committed transaction, which is guaranteed by the Boolean e_flag. So I believe all three transactions involved in this issue are fully committed.

You are still making changes to the database, even with autocommit off. To give you an example, I'm pretty sure this can be happening with your tester:

START TRANSACTION;
// some random operations here
UPDATE row 1 // Suppose this UPDATE returns an error
// loop continues
UPDATE row 2 // since autocommit is off, this is not committed automatically, server starts a new transaction implicitly
...
// end of loop
COMMIT; // this commits row 2, half of the transaction you intended to have

As you say, because of e_flag, you are not including the transaction in the result... but nonetheless the database has changed. So a subsequent transaction may read this content from the database. This raises the following question: how will your verification code react if it finds a transaction that reads row 2? To which transaction will the corresponding write be attributed when it is observed by another transaction?

> @sciascid It would be great if you could try our tester yourself.

I could try, but the client alone is not very useful. This would give me a trace of the transactions, but how would I tell if the trace violates any consistency criteria?

Also, it would be good to know in which nodes did T1, T2 and T3 execute, did they execute on different nodes in the cluster?

20211202na commented 2 years ago

@sciascid Thanks! Answers inline:

> > @sciascid Hi, thanks for your response. If I understand correctly, I turned off autocommit in Line 248, so the whole transaction started at Line 255 will not commit until Line 284. Also, each UPDATE or SELECT operation is wrapped in a try block with an exception handler. If any of these operations is aborted, the whole transaction will not be considered a committed transaction, which is guaranteed by the Boolean e_flag. So I believe all three transactions involved in this issue are fully committed.
>
> You are still making changes to the database, even with autocommit off. To give you an example, I'm pretty sure this can be happening with your tester:
>
> START TRANSACTION;
> // some random operations here
> UPDATE row 1 // Suppose this UPDATE returns an error
> // loop continues
> UPDATE row 2 // since autocommit is off, this is not committed automatically, server starts a new transaction implicitly
> ...
> // end of loop
> COMMIT; // this commits row 2, half of the transaction you intended to have
>
> As you say, because of e_flag, you are not including the transaction in the result... but nonetheless the database has changed. So a subsequent transaction may read this content from the database. This raises the following question: how will your verification code react if it finds a transaction that reads row 2? To which transaction will the corresponding write be attributed when it is observed by another transaction?

First, the tester builds up a graph based on the committed txns only (each node in the graph is a committed txn). In this case, since we have already aborted the txn having the write (which will not appear in the graph), the read txn on row 2 will not have a write-read relation to the aborted txn.

Second, the tester detects any causal-consistency-specific cycle in the graph built up in the above way. In this case, the above read txn on row 2 will not contribute to the pattern as shown in https://github.com/codership/galera/issues/609#issuecomment-1027720220
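One way to probe the question of attribution is to look for reads whose value was written by no transaction in the committed log. The sketch below is a hypothetical check, not part of the tester; the transaction names and values are made up, and it assumes that every key starts at 0 and that written values are unique.

```python
def unattributed_reads(committed, initial_value=0):
    """Return (txn, key, value) for reads whose value no logged transaction wrote."""
    written = {(k, v) for ops in committed.values() for op, k, v in ops if op == "w"}
    dangling = []
    for txn, ops in committed.items():
        for op, k, v in ops:
            if op == "r" and v != initial_value and (k, v) not in written:
                dangling.append((txn, k, v))
    return dangling

# Hypothetical log: txn "B" reads a value of row 2 that no logged transaction wrote,
# e.g. because the writer was recorded as aborted although part of it was committed.
committed = {
    "A": [("w", 1, 101)],
    "B": [("r", 1, 101), ("r", 2, 202)],
}
print(unattributed_reads(committed))
# [('B', 2, 202)]
```

If such reads show up in a real trace, they would suggest that writes from transactions recorded as aborted did reach the database; an empty result would support the claim that only fully committed transactions are involved.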

> > @sciascid It would be great if you could try our tester yourself.
>
> I could try, but the client alone is not very useful. This would give me a trace of the transactions, but how would I tell if the trace violates any consistency criteria?

See https://github.com/20211202na/galera_data_generation for the instruction. Note that our tester currently only tells if a trace violates transactional causal consistency (TCC) or not. For weaker consistency properties (e.g., RR) than TCC, one has to manually examine the counterexample.

> Also, it would be good to know in which nodes did T1, T2 and T3 execute, did they execute on different nodes in the cluster?

They were executed on the same cluster node.

sciascid commented 2 years ago

@20211202na thanks for your clarifications, I will try to reproduce the issue.

20211202na commented 2 years ago

Great! Let me know if you have any questions! Thanks!

sciascid commented 2 years ago

@20211202na I tried the test, and managed to reproduce something. The output of oopsla_txn_graph.py gives me:

reach key is: 0     
BP111111 found in: 0
reach key is: 0     
BP222222 found in: 0

Followed by lots and lots of numbers:

0               
0 1             
1               
0 1 2           
1 2             
2               
0 1 2 3         
1 2 3           
2 3             
3               
0 1 2 3 4       
1 2 3 4         
2 3 4           
3 4             
4               
0 1 2 3 4 5     
1 2 3 4 5       
2 3 4 5         
3 4 5           
4 5             
5               
6 0 1 2 3 4 5   
0 1 2 3 4 5 7   
1 2 3 4 5 7     
2 3 4 5 7       
3 4 5 7         
4 5 7           
5 7             
6 0 1 2 3 4 5 7 
...
# ~300K more lines here
...
None

Like that... for a total ~300K lines of output. Is that expected? It is not clear how to interpret the output. What is BP111111, BP222222, and reach key?

20211202na commented 2 years ago

> @20211202na I tried the test, and managed to reproduce something. The output of oopsla_txn_graph.py gives me:
>
> reach key is: 0
> BP111111 found in: 0
> reach key is: 0
> BP222222 found in: 0
>
> Like that... for a total ~300K lines of output. Is that expected? It is not clear how to interpret the output. What is BP111111, BP222222, and reach key?

@sciascid The output looks unusual. I think there might be a problem when normalizing the data. Before running group_data.py, did you modify the parameter ops_per_trans in https://github.com/20211202na/galera_data_generation/blob/main/group_data.py#L21 ? If not, it may cause errors and affect the algorithm's results. Also, can you upload the generated results? I can then check whether the problem occurred during normalization.

BP11 and BP22 refer to two different bad patterns (or cycles) found in the given transactions. Due to the way the algorithm is implemented, we cannot directly print out the detected cycles; the reach key refers to the last transaction detected in the cycle. So, once you have the reach key, you can change the last line in https://github.com/20211202na/galera_data_generation/blob/main/oopsla_txn_graph.py#L230 to print the related transactions in the cycle.

sciascid commented 2 years ago

@20211202na the problem was indeed due to the ops_per_trans parameter. I have since made sure that the parameter matches in the two files.

I managed to reproduce some cases. For example this one:

reach key is: 295       
BP222222 found in: 0    
898                     
295 898                 
898 296                 
295 898 296             
296                     
898 296 297             
295 898 296 297         
296 297                 
297                     
295                     
None

But I haven't yet analyzed it in detail. I presume these are transaction ids, but I'm not sure how to interpret the output. Could you explain what this output means?

I managed to get this from a single MariaDB node, with Galera replication disabled.

sciascid commented 2 years ago

For reference, I'm pasting the involved transactions below.

295:

w(1,3276,0,295) 
w(17,3277,0,295)
w(7,3278,0,295) 
w(2,3279,0,295) 
w(1,3280,0,295) 
w(1,3281,0,295) 
w(1,3282,0,295) 
w(2,3283,0,295) 
w(1,3284,0,295) 
w(4,3285,0,295) 
w(8,3286,0,295) 
w(7,3287,0,295) 
w(0,3288,0,295) 
w(7,3289,0,295) 
w(2,3290,0,295) 
w(0,3291,0,295) 
w(1,3292,0,295) 
w(7,3293,0,295) 
w(0,3294,0,295) 
w(1,3295,0,295) 
w(0,3296,0,295) 
w(8,3297,0,295) 
w(1,3298,0,295) 
w(10,3299,0,295)
w(19,3300,0,295)

296:

r(13,3250,0,296)
r(8,3297,0,296)
r(2,3290,0,296)
r(1,3298,0,296)
r(0,33324,0,296)
r(0,33324,0,296)
r(19,33328,0,296)
r(3,3248,0,296)
r(0,33324,0,296)
r(2,3290,0,296)
r(4,3285,0,296)
r(3,3248,0,296)
r(1,3298,0,296)
r(12,33326,0,296)
r(0,33324,0,296)
r(6,93382,0,296)
r(0,33324,0,296)
r(3,3248,0,296)
r(13,3250,0,296)
r(5,3242,0,296)
r(0,33324,0,296)
r(2,3290,0,296)
r(7,3293,0,296)
r(0,33324,0,296)
r(11,33276,0,296)

297:

r(1,3298,0,297)
r(16,73188,0,297)
r(0,33324,0,297)
r(0,33324,0,297)
r(0,33324,0,297)
r(19,33328,0,297)
r(5,3242,0,297)
r(0,33324,0,297)
r(16,73188,0,297)
r(0,33324,0,297)
r(5,3242,0,297)
r(6,93382,0,297)
r(16,73188,0,297)
w(9,3326,0,297)
w(7,3327,0,297)
w(0,3328,0,297)
w(10,3329,0,297)
w(1,3330,0,297)
w(19,3331,0,297)
w(0,3332,0,297)
w(0,3333,0,297)
w(0,3334,0,297)
w(4,3335,0,297)
w(0,3336,0,297)
w(5,3337,0,297)

898:

r(3,3248,3,898)
r(5,3242,3,898)
r(3,3248,3,898)
r(1,3240,3,898)
r(1,3240,3,898)
r(0,3243,3,898)
r(2,3247,3,898)
r(1,3240,3,898)
r(10,3244,3,898)
r(15,3249,3,898)
r(0,3243,3,898)
r(1,3240,3,898)
r(0,3243,3,898)
r(2,3247,3,898)
w(0,33318,3,898)
w(0,33319,3,898)
w(0,33320,3,898)
w(0,33321,3,898)
w(15,33322,3,898)
w(0,33323,3,898)
w(0,33324,3,898)
w(14,33325,3,898)
w(12,33326,3,898)
w(9,33327,3,898)
w(19,33328,3,898)

The only thing I can conclude from this is that 295, 296, 297 come from the same client. Therefore 295 -> 296 -> 297.
297 reads the version of key 0 written by 898.
296 reads the version of key 12 written by 898.
Therefore 898 -> 297, and 898 -> 296.

But that's not enough to get a cycle.
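The same conclusion can be derived mechanically from the pasted operations. The sketch below parses a short excerpt of them (not the full transactions), assuming the format is `op(key,value,client,txn)` and that transaction ids of one client grow over time; it then collects session-order and write-read edges and checks them for a cycle with the standard-library `graphlib` module.

```python
import re
from collections import defaultdict
from graphlib import CycleError, TopologicalSorter

OP = re.compile(r"([rw])\((\d+),(\d+),(\d+),(\d+)\)")

# Excerpt of the operations pasted above (not the full transactions).
trace = """
w(1,3298,0,295) w(19,3300,0,295)
r(1,3298,0,296) r(12,33326,0,296) r(0,33324,0,296)
r(1,3298,0,297) r(0,33324,0,297) w(0,3336,0,297)
r(1,3240,3,898) w(0,33324,3,898) w(12,33326,3,898)
"""

ops = [(kind, int(k), int(v), int(c), int(t)) for kind, k, v, c, t in OP.findall(trace)]
edges = set()

# Session order: consecutive transactions of the same client.
by_client = defaultdict(set)
for _, _, _, client, txn in ops:
    by_client[client].add(txn)
for txn_ids in by_client.values():
    ordered = sorted(txn_ids)
    edges.update(zip(ordered, ordered[1:]))

# Write-read order: values are unique per key, so (key, value) identifies the writer.
writer = {(k, v): t for kind, k, v, _, t in ops if kind == "w"}
for kind, k, v, _, t in ops:
    if kind == "r" and (k, v) in writer and writer[(k, v)] != t:
        edges.add((writer[(k, v)], t))

# A topological order exists exactly when the edge set has no cycle.
sorter = TopologicalSorter()
for before, after in edges:
    sorter.add(after, before)
try:
    sorter.prepare()
    acyclic = True
except CycleError:
    acyclic = False

print(sorted(edges))
# [(295, 296), (295, 297), (296, 297), (898, 296), (898, 297)]
print(acyclic)
# True
```

On this excerpt the edges are acyclic, which matches the observation that these four transactions alone do not form a cycle.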

20211202na commented 2 years ago

> The only thing I can conclude from this is that 295, 296, 297 come from the same client. Therefore 295 -> 296 -> 297.
> 297 reads the version of key 0 written by 898.
> 296 reads the version of key 12 written by 898.
> Therefore 898 -> 297, and 898 -> 296.
>
> But that's not enough to get a cycle.

@sciascid These 4 txns are the vital clue for finding the cycle, but they may not form the cycle themselves. Can you please share the whole result folder with me (including all result-node.txt files and the final result.txt)? I can then find the cycle, as well as the associated txns/operations, for you (by debugging 'adj_map'). Meanwhile, let me automate this process.

sciascid commented 2 years ago

@20211202na You can find all the files here: https://gist.github.com/sciascid/aae12c130bdcfe1930601f19ce0f29d5/archive/6b214d707fdf0fb72b2505955c0e23d38717db6d.zip

20211202na commented 2 years ago

@sciascid Three findings from your data (@janlindstrom you may also find point 3 interesting):

1. The cycle between Txn297 and Txn295. Txn297 ->co Txn295 because (i) Txn295 ->wr_4 Txn393; (ii) Txn297 ->wr Txn393; and (iii) Txn297 also writes key 4.

[attached image: WechatIMG793]

2. This case also violates "read atomicity", since Txn297 can reach Txn393 by a single write-read relation.

3. By carefully examining the data, we even found a "repeatable read" anomaly, with a strange read by Txn393 fetching value 3328 of key 0. See below. Note that both values (3328 resp. 33324) were written by committed txns (Txn297 resp. Txn898).
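A "repeatable read" anomaly of this kind can be detected from a single transaction's operation list. The following is a rough sketch with a hypothetical helper; the operation list is illustrative only (the actual operations of Txn393 are not shown in this thread), using the two values of key 0 mentioned above.

```python
def non_repeatable_reads(ops):
    """Return (key, first_value, later_value) for reads of one key that disagree.

    Keys the transaction has written itself are ignored from then on, since
    later reads are expected to return the transaction's own write.
    """
    first_read = {}
    own_writes = set()
    anomalies = []
    for op, key, value in ops:
        if op == "w":
            own_writes.add(key)
        elif key not in own_writes:
            if key in first_read and first_read[key] != value:
                anomalies.append((key, first_read[key], value))
            first_read.setdefault(key, value)
    return anomalies

# Illustrative operations only: two reads of key 0 returning different committed values.
example_ops = [("r", 0, 33324), ("r", 4, 3285), ("r", 0, 3328)]
print(non_repeatable_reads(example_ops))
# [(0, 33324, 3328)]
```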

sciascid commented 2 years ago

@20211202na Hi, thanks for the detailed analysis. Given that these findings were based on the results of running your tool on MariaDB with Galera disabled, I'm afraid we can't do much about it. If the underlying database / storage engine allows for a certain anomaly (such as a violation of read atomicity or of repeatable read, as you suggested in your last comment), then we expect to observe the same anomalies when Galera replication is enabled. In other words, Galera does not offer isolation guarantees stronger than those offered by the underlying storage engine. And when Galera is enabled, we do not interfere with how the storage engine handles reads, or with how read views or snapshots are assigned to a transaction. This is by design, as we want Galera to be as transparent as possible to applications, so that applications written for MariaDB/MySQL work out of the box with Galera. Having said that, I also repeated the same tests using MySQL instead of MariaDB, and your tool has not reported any failures so far. This might suggest that there is some difference between MySQL and MariaDB. Therefore, I think this issue should be reported to MariaDB.

siliunobi commented 2 years ago

@sciascid Thanks for your further investigation! Some comments inline:

> In other words, Galera does not offer isolation guarantees stronger than those offered by the underlying storage engine.

This is confusing. Correct me if I'm wrong. For example, Galera provides SI, a stronger consistency guarantee than RR, and therefore supplies the extra strength beyond RR. Similarly, Galera is also expected to guarantee transactional causal consistency (weaker than SI but stronger than RR), and, again, Galera seems to be the one that implements the additional guarantee.

So, IF what you said is really what is happening in Galera, then claiming Snapshot Isolation makes less sense, despite the explanation https://github.com/codership/galera/issues/609#issuecomment-1028874477

But it makes sense that any anomaly allowed by the underlying DB engine would probably manifest at "the Galera level".

> Having said that, I also repeated the same tests using MySQL instead of MariaDB, and your tool has not reported any failures so far. This might suggest that there is some difference between MySQL and MariaDB. Having said that, I think this issue should be reported to MariaDB.

Thanks again! We are planning tests on MariaDB as well, perhaps after the above confusion is resolved.

sciascid commented 2 years ago

@siliunobi Hi, I'm sorry for the confusion. I'll try to clarify here. We do not claim that Galera provides Snapshot Isolation. Comment 609, and the linked page (https://galeracluster.com/library/training/tutorials/supporting-transaction-isolation-levels.html) to our documentation states: "So, if you have configured the default REPEATABLE READ isolation, transactions issued on the same node will behave under REPEATABLE READ semantics. However, for transactions issued on separate cluster nodes, the ‘first committer wins’ rule of SNAPSHOT ISOLATION is provided" and a bit later: "Therefore, it is not safe for the application to rely on SNAPSHOT ISOLATION semantics"

Snapshot isolation is only mentioned because we do use "first committer wins" rule when certifying concurrent transaction that executed in different nodes of the cluster. But that alone is not sufficient to provide SI. Hence, the documentation clearly states "it is not safe for the application to rely on SNAPSHOT ISOLATION semantics".
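To illustrate why "first committer wins" alone says nothing about reads, here is a toy certifier. It is a simplified sketch under stated assumptions, not Galera's actual certification code: commits are totally ordered by a sequence number, and `Certifier`, `begin`, and `certify` are hypothetical names.

```python
class Certifier:
    """Toy certifier: commits are totally ordered by a sequence number."""

    def __init__(self):
        self.seqno = 0
        self.last_writer = {}  # key -> seqno of the last committed write

    def begin(self):
        """Return the snapshot position a new transaction starts from."""
        return self.seqno

    def certify(self, snapshot, write_set):
        """Commit unless a key in write_set was committed after the snapshot."""
        if any(self.last_writer.get(k, 0) > snapshot for k in write_set):
            return False  # a concurrent transaction committed first
        self.seqno += 1
        for k in write_set:
            self.last_writer[k] = self.seqno
        return True

c = Certifier()
s1, s2 = c.begin(), c.begin()     # two concurrent transactions
print(c.certify(s1, {0, 1}))      # True: first committer
print(c.certify(s2, {1}))         # False: conflicts on key 1
print(c.certify(c.begin(), {1}))  # True: started after the first commit
```

The rule only inspects write sets; which versions a transaction was allowed to read never enters the decision, so read-side guarantees such as atomic visibility or causality have to come from elsewhere.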

I think we do not claim Snapshot Isolation anywhere in the documentation, but I will double check now.

Thanks, and let me know if this clarifies things.

siliunobi commented 2 years ago

@sciascid Something worth considering:

https://galeracluster.com/library/training/tutorials/supporting-transaction-isolation-levels.html

Certainly, there's a claim "Galera Cluster provides SNAPSHOT ISOLATION between transactions running on separate cluster nodes.", the first sentence in that paragraph, though further explanation (like what you pasted) is added later. This is confusing. If you google, you will find that many people have been misled :) Perhaps Galera could consider making it precise.

> we do use "first committer wins" rule when certifying concurrent transaction that executed in different nodes of the cluster.

Well, if Galera only wants to make the point that "first committer wins" is used to resolve conflicts, I don't think it's necessary to bring SI in, though SI typically uses it. Correct me if I'm wrong. "first committer wins" is consistency-model-agnostic; one could even implement the weakest consistency with it.

Thanks!

sciascid commented 2 years ago

@siliunobi I totally agree with you, it is confusing. I will propose to review and improve all explanations related to transaction isolation in the documentation. Thanks for the valuable feedback!

siliunobi commented 2 years ago

@sciascid Sure. Thank YOU, especially for the investigation on MariaDB! I'll keep you posted on the testing progress for that.

Regarding SI, something to add:

Essentially, SI has 3 main building blocks: write-write conflict, atomic visibility, and causality.

"write-write conflict" is relevant to "first committer wins" we have discussed. I found a previous issue report on this topic https://github.com/codership/galera/issues/336 The remaining two building blocks are exactly what our testing focuses on.

codership / galera

Transactions on MariaDB-Galera violated Causal Consistency #609