bloomberg / comdb2

Bloomberg's distributed RDBMS

Serializability issue when running Concurrent TPCC #559

Open saatviks opened 6 years ago

saatviks commented 6 years ago

I have Comdb2 running the TPCC benchmark on OLTPBench when a single transaction is submitted at a time. When multiple transactions are submitted concurrently, however, the returned results contain errors. Transactions run at the READ COMMITTED isolation level. Currently 2 of the 5 TPCC procedures (OrderStatus and StockLevel) succeed under concurrent submission; the other 3 fail. Is there a setting I need to tune for concurrency?

I also tried the SERIALIZABLE isolation level. Depending on the transaction, I now get different types of errors. Here are some:

  1. java.sql.SQLException: [RC = 2] Table 'new_order' unable to delete genid =5a23671f80890002 rc=4
  2. java.sql.SQLIntegrityConstraintViolationException: [RC = -103] selectv constraints
  3. java.sql.SQLException: [RC = 2] unable to update record rc = 4
  4. java.sql.SQLException: [RC = 230] transaction is not serializable

akshatsikarwar commented 6 years ago

I hope you reviewed the transaction docs I sent you earlier.

There really is only one way to address this: retry the transaction.

Depending on how you run, either the system can automatically retry for you, or your application will have to. Consider the following:

create table t (i int)
insert into t values(1)

Assume also that the rowid of the row above is 0.

Now two transactions concurrently run: update t set i = i + 1. Both statements identify rowid 0 and ask the master to set the new value. The first transaction to commit writes the new value and bumps the rowid to 1. The second transaction fails to find rowid 0 and fails with a verify error. In this case comdb2 reruns your update stmt, finds rowid 1, and updates it. This is called verifyretry.
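The verify-then-rerun behaviour described above can be modelled with a small toy in Java. This is only a sketch of the idea, not Comdb2 code: the class VerifyRetrySketch, the Row type, and the methods read, commit, and increment are hypothetical names, and a compare-and-set on an in-memory reference stands in for the master verifying the rowid.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of verify-and-retry; all names here are hypothetical, not Comdb2 APIs. */
final class VerifyRetrySketch {
    /** One version of the single row in table t: its rowid and the value of column i. */
    static final class Row {
        final long rowid;
        final int value;
        Row(long rowid, int value) { this.rowid = rowid; this.value = value; }
    }

    // Starts as rowid 0 with i = 1, matching the example in the text.
    private final AtomicReference<Row> current = new AtomicReference<>(new Row(0, 1));

    Row read() { return current.get(); }

    /** The "master" applies the write only if the version the writer saw is still current. */
    boolean commit(Row seen, int newValue) {
        return current.compareAndSet(seen, new Row(seen.rowid + 1, newValue));
    }

    /** Runs "i = i + 1" from a possibly stale read; re-reads and reruns only if allowed. */
    boolean increment(Row seen, boolean verifyRetry) {
        while (!commit(seen, seen.value + 1)) {
            if (!verifyRetry) return false; // verify error is reported to the caller
            seen = read();                  // rerun the statement against the new version
        }
        return true;
    }

    public static void main(String[] args) {
        VerifyRetrySketch t = new VerifyRetrySketch();
        Row a = t.read();
        Row b = t.read();                 // both transactions see rowid 0
        t.increment(a, true);             // first commit wins, rowid becomes 1
        t.increment(b, true);             // verify fails once, then the rerun succeeds
        System.out.println(t.read().value); // 3
    }
}
```

Passing false for verifyRetry in the second call models the cases where the retry is disabled: the call returns false and the value stays at 2.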

Running under snapshot or serializable isolation disables verifyretry (a rerun of the query would not see new rows; that's the whole point of snapshot).

Running a selectv stmt also disables verifyretry: the whole point is for the application to be notified of the verify error.

It is also possible to disable verifyretry explicitly. Run the stmt: set verifyretry off. If you do this, the second transaction from the example above will fail with a verify error.

Verifyretry is also disabled if you want to see per-stmt effects (as opposed to per-transaction effects). That may be the default for JDBC, or maybe that is how you are running it.

For most use cases, it is sufficient to run with the default isolation level and without selectv (i.e. there is no need to selectv the same rows that the update stmt will operate on). In the default isolation level, a transaction can't see its own side effects.

In any case, when the system cannot retry automatically, it is up to the application to handle verify errors and rerun the transaction or take other action.
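For the application-side case, a retry wrapper could look roughly like the Java sketch below. The names TxnRetry, Txn, and runWithRetry are hypothetical. It assumes the driver exposes the RC value through SQLException.getErrorCode(), and that RC 2 (the verify errors quoted earlier in this thread) and RC 230 (not serializable) are the codes worth retrying; neither assumption is confirmed here.

```java
import java.sql.SQLException;
import java.util.Set;

/** Hypothetical application-level retry helper; not part of the Comdb2 JDBC driver. */
final class TxnRetry {
    /** One whole transaction: begin, statements, commit (and rollback on failure). */
    interface Txn<T> {
        T run() throws SQLException;
    }

    // ASSUMPTION: codes copied from the errors quoted in this thread
    // (2 = verify error, 230 = transaction is not serializable).
    static final Set<Integer> RETRYABLE = Set.of(2, 230);

    /** Reruns the transaction on retryable errors, up to maxAttempts tries in total. */
    static <T> T runWithRetry(Txn<T> txn, int maxAttempts) throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return txn.run();
            } catch (SQLException e) {
                if (!RETRYABLE.contains(e.getErrorCode())) {
                    throw e; // not a conflict we know how to retry
                }
                last = e;    // remember the conflict and try again
            }
        }
        throw last; // every attempt hit a retryable error
    }
}
```

The transaction body has to be safe to rerun from the start, since each attempt executes all of its statements again.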

riverszhang89 commented 6 years ago

The JDBC driver does not disable `verifyretry` by default.

saatviks commented 6 years ago

Since OLTPBench requires statement-level query effects for many of its assertions, the only solution is to have the application (OLTPBench in this case) do the retries. However, setting this up might require a lot of changes in OLTPBench itself. For now we have settled on limiting concurrency to 15 worker threads submitting transactions instead of 64, in which case the errors are not seen.

saatviks commented 6 years ago

After tuning the database and getting much faster performance, we started getting the following warning from the OLTPBench/Comdb2 JDBC driver from time to time (~5-10 times per TPCC run):

[java] 03:51:25,673 (Worker.java:484) WARN  - The DBMS rejected the transaction without an error code
     [java] java.sql.SQLIntegrityConstraintViolationException: [RC = -103] constraints error, no genid
     [java]     at com.bloomberg.comdb2.jdbc.Comdb2Connection.createSQLException(Comdb2Connection.java:736)
     [java]     at com.bloomberg.comdb2.jdbc.Comdb2Connection.commit(Comdb2Connection.java:357)
     [java]     at com.oltpbenchmark.benchmarks.tpcc.TPCCWorker.executeWork(TPCCWorker.java:81)
     [java]     at com.oltpbenchmark.api.Worker.doWork(Worker.java:380)
     [java]     at com.oltpbenchmark.api.Worker.run(Worker.java:290)
     [java]     at java.lang.Thread.run(Thread.java:748)

We had been getting similar warnings, and even worse errors, at a higher transaction concurrency (~64). Reducing the concurrency greatly reduces the probability of this error occurring, which is why we reduced it to 15. It is not a severe error, which is why OLTPBench only issues a warning, but we are not sure whether benchmark results obtained with such a warning are valid. The problem is that the only way to reduce this warning/error is to reduce the concurrency on the OLTPBench side (since retry logic is not set up in it), which in a way defeats the purpose of tuning it (#573).

Any suggestions, other than setting up retries, to prevent this from happening?

P.S. We plan on discussing this with our course instructor too.

riverszhang89 commented 6 years ago

Akshat was right and I was wrong: invoking executeUpdate() does imply verifyretry off, and it will give you those errors when there are conflicting transactions. What I can do is add a JDBC URL parameter for verifyretry, just like we did for statement_query_effects, so that you can toggle it easily.

saatviks commented 6 years ago

Oh thanks, that would be really good. Then we can simply enable retrying on the Comdb2 end? Would we be able to use this with both SELECTV and statement_query_effects? And since our assignment deadline is this Thursday, would it be possible to add this feature by then?

riverszhang89 commented 6 years ago

No problem - I will get you a patch today.

riverszhang89 commented 6 years ago

@saatviks I've checked in my patch. To enable it, use:

jdbc:comdb2://<host>/db?verify_retry=1

You can use it with statement_query_effects and any other URL parameters. It will work with SELECTV as long as the records selected by SELECTV are not modified by other transactions.

saatviks commented 6 years ago

Thanks @riverszhang89, I pulled in the latest commit and reinstalled Comdb2. I changed the JDBC URL as follows: jdbc:comdb2://localhost/tpcc?statement_query_effects=1;verify_retry=1 However, I now immediately get the error below when using just 15 worker threads. It's not in the procedure using SELECTV:

[java] java.lang.RuntimeException: java.lang.RuntimeException: Failed to update ORDER record [W_ID=5, D_ID=1, O_ID=2413]
     [java]     at com.oltpbenchmark.api.Worker.doWork(Worker.java:496)
     [java]     at com.oltpbenchmark.api.Worker.run(Worker.java:290)
     [java]     at java.lang.Thread.run(Thread.java:748)
     [java] Caused by: java.lang.RuntimeException: Failed to update ORDER record [W_ID=5, D_ID=1, O_ID=2413]
     [java]     at com.oltpbenchmark.benchmarks.tpcc.procedures.Delivery.run(Delivery.java:183)
     [java]     at com.oltpbenchmark.benchmarks.tpcc.TPCCWorker.executeWork(TPCCWorker.java:74)
     [java]     at com.oltpbenchmark.api.Worker.doWork(Worker.java:380)
     [java]     ... 2 more
     [java] 21:07:58,266 (Delivery.java:182) WARN  - Failed to update ORDER record [W_ID=5, D_ID=1, O_ID=2413]
     [java] 21:07:58,266 (Worker.java:495) ERROR - Fatal error when invoking Delivery/04
     [java] java.lang.RuntimeException: Failed to update ORDER record [W_ID=5, D_ID=1, O_ID=2413]
     [java]     at com.oltpbenchmark.benchmarks.tpcc.procedures.Delivery.run(Delivery.java:183)
     [java]     at com.oltpbenchmark.benchmarks.tpcc.TPCCWorker.executeWork(TPCCWorker.java:74)
     [java]     at com.oltpbenchmark.api.Worker.doWork(Worker.java:380)
     [java]     at com.oltpbenchmark.api.Worker.run(Worker.java:290)
     [java]     at java.lang.Thread.run(Thread.java:748)

Edit: I think I was getting a similar error when statement query effects was disabled. Additionally, if I remove verify_retry=1 from the URL, things still work as they did earlier (i.e. with the genid warnings).

saatviks commented 6 years ago

Also, this line from the commit is flooding the server process output: fprintf(stderr, "clnt is %p, sql is %s\n", clnt, clnt->sql);

riverszhang89 commented 6 years ago

You need to change the semicolon to an ampersand:

jdbc:comdb2://localhost/tpcc?statement_query_effects=1&verify_retry=1

And yes, the debug print was a dumb mistake. I have already removed it. Sorry about that.

saatviks commented 6 years ago

I did try that first but then got the following error: Caused by: org.xml.sax.SAXParseException; systemId: file:/home/ubuntu/extracredit/oltpbench/config/tpcc_config_comdb2.xml; lineNumber: 7; columnNumber: 79; The reference to entity "verify_retry" must end with the ';' delimiter.

saatviks commented 6 years ago

I think I've got it to work. The URL has to be wrapped in CDATA so that the XML parser accepts the ampersand: <DBUrl><![CDATA[jdbc:comdb2://localhost/tpcc?statement_query_effects=1&verify_retry=1]]></DBUrl>
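The parse error comes from XML treating a bare & as the start of an entity reference. The sketch below uses the JDK's DOM parser to show that both the CDATA form and the &amp; escaped form yield the same URL text. DbUrlXmlSketch and readDbUrl are hypothetical names, and OLTPBench's actual config loader may use a different parser, so this is only an illustration of the XML rule.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical helper showing two ways to carry '&' in an XML element. */
final class DbUrlXmlSketch {
    /** Parses a small XML fragment and returns the text of its first DBUrl element. */
    static String readDbUrl(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("DBUrl").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Form 1: CDATA section, the ampersand is taken literally.
        String cdata = "<DBUrl><![CDATA[jdbc:comdb2://localhost/tpcc"
                + "?statement_query_effects=1&verify_retry=1]]></DBUrl>";
        // Form 2: the ampersand written as the predefined entity &amp;.
        String escaped = "<DBUrl>jdbc:comdb2://localhost/tpcc"
                + "?statement_query_effects=1&amp;verify_retry=1</DBUrl>";

        System.out.println(readDbUrl(cdata));
        System.out.println(readDbUrl(cdata).equals(readDbUrl(escaped))); // true
    }
}
```

Either form should give the driver the URL with a plain ampersand between the two parameters.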

saatviks commented 6 years ago

So with a combination of verify_retry=1 and converting the SELECTV statements to SELECT, the warnings rarely happen with 15 worker threads (they still occur from time to time). I also tried increasing the concurrency on the OLTPBench side to 64 (the value used for other DBs). However, this almost immediately throws a number of warnings and exceptions. Here are some of them:

[java] java.lang.RuntimeException: java.lang.RuntimeException: W_ID=3 not found!
     [java]     at com.oltpbenchmark.api.Worker.doWork(Worker.java:496)
     [java]     at com.oltpbenchmark.api.Worker.run(Worker.java:290)
     [java]     at java.lang.Thread.run(Thread.java:748)
     [java] Caused by: java.lang.RuntimeException: W_ID=3 not found!
     [java]     at com.oltpbenchmark.benchmarks.tpcc.procedures.Payment.run(Payment.java:182)
     [java]     at com.oltpbenchmark.benchmarks.tpcc.TPCCWorker.executeWork(TPCCWorker.java:74)
     [java]     at com.oltpbenchmark.api.Worker.doWork(Worker.java:380)
     [java]     ... 2 more
     [java] 00:06:00,570 (Worker.java:495) ERROR - Fatal error when invoking Payment/02
     [java] java.lang.RuntimeException: W_ID=3 not found!
     [java]     at com.oltpbenchmark.benchmarks.tpcc.procedures.Payment.run(Payment.java:182)
     [java]     at com.oltpbenchmark.benchmarks.tpcc.TPCCWorker.executeWork(TPCCWorker.java:74)
     [java]     at com.oltpbenchmark.api.Worker.doWork(Worker.java:380)
     [java]     at com.oltpbenchmark.api.Worker.run(Worker.java:290)
     [java]     at java.lang.Thread.run(Thread.java:748)

Since the retrying should be happening, what might be the issue?

riverszhang89 commented 6 years ago

It seems that OLTPBench creates 64 loader threads, and then starts 64 worker threads without waiting for all of the loader threads to finish. Since you mentioned that your testbed is a c4.4xlarge EC2 instance which has 16 cores, it is possible that a loader thread gets preempted and scheduled after its corresponding worker thread, which then results in the W_ID not found error in the worker thread.

Can you apply the patch below to OLTPBench, rebuild it and run the TPCC benchmark again?

diff --git a/src/com/oltpbenchmark/api/Loader.java b/src/com/oltpbenchmark/api/Loader.java
index ff61bed..2f82bfd 100644
--- a/src/com/oltpbenchmark/api/Loader.java
+++ b/src/com/oltpbenchmark/api/Loader.java
@@ -108,6 +108,10 @@ public abstract class Loader<T extends BenchmarkModule> {
         for (LoaderThread t : threads) {
             t.run();
         }
+
+        for (LoaderThread t : threads) {
+            t.join();
+        }
     }

saatviks commented 6 years ago

Yup, I tried that but then Oltpbench fails to build with the following error:

  [javac] /home/ubuntu/extracredit/oltpbench/src/com/oltpbenchmark/api/Loader.java:112: error: cannot find symbol
    [javac]         t.join();
    [javac]          ^
    [javac]   symbol:   method join()
    [javac]   location: variable t of type Loader<T>.LoaderThread
    [javac]   where T is a type-variable:
    [javac]     T extends BenchmarkModule declared in class Loader
    [javac] 1 error

One thing though: I run OLTPBench myself in two separate steps:

  1. Create the tables and load data into them (--create=true --load=true --execute=false)
  2. Execute the benchmark (--create=false --load=false --execute=true)

riverszhang89 commented 6 years ago

My mistake: LoaderThread isn't a Thread instance but a Runnable instance.

There's a quick way to do this: can you please add --clear=false to your 1st command? It looks like --clear defaults to true, which means all data will be cleared after the 1st command.
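On the Runnable point: a Runnable has no join() method, so waiting for such tasks means running each one on a Thread (or an executor) and joining that. The Java sketch below shows one way to do it. LoaderJoinSketch and runAllAndWait are hypothetical names and this is not the OLTPBench Loader code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: run Runnable tasks in parallel and wait for all of them. */
final class LoaderJoinSketch {
    /** Starts one thread per task and returns only after every task has finished. */
    static void runAllAndWait(List<? extends Runnable> tasks) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (Runnable task : tasks) {
            Thread t = new Thread(task);
            t.start(); // start() runs the task on a new thread; run() would execute it inline
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join();  // join() exists on Thread, not on Runnable
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Runnable> tasks = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            int id = i;
            tasks.add(() -> System.out.println("loader " + id + " done"));
        }
        runAllAndWait(tasks);
        System.out.println("all loaders finished");
    }
}
```

With this shape, anything that runs after runAllAndWait returns can rely on every task having completed.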

saatviks commented 6 years ago

That 'true' seems to be referring to the hasArg option - from here.

Edit: I don't think the data is getting cleared before execution; otherwise many assertions would have failed, I think.

riverszhang89 commented 6 years ago

Okay. I don't think I understand this then. The "W_ID=3 not found!" error was thrown by the update below:

UPDATE WAREHOUSE SET W_YTD = W_YTD + ? WHERE W_ID = 3

which can only mean that there was no record whose W_ID is 3. But shouldn't the record exist? The loader should already have inserted it into the table in your 1st command.

saatviks commented 6 years ago

Exactly, I was confused by the same thing. But when I reduce the concurrency of the worker/executor threads from 64 to <=15, this error no longer occurs. This is why I thought the error might be due to READ COMMITTED not being strictly enforced somewhere in the OCC, or to the retry issue explained at the start of this thread.

saatviks commented 6 years ago

Also, when I run this command manually using cdb2sql tpcc "UPDATE WAREHOUSE SET W_YTD = W_YTD + 5 WHERE W_ID = 3", it runs successfully and gives me (rows updated=1).

riverszhang89 commented 6 years ago

> But the thing is that when I reduced the concurrency of Worker/Executor threads from 64 to <=15, this error does not occur any more

Looks like the verify_retry=1 thing did not help you. We're back to square one now 😟. I will read up on OLTPBench in order to better understand it. I don't have an answer for you right now, but I'd be shocked if we somehow did not honor read committed.

saatviks commented 6 years ago

For now, I'll submit the TPCC results for the assignment with 15 worker threads, since tomorrow night is the deadline. I will be happy to coordinate later on doing it correctly with the 64 worker threads needed for TPCC. Thanks for helping out so much on this.