cmu-db / peloton

The Self-Driving Database Management System
http://pelotondb.io
Apache License 2.0
2.03k stars 624 forks source link

SerializableTransactionTests.ReadOnlyTransactionTest failing intermittently #1427

Open tli2 opened 6 years ago

tli2 commented 6 years ago

Several complaints about this failing from time to time on Travis. Might be a Mac issue, might be that the test is broken, might be that we broke something in CC.

Investigation would be appreciated.

mbutrovich commented 6 years ago

Here's an example of how it fails, and on rerun it typically passes. Maybe some nondeterminism in how stuff is being scheduled:

        Start  46: serializable_transaction_test

46: Test command: /Users/travis/build/cmu-db/peloton/build/test/serializable_transaction_test "--gtest_color=yes" "--gtest_output=xml:/Users/travis/build/cmu-db/peloton/build/test/serializable_transaction_test.xml"

46: Environment variables: 

46:  LSAN_OPTIONS=suppressions=/Users/travis/build/cmu-db/peloton/test/leak_suppr.txt

46: Test timeout computed to be: 9.99988e+06

46: Running main() from gmock_main.cc

46: [==========] Running 6 tests from 1 test case.

46: [----------] Global test environment set-up.

46: [----------] 6 tests from SerializableTransactionTests

46: [ RUN      ] SerializableTransactionTests.TransactionTest

46: [       OK ] SerializableTransactionTests.TransactionTest (36 ms)

46: [ RUN      ] SerializableTransactionTests.ReadOnlyTransactionTest

46: 2018-06-25 20:34:00 [test/concurrency/testing_transaction_util.cpp:192:CreateTable] INFO  - the database_id is 16777216

46: 2018-06-25 20:34:00 [test/concurrency/serializable_transaction_test.cpp:330:ExecuteNext] INFO  - Txn 0 commits: Success

46: /Users/travis/build/cmu-db/peloton/test/concurrency/serializable_transaction_test.cpp:90: Failure

46: Value of: scheduler.schedules[0].results.size()

46:   Actual: 0

46: Expected: 10

46: [  FAILED  ] SerializableTransactionTests.ReadOnlyTransactionTest (366 ms)
tli2 commented 6 years ago

Also worth noting, on both occasions observed, this happens on mac builds only

mbutrovich commented 6 years ago

I’ll spend some time on it today.

mbutrovich commented 6 years ago

Haven't been able to reproduce it locally all day :(

lmwnshn commented 6 years ago

I just tried running it 10-20 times and got it to fail:

[build] ASAN_OPTIONS=detect_container_overflow=0 ./test/serializable_transaction_test
Running main() from gmock_main.cc
[==========] Running 6 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 6 tests from SerializableTransactionTests
[ RUN      ] SerializableTransactionTests.TransactionTest
[       OK ] SerializableTransactionTests.TransactionTest (28 ms)
[ RUN      ] SerializableTransactionTests.ReadOnlyTransactionTest
2018-06-26 13:46:19 [test/concurrency/testing_transaction_util.cpp:192:CreateTable] INFO  - the database_id is 16777216
2018-06-26 13:46:19 [test/concurrency/serializable_transaction_test.cpp:314:ExecuteNext] INFO  - Txn 0 commits: Success
/Users/wanshenl/Documents/pavlo/peloton/test/concurrency/serializable_transaction_test.cpp:90: Failure
Value of: scheduler.schedules[0].results.size()
  Actual: 0
Expected: 10
[  FAILED  ] SerializableTransactionTests.ReadOnlyTransactionTest (217 ms)

Same failure. Didn't have to rebuild between runs.

mbutrovich commented 6 years ago

Can you get a TRACE level log of it to get a better idea what's happening? I still can't get it to reproduce.

mbutrovich commented 6 years ago

@lmwnshn helpfully ran it 1000 times on this commit just before I did all the TransactionContext refactoring in the last couple of weeks. It still reproduced. My suspicion right now is that in rare cases the bootstrapping process (which seems slower in the last month) takes longer than we wait in the test (1 epoch) and it triggers an abort on the scan. We have 2 experiments to try today:

  1. increase the wait time in the test on master
  2. go back to a commit a month or so ago before we had all the bootstrapping changes. I need to identify the exact commit.

I should add that we've also seen it on the Ubuntu boxes, so this is no longer a Mac-exclusive issue.

mbutrovich commented 6 years ago

The test we occasionally fail was modified in #1322. I'll have to spend more time looking at it tomorrow to understand why we seem to have a race.