
tests: serializability violation running linearizable with process crashes #5653

Closed: knz closed this issue 8 years ago

knz commented 8 years ago

Detected in:

1) atomic compare-and-set

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160323T230041#20160323T230041 http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160324T004243#20160324T004243

2) monotonic-multitable

http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160324T000025#20160324T000025

These are two different tests, but they share a common nemesis that forcefully kills processes with SIGKILL before restarting them. The first test finds a linearizability anomaly, the second a serializability anomaly.

tbg commented 8 years ago

I looked at the various logs for the second one (more interested in serializability failures at the moment) but there isn't much that I can really parse. Can you give me some idea of what's going on? http://jepsen.cockroachdb.com/cgi-bin/display.py?path=cockroachdb-atomic-lin%3Askews-startkill2%2F20160324T032842.000Z%2Fresults.edn isn't really descriptive (looks more like Jepsen crashed).

For the other one, the verifier output http://jepsen.cockroachdb.com/cgi-bin/display.py?path=cockroachdb-atomic-lin%3Astartkill2%2F20160323T230041.000Z%2Fresults.edn is more descriptive, but it's hard to map this failing history to operations which have happened (at least from the log http://jepsen.cockroachdb.com/cockroachdb-atomic-lin:startkill2/20160323T230041.000Z/history.txt) because there are no timestamps. How would you go about finding the parts that actually went wrong in the logs?

knz commented 8 years ago

OK, it looks like I screwed up the links in the original description. I've added the correct links now.

Regarding the first test: how do you match the history to the result summary? Three details in the result summary give you the reference: the process number, the sequence number, and the index.

The process number is the first column in the history, and the sequence number is the first element of the argument to the operation. For example, when the entry for map key 12 says that process 22 could not read 4 from register 1 at index 36, that's because it should have read 1 (the expected value at that point). You find this by:

  1. restricting the history to all events that have the value 12 in their operand, then
  2. looking at the 36th event in that restricted history; it should be the one issued by process 22 (I'm not 100% sure how to count yet, I have to check). A rough sketch of this filtering is shown below.
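
As a rough illustration of those two steps, here is a small Python sketch (hypothetical, not part of the test suite) that filters `history.txt` by the operand's key and prints positional indices. It assumes each line looks roughly like `<process> <type> <f> [<key> <value>]`, which may not match the real layout exactly:

```python
import re
import sys

def filter_history(path, key):
    """Return the history lines whose operand starts with the given key."""
    matching = []
    with open(path) as f:
        for line in f:
            # Assumed line shape: "<process> <type> <f> [<key> <value>]";
            # the real history.txt layout may differ slightly.
            m = re.search(r"\[(\d+)[\s,]+\S+\]", line)
            if m and int(m.group(1)) == key:
                matching.append(line.rstrip())
    return matching

if __name__ == "__main__":
    # e.g.  python filter_history.py history.txt 12
    for index, entry in enumerate(filter_history(sys.argv[1], int(sys.argv[2]))):
        print(index, entry)
    # Per the steps above, the flagged read by process 22 should appear around
    # index 36 (exactly how the checker counts is unclear, per the note above).
```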

Analysis in follow-up comment

knz commented 8 years ago

For the first linked history, what comes up is rather clear. First, in the 12th history, process 0 successfully writes 1. Then process 2 issues a write of 4, which eventually fails because the server dies. Despite the server dying, subsequent reads report 4 even though the write was never acknowledged.

Now I realize this is not so much a serializability error as a mistake in the test code. A server crash should not be reported in the history as a failed write, but merely as an unknown outcome: the update may or may not have succeeded. So I am inclined to ignore this negative test result for now (and I will fix the test too). Let's look at the other histories and see whether they have a similar situation.
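
The distinction above (definite failure vs. unknown outcome) is exactly what the checker keys on. Here is a minimal, hypothetical sketch of the idea in Python (the real Jepsen client is Clojure, and `ServerDied`/`WriteRejected` are made-up exception names):

```python
import enum

class Outcome(enum.Enum):
    OK = "ok"          # the write is known to have been applied
    FAIL = "fail"      # the write is known NOT to have been applied
    UNKNOWN = "info"   # the ack was lost; either outcome is possible

class ServerDied(Exception):
    """Hypothetical: the connection dropped before a reply arrived."""

class WriteRejected(Exception):
    """Hypothetical: the server explicitly rejected the write."""

def classify_write(do_write):
    """Run a write attempt and classify it the way the checker needs."""
    try:
        do_write()
        return Outcome.OK
    except WriteRejected:
        # A definite, server-reported rejection: safe to record as a failure.
        return Outcome.FAIL
    except ServerDied:
        # The process was SIGKILLed before acknowledging. Recording FAIL here
        # is exactly the test bug described above: a later read of the written
        # value then looks like a violation. UNKNOWN ("info" in Jepsen
        # histories) lets the checker accept either outcome.
        return Outcome.UNKNOWN

if __name__ == "__main__":
    def crashed_write():
        raise ServerDied("kill -9 before the ack")
    print(classify_write(crashed_write))  # Outcome.UNKNOWN
```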

tbg commented 8 years ago

Yep, that sounds like a test bug. Thanks for the analysis hints.

knz commented 8 years ago

Second test, 6th history: process 3 writes 3 successfully. Then process 0 issues a CAS from 3 to 0. The server dies before the CAS is acknowledged, yet some readers see 0. Same issue as before: the test should report unknown instead of failure. We can ignore that too.

knz commented 8 years ago

The 3rd test is entirely different. There the results say that a linearizability error has occurred: two transactions that should appear linearizable are not. We know both were enacted in the database, because the final read reports their inserted records, yet the server was killed before either transaction was acknowledged to the client. The fact that the transactions involved in the linearizability error sit at the boundary of a server kill supports the suspicion that kills may not be handled properly.

Clock skews were involved but no server committed suicide as a result.

Can you see a potential scenario that could explain the observation?

bdarnell commented 8 years ago

How much time passes between the kill -9 and the restart? The HLC state is not preserved across restarts, so it's possible for time to move backwards if the restart occurs before the max offset has elapsed.
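
A toy timeline of that window, with made-up numbers and an assumed 250 ms max offset; this is only a sketch of the mechanism as described above, not CockroachDB code:

```python
MAX_OFFSET_MS = 250  # assumed max clock offset, not taken from the test config

# The sketch assumes the node's hybrid logical clock (HLC) had been pushed
# ahead of its own wall clock by timestamps from a faster peer, by at most the
# max offset.
hlc_before_crash_ms = 1_000_400       # last timestamp the node handed out
wall_clock_at_restart_ms = 1_000_200  # the node's (slower) wall clock

# The HLC state is not persisted across a kill -9, so after the restart the
# clock re-seeds from the wall clock alone.
hlc_after_restart_ms = wall_clock_at_restart_ms

if hlc_after_restart_ms < hlc_before_crash_ms:
    print("time moved backwards by",
          hlc_before_crash_ms - hlc_after_restart_ms, "ms")

# Waiting out at least the max offset before restarting closes the window,
# which is why the 5-second delay mentioned below should be comfortably enough.
restart_delay_ms = 5_000
print("restart delay covers the window:", restart_delay_ms >= MAX_OFFSET_MS)
```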

knz commented 8 years ago

This test used a 5-second delay between the kill and the restart.

aphyr commented 8 years ago

If I understand correctly, the first two results (for an atomic register) are spurious, but the third result (for monotonic-multitable) is real.

Extensive atomic register testing with the 20160721 and 20160821 builds has not revealed any nonlinearizable results related to crashes with clock skew, though I suspect they should be there, because clock skew can cause a write to return before its 250-millisecond deadline.

The monotonic tests uncovered #9083, which allowed two transactions that should have conflicted to commit with the same timestamp and either identical or different table states; I suspect that could have allowed this issue as well. I'll re-test when we have a new build with #9083's fixes. :)

aphyr commented 8 years ago

On the latest builds, multiple days of register and monotonic tests with subcritical skews seem to check out okay. I'm calling this one good to close!