I looked at the various logs for the second one (more interested in serializability failures at the moment) but there isn't much that I can really parse. Can you give me some idea of what's going on? http://jepsen.cockroachdb.com/cgi-bin/display.py?path=cockroachdb-atomic-lin%3Askews-startkill2%2F20160324T032842.000Z%2Fresults.edn isn't really descriptive (looks more like Jepsen crashed).
For the other one, the verifier output http://jepsen.cockroachdb.com/cgi-bin/display.py?path=cockroachdb-atomic-lin%3Astartkill2%2F20160323T230041.000Z%2Fresults.edn is more descriptive, but it's hard to map this failing history to operations which have happened (at least from the log http://jepsen.cockroachdb.com/cockroachdb-atomic-lin:startkill2/20160323T230041.000Z/history.txt) because there are no timestamps. How would you go about finding the parts that actually went wrong in the logs?
Ok it looks like I screwed up with the links in the original description. I added the correct links now.
Regarding the first test: how to match the history to the result summary? In the result summary, three details give you the reference:
1) the key in the :results map (the keys are listed in the order 0, 7, 1, 4, ...);
2) the process number, which is the first column in the history;
3) the sequence number, which is the first column in the argument to the operation.
For example, when map key 12 says that process 22 could not read 4 from register 1 at index 36, that's because it should have read 1 (the expected value at that point). You find this by the analysis in the follow-up comment.
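To make the cross-referencing concrete, here is a small hypothetical helper (not part of the test suite), assuming the history has been read back as a vector of Jepsen op maps whose :value is a [register value] pair, as in this test:

```clojure
;; Hypothetical helper, not part of the Jepsen cockroachdb tests. Assumes the
;; history is a vector of op maps such as
;;   {:process 22, :type :ok, :f :read, :value [1 4]}
;; where :value is a [register value] pair.
(defn ops-for-register
  "Returns [index op] pairs for every operation touching register k, so the
  failing index reported in results.edn can be located in the history."
  [history k]
  (->> (map-indexed vector history)
       (filter (fn [[_ op]]
                 (and (vector? (:value op))
                      (= k (first (:value op))))))))

;; Example: list everything that touched register 1, then look around the
;; failing index (36 above) to see which write should have been observed.
;; (ops-for-register history 1)
```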
For the first linked history, what comes up is rather clear. First, in the 12th history, process 0 successfully writes 1. Then process 2 issues a write of 4, which eventually fails because the server dies. Despite the server dying, subsequent reads report 4 even though the write was never acknowledged.
Now I just realized that this is not so much a serializability error as a mistake in the test code. The server crashing should not be reported in the history as a failure to write, but merely as it being unknown whether the update succeeded. So I am inclined to ignore this negative test result for now (and I will fix the test too). Let's look at the other histories to see whether they show a similar situation.
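The fix will look roughly like the following minimal sketch (the predicate and function names are illustrative, not the actual test code): when the server dies before a write is acknowledged, the op should complete as indeterminate (:type :info) rather than failed (:type :fail), so the checker admits both "applied" and "not applied".

```clojure
;; Minimal sketch of the intended fix; names are illustrative, not the actual
;; test code. Connection drops and node crashes leave the outcome unknown, so
;; they must not be recorded as definite failures.
(defn definite-failure?
  "Placeholder predicate: true only for errors that guarantee the statement
  did not commit; connection drops and node crashes are NOT definite."
  [e]
  (instance? java.sql.SQLSyntaxErrorException e))

(defn complete-op
  "Annotates an invoked op with its completion type after an exception."
  [op e]
  (if (definite-failure? e)
    (assoc op :type :fail, :error (str e))
    (assoc op :type :info, :error :unknown-server-died)))
```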
Yep, that sounds like a test bug. Thanks for the analysis hints.
2nd test, 6th history: process 3 writes 3 successfully. Then process 0 issues a CAS from 3 to 0. The server dies before the CAS is acknowledged, yet some readers see 0. Same issue as before: the test should report unknown instead of failure. We can ignore that too.
The 3rd test is entirely different. There the results say that a linearizability error has occurred: two transactions that should appear linearizable are not. We know for sure that both were applied in the database, because the final read reports their inserted records. However, the server was killed before either transaction was acknowledged to the client. The fact that the transactions involved in the linearizability error sit at the boundary of a server kill supports the suspicion that kills may not be handled properly.
Clock skews were involved but no server committed suicide as a result.
Can you see a potential scenario that could explain the observation?
How much time passes between the kill -9 and the restart? The HLC state is not preserved across restarts, so it's possible for time to move backwards if the restart occurs before the max offset has elapsed.
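To illustrate with made-up numbers (a toy sketch, not CockroachDB code): a peer running ahead can push the local HLC up to max-offset past the local wall clock; if the node restarts from its wall clock before that offset has elapsed, it can hand out timestamps lower than ones it already used.

```clojure
;; Toy illustration with made-up numbers, not CockroachDB code.
(let [wall-clock        1000                          ; local wall time (ms)
      max-offset         250                          ; allowed clock skew (ms)
      hlc-before-kill   (+ wall-clock max-offset)     ; HLC pushed ahead by a fast peer
      hlc-after-restart wall-clock]                   ; state lost; restart before max-offset elapsed
  (println "highest timestamp handed out before the kill:" hlc-before-kill)
  (println "first timestamp after the restart:           " hlc-after-restart)
  (println "time moved backwards?" (< hlc-after-restart hlc-before-kill)))
```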
This test used a 5-second delay between the kill and the restart.
If I understand correctly, the first two results (for an atomic register) are spurious, but the third result (for monotonic-multitable) is real.
Extensive atomic register testing with the 20160721 and 20160821 builds has not revealed any nonlinearizable results related to crashes with clock skew, though I suspect they should be there, because clock skew can cause a write to return before its 250-millisecond deadline.
The monotonic tests uncovered #9083, which allowed two transactions that should have conflicted to commit with the same timestamp and either identical or different table states; I suspect that could have allowed this issue as well. I'll re-test when we have a new build with #9083's fixes. :)
On the latest builds, multiple days of register and monotonic tests with subcritical skews seem to check out okay. I'm calling this one good to close!
Detected in:
1) atomic compare-and-set
http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160323T230041#20160323T230041
http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160324T004243#20160324T004243
2) monotonic-multitable
http://jepsen.cockroachdb.com/cgi-bin/display.py?entry=20160324T000025#20160324T000025
These are two different tests, but they share a common nemesis that forcefully kills processes with SIGKILL before restarting them. The first test finds a linearizability anomaly, the second one a serializability anomaly.
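For reference, the nemesis behaviour amounts to roughly the following sketch (hypothetical, using jepsen.control to shell out; the actual nemesis code and start command in the test differ):

```clojure
;; Hypothetical sketch of the nemesis behaviour, not the actual test code.
(require '[jepsen.control :as c])

(defn kill-and-restart!
  "SIGKILLs the cockroach process on the current node, waits, then restarts
  it via the supplied restart! function (whatever start command the test uses)."
  [restart! pause-secs]
  (c/su (c/exec :pkill :-9 :cockroach))   ; forceful kill: no graceful shutdown
  (Thread/sleep (* 1000 pause-secs))      ; e.g. the 5 s pause used in these runs
  (restart!))
```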