cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

teamcity: failed tests on master: Jepsen/Jepsen: JepsenSequential: JepsenSequential/start-stop-2, Jepsen/Jepsen: JepsenG2: JepsenG2/majority-ring+start-kill-2, Jepsen/Jepsen: JepsenG2: JepsenG2/parts+start-kill-2 #19930

cockroach-teamcity closed this issue 6 years ago

cockroach-teamcity commented 6 years ago

The following tests appear to have failed:

#402485:

--- FAIL: Jepsen/Jepsen: JepsenComments: JepsenComments/subcritical-skews (564.437s)
None
--- FAIL: Jepsen/Jepsen: JepsenComments: JepsenComments/majority-ring+subcritical-skews (576.436s)
None
--- FAIL: Jepsen/Jepsen: JepsenComments: JepsenComments/subcritical-skews+start-kill-2 (532.249s)
None
--- FAIL: Jepsen/Jepsen: JepsenSequential: JepsenSequential/start-stop-2 (62.768s)
None
--- FAIL: Jepsen/Jepsen: JepsenG2: JepsenG2/majority-ring+start-kill-2 (490.666s)
None
--- FAIL: Jepsen/Jepsen: JepsenG2: JepsenG2/parts+start-kill-2 (446.256s)
None

Please assign, take a look and update the issue accordingly.

bdarnell commented 6 years ago

Yay, the issue poster worked! (Except that it needs an "if branch == master" check so it doesn't post issues when building a PR.)
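A minimal sketch of that kind of branch guard follows. The function name `should_post_issue` and the way the branch name reaches it are assumptions for illustration only; the actual issue poster and how the CI exposes the branch are not shown in this thread.

```python
def should_post_issue(branch, failed_tests):
    """Hypothetical guard: post an issue only for failures on master builds."""
    # Assumption: the CI hands us either a bare branch name or a full ref.
    prefix = "refs/heads/"
    if branch.startswith(prefix):
        branch = branch[len(prefix):]
    # PR builds and other branches never post; neither do green builds.
    return branch == "master" and len(failed_tests) > 0
```

Called with a PR or feature branch name, this returns False, so the poster would stay quiet for those builds.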

The comments failures are expected: this test was designed to expose the difference between serializability and linearizability, so it's not supposed to pass with our serializable transactions. It has been passing spuriously (giving us false positives) all this time because we weren't running the clock skew nemesis.
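To make the distinction concrete, here is a small sketch. The names `Txn` and `respects_real_time` are hypothetical and come from neither Jepsen nor CockroachDB. Each transaction is modeled as a real-time interval. Serializability accepts any equivalent serial order, while linearizability additionally requires that a transaction which finished before another began also comes first in that order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    name: str
    start: float  # real time at which the client issued the transaction
    end: float    # real time at which the client saw it complete


def respects_real_time(serial_order):
    """True if no transaction is placed ahead of one that finished before it began."""
    for i, earlier in enumerate(serial_order):
        for later in serial_order[i + 1:]:
            # `later` completed before `earlier` even started, yet is ordered after it.
            if later.end < earlier.start:
                return False
    return True


t1 = Txn("t1", start=0.0, end=1.0)
t2 = Txn("t2", start=2.0, end=3.0)  # begins only after t1 has completed

# Both orders are serial orders, so either is acceptable under serializability.
# Only [t1, t2] also agrees with real time, the stronger linearizable requirement.
print(respects_real_time([t1, t2]))  # True
print(respects_real_time([t2, t1]))  # False
```

A test that checks the stronger property can therefore fail against a system that only promises serializability without indicating a bug.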

The sequential and g2 failures are new and will need to be investigated. The sequential test has been part of the nightly runs and has been passing for a long time; the g2 test was omitted from the configuration until recently, so it could have been broken for a while.

bdarnell commented 6 years ago

The sequential failure looks like a mishandled "connection refused" error, not a correctness issue. It may be related to the reported flakiness of the start-stop nemesis in #15736.

The g2 failures happened because the disks filled up on the worker machines. Since the failing configurations were the last two to be run, we probably just need to clean up logs between test runs.
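A rough sketch of cleaning up between runs is below. The helper names `clean_logs` and `run_all`, and the idea of a single log directory per worker, are assumptions for illustration; the real harness layout is not described in this thread.

```python
import shutil
from pathlib import Path


def clean_logs(log_dir):
    """Delete everything under log_dir but keep the directory; return entries removed."""
    log_dir = Path(log_dir)
    removed = 0
    if not log_dir.is_dir():
        return removed
    for entry in log_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a whole subdirectory of logs
        else:
            entry.unlink()        # remove a single file or symlink
        removed += 1
    return removed


def run_all(configs, run_one, log_dir):
    """Run each configuration in turn, clearing logs after every run."""
    results = {}
    for config in configs:
        try:
            results[config] = run_one(config)
        finally:
            # Runs even if run_one raises, so a failed run cannot fill the disk
            # for the configurations that follow it.
            clean_logs(log_dir)
    return results
```

In practice a harness would likely archive the logs of a failed run before deleting them, since those are the ones needed for debugging.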

bdarnell commented 6 years ago

With the reduced test suite, we're no longer (immediately) running into disk-full errors. I think everything else is covered by #19994.