cockroachdb / cockroach


jepsen: "silent too long" #21048

Closed cockroach-teamcity closed 6 years ago

cockroach-teamcity commented 6 years ago

The following tests appear to have failed:

#457005:

--- FAIL: Jepsen/Jepsen: JepsenRegister: JepsenRegister/subcritical-skews+start-kill-2 (1211.152s)
None
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/parts+start-kill-2 (1204.367s)
None

Please assign, take a look and update the issue accordingly.

bdarnell commented 6 years ago

These are both "silent too long" failures. In both cases, the jepsen process is waiting here in the setup phase.

Logs before this point:

WARN [2017-12-26 12:44:22,569] jepsen node 35.196.138.29 - jepsen.control.util DEPRECATED: jepsen.control.util/install-tarball! is now named jepsen.control.util/install-archive!, and the `node` argument is no longer required.
WARN [2017-12-26 12:44:22,569] jepsen node 35.185.76.85 - jepsen.control.util DEPRECATED: jepsen.control.util/install-tarball! is now named jepsen.control.util/install-archive!, and the `node` argument is no longer required.
WARN [2017-12-26 12:44:22,569] jepsen node 35.227.18.54 - jepsen.control.util DEPRECATED: jepsen.control.util/install-tarball! is now named jepsen.control.util/install-archive!, and the `node` argument is no longer required.
WARN [2017-12-26 12:44:22,569] jepsen node 35.190.129.71 - jepsen.control.util DEPRECATED: jepsen.control.util/install-tarball! is now named jepsen.control.util/install-archive!, and the `node` argument is no longer required.
WARN [2017-12-26 12:44:22,624] jepsen node 35.196.118.238 - jepsen.control.util DEPRECATED: jepsen.control.util/install-tarball! is now named jepsen.control.util/install-archive!, and the `node` argument is no longer required.
WARN [2017-12-26 12:44:24,407] jepsen node 35.227.18.54 - jepsen.control Encountered error with conn [:control "35.227.18.54"]; reopening
INFO [2017-12-26 12:44:25,637] jepsen node 35.196.138.29 - jepsen.cockroach.auto 35.196.138.29 Cockroach installed
INFO [2017-12-26 12:44:25,738] jepsen node 35.190.129.71 - jepsen.cockroach.auto 35.190.129.71 Cockroach installed
INFO [2017-12-26 12:44:25,739] jepsen node 35.185.76.85 - jepsen.cockroach.auto 35.185.76.85 Cockroach installed
INFO [2017-12-26 12:44:25,778] jepsen node 35.196.118.238 - jepsen.cockroach.auto 35.196.118.238 Cockroach installed
INFO [2017-12-26 12:44:32,449] jepsen node 35.196.138.29 - jepsen.cockroach.auto 35.196.138.29 clock reset: 26 Dec 12:44:32 ntpdate[5535]: step time server 91.189.91.157 offset 0.000099 sec
INFO [2017-12-26 12:44:32,549] jepsen node 35.185.76.85 - jepsen.cockroach.auto 35.185.76.85 clock reset: 26 Dec 12:44:32 ntpdate[4880]: step time server 91.189.91.157 offset 0.000003 sec
INFO [2017-12-26 12:44:32,549] jepsen node 35.190.129.71 - jepsen.cockroach.auto 35.190.129.71 clock reset: 26 Dec 12:44:32 ntpdate[6092]: step time server 91.189.91.157 offset 0.000004 sec
INFO [2017-12-26 12:44:32,589] jepsen node 35.196.118.238 - jepsen.cockroach.auto 35.196.118.238 clock reset: 26 Dec 12:44:32 ntpdate[6269]: step time server 91.189.91.157 offset -0.000012 sec

This appears to indicate some sort of failure while installing that was neither retried nor logged as an exception: 35.227.18.54 hit a connection error at 12:44:24 and, unlike the other four nodes, never logs "Cockroach installed" or a clock reset. I suspect that something between the implicit parallelization over all nodes and the automatic retries is not layered properly, but it's hard to follow. auth.log on the failing node doesn't say anything interesting at this time.
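For anyone triaging a similar excerpt, here is a rough sketch of how one might find the node that went quiet during setup. It assumes the log prefix format shown above ("jepsen node <ip> - ..."). The function name and the sample list are hypothetical and not part of Jepsen or the test runner; the two sample lines are copied from the excerpt.

```python
import re

# Matches the node address in the log prefix, e.g. "jepsen node 35.227.18.54 - ...".
# Assumption: every relevant line carries this prefix, as in the excerpt above.
NODE_RE = re.compile(r"jepsen node (\d+\.\d+\.\d+\.\d+) - ")


def nodes_missing_marker(log_lines, marker="Cockroach installed"):
    """Return the nodes that appear in the log but never emit `marker`."""
    seen, done = set(), set()
    for line in log_lines:
        match = NODE_RE.search(line)
        if not match:
            continue  # not a per-node log line
        node = match.group(1)
        seen.add(node)
        if marker in line:
            done.add(node)
    return sorted(seen - done)


# Two lines taken from the excerpt: one node reconnects, another finishes installing.
SAMPLE = [
    'WARN [2017-12-26 12:44:24,407] jepsen node 35.227.18.54 - jepsen.control '
    'Encountered error with conn [:control "35.227.18.54"]; reopening',
    "INFO [2017-12-26 12:44:25,637] jepsen node 35.196.138.29 - "
    "jepsen.cockroach.auto 35.196.138.29 Cockroach installed",
]

print(nodes_missing_marker(SAMPLE))
```

Passing marker="clock reset" instead would check the next setup step the same way.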

bdarnell commented 6 years ago

The "silent too long" checks are no longer present in the roachtest jepsen runner, so the effective timeouts for test setup are now more generous. We'll see if we keep running into them, but it doesn't seem to have been an issue in the past week.
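For context, a "silent too long" check is generally a watchdog that fails a run when the process produces no output for some interval. The sketch below only illustrates that idea; it is not the code that was removed from the runner, and the class name, the limit, and the injectable clock are all assumptions made for illustration.

```python
import time


class SilenceWatchdog:
    """Flags a run once no output has been seen for more than `limit` seconds.

    Hypothetical helper for illustration; not taken from roachtest or Jepsen.
    """

    def __init__(self, limit, clock=time.monotonic):
        self.limit = limit
        self.clock = clock  # injectable so the check can be exercised without sleeping
        self.last_output = clock()

    def note_output(self):
        """Call whenever the watched process prints a line."""
        self.last_output = self.clock()

    def silent_too_long(self):
        """True if the gap since the last output exceeds the limit."""
        return self.clock() - self.last_output > self.limit
```

A setup phase that blocks on one node without logging anything, as in the excerpt above, is exactly the case such a check trips on, which is why dropping it leaves only the coarser overall test timeout in effect.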