cockroachdb / cockroach


teamcity: Jepsen tests fail due to ssh connect timeout #19994

Closed: cockroach-teamcity closed this issue 6 years ago

cockroach-teamcity commented 6 years ago

The following tests appear to have failed:

#405731:

--- FAIL: Jepsen/Jepsen: JepsenG2: JepsenG2/subcritical-skews (86.329s)
--- FAIL: Jepsen/Jepsen: JepsenMonotonic: JepsenMonotonic/subcritical-skews+start-kill-2 (86.115s)
--- FAIL: Jepsen/Jepsen: JepsenSequential: JepsenSequential/subcritical-skews+start-kill-2 (609.970s)
--- FAIL: Jepsen/Jepsen: JepsenSequential: JepsenSequential/majority-ring+start-kill-2 (0.327s)
--- FAIL: Jepsen/Jepsen: JepsenSequential: JepsenSequential/parts+start-kill-2 (0.318s)
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/majority-ring (0.315s)
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/split (0.320s)
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/start-kill-2 (0.324s)
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/start-stop-2 (0.311s)
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/strobe-skews (0.310s)
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/subcritical-skews (0.321s)
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/majority-ring+subcritical-skews (0.317s)
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/subcritical-skews+start-kill-2 (0.317s)
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/majority-ring+start-kill-2 (0.324s)
--- FAIL: Jepsen/Jepsen: JepsenSets: JepsenSets/parts+start-kill-2 (0.325s)

Please assign, take a look and update the issue accordingly.

bdarnell commented 6 years ago

These are all ssh timeouts (the first we've captured since adding -v to our final scp command in #19875)

[13:45:31][/subcritical-skews+start-kill-2] +(teamcity-jepsen-run-one.sh) scp -o ServerAliveInterval=60 -o 'StrictHostKeyChecking no' -i /home/agent/.ssh/google_compute_engine -C -r -v -v -v 'ubuntu@35.190.162.127:jepsen/cockroachdb/store/latest/{test.fressian,results.edn,latency-quantiles.png,latency-raw.png,rate.png}' Sequential_subcritical-skews+start-kill-2
[13:45:31][/subcritical-skews+start-kill-2] Executing: program /usr/bin/ssh host 35.190.162.127, user ubuntu, command scp -v -r -f jepsen/cockroachdb/store/latest/{test.fressian,results.edn,latency-quantiles.png,latency-raw.png,rate.png}
[13:45:31][/subcritical-skews+start-kill-2] OpenSSH_7.2p2 Ubuntu-4ubuntu2.2, OpenSSL 1.0.2g  1 Mar 2016
[13:45:31][/subcritical-skews+start-kill-2] debug1: Reading configuration data /etc/ssh/ssh_config
[13:45:31][/subcritical-skews+start-kill-2] debug1: /etc/ssh/ssh_config line 19: Applying options for *
[13:45:31][/subcritical-skews+start-kill-2] debug2: resolving "35.190.162.127" port 22
[13:45:31][/subcritical-skews+start-kill-2] debug2: ssh_connect_direct: needpriv 0
[13:45:31][/subcritical-skews+start-kill-2] debug1: Connecting to 35.190.162.127 [35.190.162.127] port 22.
[13:47:38][/subcritical-skews+start-kill-2] debug1: connect to address 35.190.162.127 port 22: Connection timed out
[13:47:38][/subcritical-skews+start-kill-2] ssh: connect to host 35.190.162.127 port 22: Connection timed out

The subcritical-skews nemesis runs a lot of ssh commands internally, and the first few failures hit those commands, killing only the affected test. Then, in JepsenSequential/subcritical-skews+start-kill-2, we hit this error in one of the final post-test scp commands; that failure is caught at a point that tears down the test cluster, which fails all of the remaining tests.
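(One way to keep a single flaky scp from cascading would be a bounded retry around that final copy. A hypothetical sketch for teamcity-jepsen-run-one.sh; the retry_scp helper, retry counts, and ConnectTimeout value are invented for illustration:

    # Retry the final scp a few times before giving up, so one timed-out
    # connection does not tear down the cluster and fail the remaining tests.
    retry_scp() {
      local attempt
      for attempt in 1 2 3; do
        scp -o ServerAliveInterval=60 -o 'StrictHostKeyChecking no' \
            -o ConnectTimeout=30 "$@" && return 0
        echo "scp attempt ${attempt} failed; retrying in 10s" >&2
        sleep 10
      done
      return 1
    }

This only papers over the symptom, though; the rate limiting itself still needs fixing.)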

This looks like it's probably some sort of anti-brute-force rate limiting in sshd (or a firewall in front of it). The best fix is probably either to whitelist the controller's IP address in the worker machines' configuration, or to enable ControlMaster in the controller's .ssh/config, so that all sessions to a worker are multiplexed over a single authenticated TCP connection instead of each command opening a fresh one.
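For reference, a minimal sketch of the ControlMaster approach (standard OpenSSH client options; the Host pattern below is illustrative, not the actual worker addressing scheme):

    # ~/.ssh/config on the Jepsen controller; Host pattern is illustrative.
    Host 35.190.*
        # Multiplex all sessions to a worker over one shared TCP connection,
        # so repeated ssh/scp commands don't count as new connection attempts.
        ControlMaster auto
        ControlPath ~/.ssh/cm-%r@%h-%p
        # Keep the shared connection alive 10 minutes after the last session.
        ControlPersist 10m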

jordanlewis commented 6 years ago

Oh, I bet I know what this is. GCE was automatically installing something called sshguard on clusters created with roachprod, and it sometimes caused new connections to hang. Once I killed sshguard the problem went away.

I believe @mberhault disabled sshguard on the roachprod instances - we probably need to do the same thing here.
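A minimal sketch of what disabling it could look like on the Jepsen worker images, assuming Ubuntu with systemd and sshguard installed as a system service (the whitelist path is the Debian/Ubuntu default; the controller IP is a placeholder):

    # Stop sshguard and keep it from starting on boot.
    sudo systemctl stop sshguard
    sudo systemctl disable sshguard

    # Alternatively, keep sshguard but whitelist the controller's address
    # (10.0.0.5 is a placeholder for the actual controller IP).
    echo "10.0.0.5" | sudo tee -a /etc/sshguard/whitelist
    sudo systemctl restart sshguard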