[Closed] cockroach-teamcity · closed 6 years ago
These are all ssh timeouts (the first we've captured since adding `-v` to our final `scp` command in #19875):
```
[13:45:31][/subcritical-skews+start-kill-2] +(teamcity-jepsen-run-one.sh) scp -o ServerAliveInterval=60 -o 'StrictHostKeyChecking no' -i /home/agent/.ssh/google_compute_engine -C -r -v -v -v 'ubuntu@35.190.162.127:jepsen/cockroachdb/store/latest/{test.fressian,results.edn,latency-quantiles.png,latency-raw.png,rate.png}' Sequential_subcritical-skews+start-kill-2
[13:45:31][/subcritical-skews+start-kill-2] Executing: program /usr/bin/ssh host 35.190.162.127, user ubuntu, command scp -v -r -f jepsen/cockroachdb/store/latest/{test.fressian,results.edn,latency-quantiles.png,latency-raw.png,rate.png}
[13:45:31][/subcritical-skews+start-kill-2] OpenSSH_7.2p2 Ubuntu-4ubuntu2.2, OpenSSL 1.0.2g 1 Mar 2016
[13:45:31][/subcritical-skews+start-kill-2] debug1: Reading configuration data /etc/ssh/ssh_config
[13:45:31][/subcritical-skews+start-kill-2] debug1: /etc/ssh/ssh_config line 19: Applying options for *
[13:45:31][/subcritical-skews+start-kill-2] debug2: resolving "35.190.162.127" port 22
[13:45:31][/subcritical-skews+start-kill-2] debug2: ssh_connect_direct: needpriv 0
[13:45:31][/subcritical-skews+start-kill-2] debug1: Connecting to 35.190.162.127 [35.190.162.127] port 22.
[13:47:38][/subcritical-skews+start-kill-2] debug1: connect to address 35.190.162.127 port 22: Connection timed out
[13:47:38][/subcritical-skews+start-kill-2] ssh: connect to host 35.190.162.127 port 22: Connection timed out
```
The `subcritical-skews` nemesis runs a lot of ssh commands internally, and the first few failures hit those commands, killing only that test. Then, in `JepsenSequential/subcritical-skews+start-kill-2`, we hit this error in one of the final post-test `scp` commands, which gets caught in a place that tears down the test cluster and fails the rest of the tests.
This looks like it's probably some sort of anti-brute-force rate limiting in `sshd` (or some firewall in front of it). The best fix is probably either to whitelist the controller's IP address in the worker machines' configuration, or to enable `ControlMaster` in the controller's `.ssh/config`.
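For reference, a minimal sketch of what the `ControlMaster` approach could look like in the controller's `~/.ssh/config` (the host pattern here is hypothetical; the idea is that connection multiplexing reuses one authenticated TCP connection for all the nemesis's ssh/scp invocations, so any per-connection rate limiting on the worker side is hit far less often):

```
# Hypothetical host pattern matching the Jepsen worker machines.
Host 35.190.162.*
    # Reuse an existing master connection when one is available,
    # and start one automatically otherwise.
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    # Keep the master connection alive for a while after the last
    # session exits, so back-to-back scp commands share it.
    ControlPersist 10m
    ServerAliveInterval 60
```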
Oh, I bet I know what this is. There's something called `sshguard` that GCE was automatically installing on clusters created with `roachprod`, and it was causing new connections to hang sometimes. Once I killed `sshguard`, the problem went away.
I believe @mberhault disabled `sshguard` on the `roachprod` instances; we probably need to do the same thing here.
The following tests appear to have failed:

- #405731

Please assign, take a look, and update the issue accordingly.