cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.07k stars 3.8k forks source link

jepsen: Failed to stop ntpd #25834

Closed cockroach-teamcity closed 6 years ago

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/5bc5488c856b8a8b618c239c6616b45be6a1942a

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=673532&tab=buildLog

    jepsen.go:244,jepsen.go:288: /home/agent/work/.go/bin/roachprod run teamcity-673532-jepsen:6 -- bash -e -c "\
        cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
         ~/lein run test \
           --tarball file://${PWD}/cockroach.tgz \
           --username ${USER} \
           --ssh-private-key ~/.ssh/id_rsa \
           --os ubuntu \
           --time-limit 300 \
           --concurrency 30 \
           --recovery-time 25 \
           --test-count 1 \
           -n 10.142.15.216 -n 10.142.15.208 -n 10.142.15.211 -n 10.142.15.212 -n 10.142.15.215 \
           --test g2 --nemesis majority-ring \
        > invoke.log 2>&1 \
        ": exit status 1
bdarnell commented 6 years ago
WARN [2018-05-23 09:17:00,284] jepsen node 10.142.15.208 - jepsen.control Encountered error with conn [:control "10.142.15.208"]; reopening
ERROR [2018-05-23 09:17:01,892] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: sudo -S -u root bash -c "cd /; service ntp stop || while ! pgrep ntpd ; do sleep 1 ; service ntp stop || true ; done" returned non-zero exit status -1 on 10.142.15.208. STDOUT:

NTP restarts are still failing with my latest attempt at a restart loop. And it's only taking a little over a second, so it's not a timeout.

I've been duping these onto #20492 but that one's been closed, so I'll leave this one as the new home for this issue.

bdarnell commented 6 years ago

A little more logging with bash -x. Does this mean that the || is not getting interpreted as it should?

ERROR [2018-05-26 14:03:37,815] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: sudo -S -u root bash -x -c "cd /; service ntp stop || while ! pgrep ntpd ; do sleep 1 ; service ntp stop || true ; done" returned non-zero exit status -1 on 10.128.0.8. STDOUT:

STDERR:
+ cd /
+ service ntp stop
petermattis commented 6 years ago

Agreed that looks weird. What happened to the rest of the bash command? It's crazy how difficult it is to reliably stop ntp.

bdarnell commented 6 years ago

To diagnose these: go to the artifacts tab in the teamcity run, open the jepsen/X/Y subdirectory and look at test.log. For the NTP stop issue, the log will be short with the service ntp stop exception in the middle of it.

bdarnell commented 6 years ago

I noticed something new here. I've been running with bash -x to see what's going on, and most failures now look like this:

ERROR [2018-07-07 13:32:00,031] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: sudo -S -u root bash -x -c "cd /; service ntp stop || while ! pgrep ntpd ; do sleep 1 ; service ntp stop || true ; done" returned non-zero exit status -1 on 10.128.0.26. STDOUT:

STDERR:
+ cd /
+ service ntp stop

This has suggested that something is going on in that shell command (like the pipe characters getting escaped or interpreted by the wrong layer of the clojure/ssh/shell stack). However, our most recent failure produced the error code of -1 without any bash output:

ERROR [2018-07-07 14:28:39,513] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: sudo -S -u root bash -x -c "cd /; service ntp stop || while ! pgrep ntpd ; do sleep 1 ; service ntp stop || true ; done" returned non-zero exit status -1 on 10.128.0.27. STDOUT:

STDERR:

This looks like instead of the shell exiting with an error, the shell and/or connection is somehow getting killed and the output may or may not have a chance to make it out in time.

So here's a theory: ntpd is getting managed by systemd as a legacy init.d-based service. This means it's using pidfiles instead of more reliable child-process tracking. If a stale pidfile exists, service ntp stop may kill an arbitrary process. There are several chances for this to strike: the sshd process itself, sudo, the inner shell (although would any of these but the sshd process produce an exit code of -1?).

The biggest hole in the theory is that this error can (and often does) occur more than once in the same 80-test run, indicating a fairly high probability. That doesn't completely disqualify it (we churn through a lot of pids in these tests, and at a fairly predictable rate, so it's possible that the number of pids used in a test aligns with the rate at which pids get recycled), but it seems pretty unlikely. This would also mean that something is recreating the pidfile after it is deleted (or it fails to get deleted by service ntp stop). I haven't seen any evidence of something that would try to do that.

bdarnell commented 6 years ago

From additional debugging (in #27394), it's not a stale pid file - no node has a /var/run/ntp* pidfile when the failing test run starts. And it doesn't look like a problem of escaping the pipe characters, as I had suspected:

10075 ?        S      0:00 bash -x -c cd /; ps ax ; service ntp stop || while ! pgrep ntpd ; do sleep 1 ; service ntp stop || true ; done

(Since ps doesn't distinguish spaces within and between arguments, it's still possible that there's an escaping problem that's preventing the entire command from being run, but I can't find any way to break this statement that results in an exit status of -1 with no other output, nor can I see anything in the stack that would plausibly cause such a parsing failure)

On further testing, I see that I've gotten that while condition backwards, so this loop would be infinite if it were ever executed. But that's not the failure mode we're seeing; when it fails it does so quickly (and the -x trace shows no evidence of anything after || being executed).

bdarnell commented 6 years ago

Rewriting to minimize special characters doesn't seem to have helped:

ERROR [2018-07-16 18:27:28,756] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: sudo -S -u root bash -x -c "cd /; if service ntp stop ; then true ; else while pgrep ntpd ; do sleep 1 ; service ntp stop || true ; done fi" returned non-zero exit status -1 on 10.128.0.10. STDOUT:

STDERR:
+ cd /
+ service ntp stop
bdarnell commented 6 years ago

cockroachdb/jepsen#18 seems to have finally fixed this.

tbg commented 6 years ago

Can't believe we're getting to see the day.

On Wed, Aug 15, 2018, 8:55 PM Ben Darnell notifications@github.com wrote:

Closed #25834 https://github.com/cockroachdb/cockroach/issues/25834.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cockroachdb/cockroach/issues/25834#event-1790724932, or mute the thread https://github.com/notifications/unsubscribe-auth/AE135K1H_PycaatU0R11AktQZjQvutcOks5uRG6ngaJpZM4UKEjw .

--

-- Tobias