cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.07k stars 3.8k forks source link

roachtest: jepsen/bank/majority-ring failed #90605

Closed cockroach-teamcity closed 1 year ago

cockroach-teamcity commented 2 years ago

roachtest.jepsen/bank/majority-ring failed with artifacts on release-22.2.0 @ e31e79a75b4a40aefe6a102d83004e4d710addff:

test artifacts and logs in: /artifacts/jepsen/bank/majority-ring/run_1
    jepsen.go:300,jepsen.go:361,test_runner.go:930: timed out

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-20856

srosenberg commented 2 years ago

It looks like jepsen run completed @10:41 but didn't exit until it was killed due to timeout @11:15?

2022-10-25 10:41:51,609{GMT}    INFO    [jepsen test runner] jepsen.core: Run complete, writing
2022-10-25 10:41:53,014{GMT}    INFO    [jepsen test runner] jepsen.core: Snarfing log files
2022-10-25 10:41:53,018{GMT}    INFO    [jepsen node 10.142.2.66] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2022-10-25 10:41:53,018{GMT}    INFO    [jepsen node 10.142.0.36] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2022-10-25 10:41:53,018{GMT}    INFO    [jepsen node 10.142.2.2] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2022-10-25 10:41:53,018{GMT}    INFO    [jepsen node 10.142.1.38] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2022-10-25 10:41:53,020{GMT}    INFO    [jepsen node 10.142.1.115] jepsen.core: downloading /opt/cockroach/logs/cockroach.stderr to cockroach.stderr
2022-10-25 10:41:53,070{GMT}    INFO    [jepsen node 10.142.2.2] jepsen.core: downloading /opt/cockroach/logs/version.txt to version.txt
2022-10-25 10:41:53,073{GMT}    INFO    [jepsen node 10.142.0.36] jepsen.core: downloading /opt/cockroach/logs/version.txt to version.txt
2022-10-25 10:41:53,074{GMT}    INFO    [jepsen node 10.142.1.38] jepsen.core: downloading /opt/cockroach/logs/version.txt to version.txt
2022-10-25 10:41:53,074{GMT}    INFO    [jepsen node 10.142.2.66] jepsen.core: downloading /opt/cockroach/logs/version.txt to version.txt
2022-10-25 10:41:53,080{GMT}    INFO    [jepsen node 10.142.2.2] jepsen.core: downloading /opt/cockroach/logs/trace.pcap to trace.pcap
2022-10-25 10:41:53,080{GMT}    INFO    [jepsen node 10.142.0.36] jepsen.core: downloading /opt/cockroach/logs/trace.pcap to trace.pcap
2022-10-25 10:41:53,083{GMT}    INFO    [jepsen node 10.142.1.38] jepsen.core: downloading /opt/cockroach/logs/trace.pcap to trace.pcap
2022-10-25 10:41:53,084{GMT}    INFO    [jepsen node 10.142.1.115] jepsen.core: downloading /opt/cockroach/logs/version.txt to version.txt
2022-10-25 10:41:53,090{GMT}    INFO    [jepsen node 10.142.2.66] jepsen.core: downloading /opt/cockroach/logs/trace.pcap to trace.pcap
2022-10-25 10:41:53,144{GMT}    INFO    [jepsen node 10.142.1.115] jepsen.core: downloading /opt/cockroach/logs/trace.pcap to trace.pcap
2022-10-25 11:15:23,787{GMT}    INFO    [Jepsen shutdown hook] jepsen.core: Downloading DB logs before JVM shutdown...
srosenberg commented 2 years ago

From the thread dump in invoke.log, this appears to be stuck on scp,

"jepsen node 10.142.2.2" #444 prio=5 os_prio=0 tid=0x00007f04c4008000 nid=0x16bd9 sleeping[0x00007f048b3f1000]
   java.lang.Thread.State: RUNNABLE
        at clj_ssh.ssh$scp_receive_command$fn__1876.invoke(ssh.clj:870)
        at clj_ssh.ssh$scp_receive_command.invokeStatic(ssh.clj:869)
        at clj_ssh.ssh$scp_receive_command.invoke(ssh.clj:864)
        at clj_ssh.ssh$scp_sink.invokeStatic(ssh.clj:980)
        at clj_ssh.ssh$scp_sink.invoke(ssh.clj:977)
        at clj_ssh.ssh$scp_from.invokeStatic(ssh.clj:1094)
        at clj_ssh.ssh$scp_from.doInvoke(ssh.clj:1044)
        at clojure.lang.RestFn.invoke(RestFn.java:445)
        at clojure.lang.AFn.applyToHelper(AFn.java:160)
        at clojure.lang.RestFn.applyTo(RestFn.java:132)
        at clojure.core$apply.invokeStatic(core.clj:648)
        at clojure.core$apply.invoke(core.clj:641)
        at jepsen.control$download$fn__2079.invoke(control.clj:222)
renatolabs commented 2 years ago

Double checked the files here, and I think your analysis makes sense, it seems it hangs on scp until timeout.

aliher1911 commented 2 years ago

It does look like an scp flake to me. Jepsen is not robust in these terms because it was not written with CI usage in mind.

But maybe it's right time to resurrect the changes that I made a year ago and bump the version to make it more stable and reenable more tests?

srosenberg commented 2 years ago

It does look like an scp flake to me. Jepsen is not robust in these terms because it was not written with CI usage in mind.

@aliher1911 Do you happen to know what it could have been copying? From the logs it seems that all tcpdump pcaps have been successfully transferred?

But maybe it's right time to resurrect the changes that I made a year ago and bump the version to make it more stable and reenable more tests?

That does sound promising! Where are those changes?

aliher1911 commented 1 year ago

@aliher1911 Do you happen to know what it could have been copying? From the logs it seems that all tcpdump pcaps have been successfully transferred?

I don't think this is correct. Logging is done prior to downloading the file and there's no pcap file for 10.142.2.2 but there's a log line

INFO [2022-10-25 10:41:53,080] jepsen node 10.142.2.2 - jepsen.core downloading /opt/cockroach/logs/trace.pcap to trace.pcap

and there are a bunch of threads still handling ssh connection to that host. One looks like a control connection and another one trying to read and waiting in at java.io.PipedInputStream.awaitSpace(PipedInputStream.java:273).

If you look on invocation.log there are two thread dumps there, one for the build tool invoking jepsen (first one) and actual jepsen clujure program where those threads reside.

That does sound promising! Where are those changes?

Changes were bumping jepsen version which upgraded dependencies and I think addressed some of the error handling problems. They were not landed because we were close to release and noone picked that up after.

srosenberg commented 1 year ago

We should see an improvement now that prebuilt jar is used [1]. Thanks @aliher1911!

[1] https://github.com/cockroachdb/cockroach/pull/91298