Closed cockroach-teamcity closed 4 years ago
panic: test timed out after 45m0s CC @danhhz
Subsequent build passed, so maybe something is flaky or the timeout was too low.
I stressraced this test for about an hour (with a 45s timeout instead of a 45m timeout) and couldn't reproduce it.
In the stack trace, I see the test stuck on a TRUNCATE: https://github.com/cockroachdb/cockroach/blob/6d75f105bc8998cbcdf7dd35146eed53175784c2/pkg/ccl/changefeedccl/changefeed_test.go#L864
goroutine 63567 [IO wait, 41 minutes]:
internal/poll.runtime_pollWait(0x7fd20023dea8, 0x72, 0x6a049e0)
/usr/local/go/src/runtime/netpoll.go:182 +0x56
internal/poll.(*pollDesc).wait(0xc0076a3d98, 0x72, 0x700, 0x791, 0xffffffffffffffff)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0xe5
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0076a3d80, 0xc009f56000, 0x791, 0x791, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:169 +0x221
net.(*netFD).Read(0xc0076a3d80, 0xc009f56000, 0x791, 0x791, 0x146, 0x146, 0x141)
/usr/local/go/src/net/fd_unix.go:202 +0x66
net.(*conn).Read(0xc0027c7ed0, 0xc009f56000, 0x791, 0x791, 0xdc729d, 0xc005b97758, 0x7ffff80000000000)
/usr/local/go/src/net/net.go:177 +0xa2
crypto/tls.(*atLeastReader).Read(0xc00e6139a0, 0xc009f56000, 0x791, 0x791, 0xc00e6139a0, 0xc00c445180, 0xc00df0ad18)
/usr/local/go/src/crypto/tls/conn.go:761 +0xa7
bytes.(*Buffer).ReadFrom(0xc005b97758, 0x69fc1e0, 0xc00e6139a0, 0x5ae56a0, 0x6a01d40, 0x0)
/usr/local/go/src/bytes/buffer.go:207 +0x165
crypto/tls.(*Conn).readFromUntil(0xc005b97500, 0x6a01d40, 0xc0027c7ed0, 0x5, 0xc0027c7ed0, 0xc00df0abb8)
/usr/local/go/src/crypto/tls/conn.go:783 +0x222
crypto/tls.(*Conn).readRecordOrCCS(0xc005b97500, 0x5d09b00, 0xc005b97638, 0x0)
/usr/local/go/src/crypto/tls/conn.go:590 +0x2e4
crypto/tls.(*Conn).readRecord(...)
/usr/local/go/src/crypto/tls/conn.go:558
crypto/tls.(*Conn).Read(0xc005b97500, 0xc00d25c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/crypto/tls/conn.go:1236 +0x1f6
bufio.(*Reader).Read(0xc006a23140, 0xc00d1c4860, 0x5, 0x200, 0x5d09b88, 0xc000d407d0, 0xc00dbc2a50)
/usr/local/go/src/bufio/bufio.go:223 +0x7bc
io.ReadAtLeast(0x69fbfe0, 0xc006a23140, 0xc00d1c4860, 0x5, 0x200, 0x5, 0x86, 0x0, 0x0)
/usr/local/go/src/io/io.go:310 +0x96
io.ReadFull(...)
/usr/local/go/src/io/io.go:329
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).recvMessage(0xc00d1c4840, 0xc001340c58, 0x1, 0xc00d1c4860, 0xc00dbc2c00)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:929 +0x236
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).recv1Buf(0xc00d1c4840, 0xc001340c58, 0x1fb)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:978 +0x47
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).recv1(...)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:999
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).simpleExec(0xc00d1c4840, 0x5bd382b, 0x1f, 0xc00dbc2d50, 0x0, 0x0, 0xc00dbc2d00, 0xcc64cc, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:603 +0x28d
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).Exec(0xc00d1c4840, 0x5bd382b, 0x1f, 0xa36a000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:857 +0x2bf
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).ExecContext(0xc00d1c4840, 0x6a767c0, 0xc000104010, 0x5bd382b, 0x1f, 0xa36a000, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn_go18.go:42 +0x226
database/sql.ctxDriverExec(0x6a767c0, 0xc000104010, 0x7fd20024e8e0, 0xc00d1c4840, 0x0, 0x0, 0x5bd382b, 0x1f, 0xa36a000, 0x0, ...)
/usr/local/go/src/database/sql/ctxutil.go:31 +0x2e1
database/sql.(*DB).execDC.func2()
/usr/local/go/src/database/sql/sql.go:1467 +0x2c3
database/sql.withLock(0x6a2f240, 0xc0076a3e00, 0xc00dbc3150)
/usr/local/go/src/database/sql/sql.go:3097 +0x75
database/sql.(*DB).execDC(0xc008cde6c0, 0x6a767c0, 0xc000104010, 0xc0076a3e00, 0xc00dbc3280, 0x5bd382b, 0x1f, 0x0, 0x0, 0x0, ...)
/usr/local/go/src/database/sql/sql.go:1462 +0x4d9
database/sql.(*DB).exec(0xc008cde6c0, 0x6a767c0, 0xc000104010, 0x5bd382b, 0x1f, 0x0, 0x0, 0x0, 0xc00dbc3301, 0xdde0ff, ...)
/usr/local/go/src/database/sql/sql.go:1447 +0x175
database/sql.(*DB).ExecContext(0xc008cde6c0, 0x6a767c0, 0xc000104010, 0x5bd382b, 0x1f, 0x0, 0x0, 0x0, 0x55ad5c0, 0xc00e6138e0, ...)
/usr/local/go/src/database/sql/sql.go:1425 +0xef
github.com/cockroachdb/cockroach/pkg/testutils/sqlutils.(*SQLRunner).Exec(0xc00dbc3518, 0x6afd060, 0xc000d07600, 0x5bd382b, 0x1f, 0x0, 0x0, 0x0, 0x0, 0x6a94040)
/go/src/github.com/cockroachdb/cockroach/pkg/testutils/sqlutils/sql_runner.go:52 +0x100
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.TestChangefeedTruncateRenameDrop.func1(0xc000d07600, 0xc008cde6c0, 0x6a2f740, 0xc00d3a2500)
/go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_test.go:864 +0x48f
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.enterpriseTest.func1(0xc000d07600)
/go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/helpers_test.go:266 +0x847
testing.tRunner(0xc000d07600, 0xc002f99950)
/usr/local/go/src/testing/testing.go:865 +0x164
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:916 +0x65b
Yeah, there are thousands of lines like this in the logs:
I191014 14:44:10.368889 63982 sql/sqlbase/structured.go:1516 [n1,client=127.0.0.1:45572,user=root] publish: descID=54 (truncate_cascade) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
I191014 14:44:11.305333 63982 sql/sqlbase/structured.go:1516 [n1,client=127.0.0.1:45572,user=root] publish: descID=53 (truncate) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
@aayushshah15 have you tried roachprod-stressrace? It's not frequent on master, but it did happen a few times, so we need to keep pushing (timeouts are really shitty). There's a chance you have to tweak the -p flag in STRESSFLAGS, or try stressing the package instead of the particular test. If that doesn't bear fruit, the next thing to do is to instrument the log messages better so that we can figure out something new the next time it does happen.
https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_UnitTests&testNameId=-8759347941440480963&tab=testDetails is the history of that test, fwiw. @ajwerner I think you stressed that test a while ago when it was flaky, any wisdom to share with @aayushshah15?
I would not be stunned if this is fixed by the bug fix that lives in #41842. I'll take this from @aayushshah15.
Here's my best guess. When I did https://github.com/cockroachdb/cockroach/pull/40581 I forgot to fix https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/truncate.go#L193. This is problematic because it means that the TRUNCATE transaction cannot be pushed. I'm not clear on why we only sometimes hit this problem. Maybe it has something to do with bad timing in coordination with reading either the jobs or descriptors tables. I verified that the same behavior occurs if I set a very short closed timestamp duration at the beginning of the test. I'm going to address this by fixing the observation of the commit timestamp in TRUNCATE TABLE (which should have been done in 19.2; I might advocate for a backport, depending on how invasive it gets).
This isn't a changefeed bug per se.
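A toy model may help show why a transaction that cannot be pushed could spin forever when the closed timestamp duration is short. This is only a sketch of the hypothesis above, not CockroachDB code. The integer clock, the attemptsToCommit function, and all of its parameters are assumptions. It also assumes that a transaction must commit above the closed timestamp, and that a transaction with a fixed (observed) commit timestamp has to restart rather than move forward.

```go
package main

import "fmt"

// attemptsToCommit models a transaction on a toy integer clock. Each attempt
// picks its timestamp at the start, does `work` ticks of work, and then must
// commit above the closed timestamp, which trails "now" by `target` ticks.
// A pushable transaction that falls behind is simply moved forward and
// commits. A transaction with a fixed timestamp has to restart instead.
// It returns the number of attempts used, or -1 if maxAttempts is exhausted.
func attemptsToCommit(work, target int, fixedTimestamp bool, maxAttempts int) int {
	now := 0
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ts := now // timestamp chosen when the attempt starts
		now += work
		closed := now - target
		if ts > closed {
			return attempt // still above the closed timestamp, commits as is
		}
		if !fixedTimestamp {
			return attempt // pushed to closed+1, then commits
		}
		// Fixed timestamp and already below the closed timestamp: restart.
	}
	return -1
}

func main() {
	// Work takes longer than the closed timestamp target (5 ticks vs 3).
	fmt.Println(attemptsToCommit(5, 3, false, 1000)) // prints 1
	fmt.Println(attemptsToCommit(5, 3, true, 1000))  // prints -1
	// Work fits inside the target, so even a fixed timestamp commits.
	fmt.Println(attemptsToCommit(2, 3, true, 1000)) // prints 1
}
```

In this model the fixed-timestamp case only fails when an attempt takes longer than the closed timestamp target. That would be consistent with the failure being rare under normal settings and easy to reproduce with a very short duration.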
The following tests appear to have failed on release-19.2 (testrace): TestChangefeedTruncateRenameDrop, TestChangefeedTruncateRenameDrop/sinkless, TestChangefeedTruncateRenameDrop/enterprise
You may want to check for open issues.
#1537922:
Please assign, take a look and update the issue accordingly.