cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.15k stars 3.81k forks source link

teamcity: failed test: TestChangefeedTruncateRenameDrop #41566

Closed cockroach-teamcity closed 4 years ago

cockroach-teamcity commented 5 years ago

The following tests appear to have failed on release-19.2 (testrace): TestChangefeedTruncateRenameDrop, TestChangefeedTruncateRenameDrop/sinkless, TestChangefeedTruncateRenameDrop/enterprise

You may want to check for open issues.

#1537922:

TestChangefeedTruncateRenameDrop/enterprise
... minutes]:
database/sql.(*DB).connectionOpener(0xc00db42a80, 0x6a76780, 0xc00accc080)
    /usr/local/go/src/database/sql/sql.go:1000 +0x140
created by database/sql.OpenDB
    /usr/local/go/src/database/sql/sql.go:670 +0x2b4

goroutine 63441 [chan receive]:
github.com/cockroachdb/cockroach/pkg/storage/closedts/provider.(*Provider).Notify.func1(0x6a76840, 0xc001a40510)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/closedts/provider/provider.go:217 +0x1ae
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc006cd27e0, 0xc00abe86e0, 0xc0092790a0)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0x160
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xc4

goroutine 63696 [sync.Cond.Wait]:
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:307
sync.runtime_notifyListWait(0xc008a37890, 0xc000025ea0)
    /usr/local/go/src/runtime/sema.go:510 +0xf9
sync.(*Cond).Wait(0xc008a37880)
    /usr/local/go/src/sync/cond.go:56 +0x8e
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc0010314d0, 0x6a76840, 0xc001a40c00)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:192 +0x9c
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x6a76840, 0xc001a40c00)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:161 +0x56
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc0068ccbf0, 0xc00abe86e0, 0xc0068ccbe0)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0x160
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xc4

goroutine 65120 [chan receive, 41 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).RangeFeed(0xc00465a900, 0xc00c00b900, 0x6abd9c0, 0xc00c430ec0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_rangefeed.go:229 +0x90d
github.com/cockroachdb/cockroach/pkg/storage.(*Store).RangeFeed(0xc00147bc00, 0xc00c00b900, 0x6abd9c0, 0xc00c430ec0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2213 +0x4c6
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).RangeFeed(0xc007531550, 0xc00c00b900, 0x6abd9c0, 0xc00c430ec0, 0x5ab6980)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:206 +0x24b
github.com/cockroachdb/cockroach/pkg/server.(*Node).RangeFeed(0xc0089d6a00, 0xc00c00b900, 0x6abd9c0, 0xc00c430ec0, 0x6a76780, 0xc002a96300)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:1033 +0x7e
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.RangeFeed.func1(0xc00a2fb870, 0x6a30100, 0xc0089d6a00, 0xc00c00b900, 0x6a76780, 0xc002a96480, 0xc00c356240, 0xc00c3562a0)
    /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:610 +0xcc
created by github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.RangeFeed
    /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:608 +0x12e

goroutine 63745 [select, 41 minutes]:
github.com/cockroachdb/cockroach/pkg/server.periodicallyPersistHLCUpperBound(0xc00129f4f0, 0xc003923340, 0xc0068cd900, 0x5d09da8, 0xc008a8a480, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1030 +0x1f8
github.com/cockroachdb/cockroach/pkg/server.(*Server).startPersistingHLCUpperBound.func2(0x6a76840, 0xc0018b5f80)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1104 +0xde
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc0068cd910, 0xc00abe86e0, 0xc002910690)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0x160
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xc4
I191014 14:23:02.501369 1 rand.go:85  Random seed: -3946147195552034504

TestChangefeedTruncateRenameDrop
--- FAIL: testrace/TestChangefeedTruncateRenameDrop (0.000s)
Test ended in panic.

TestChangefeedTruncateRenameDrop/sinkless
...is range at key /Table/60 [r30] (zone config)
I191014 14:26:51.237618 62064 sql/event_log.go:130  [n1,client=127.0.0.1:43882,user=root] Event: "create_table", target: 60, info: {TableName:d.public.drop Statement:CREATE TABLE drop (a INT8 PRIMARY KEY) User:root}
I191014 14:26:51.954272 62064 sql/sqlbase/structured.go:1516  [n1,client=127.0.0.1:43882,user=root] publish: descID=60 (drop) version=2 mtime=1970-01-01 00:00:00 +0000 UTC
I191014 14:26:52.156068 62064 sql/event_log.go:130  [n1,client=127.0.0.1:43882,user=root] Event: "drop_table", target: 60, info: {TableName:d.public.drop Statement:DROP TABLE drop User:root CascadeDroppedViews:[]}
I191014 14:26:52.225290 62064 sql/sqlbase/structured.go:1516  [n1,client=127.0.0.1:43882,user=root,scExec] publish: descID=60 (drop) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
I191014 14:26:52.749493 61341 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] storage.IntentResolver: cleanup txn records
13     gossip update (merge)
10     gossip update (split)
1      [async] transport racer
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
I191014 14:26:52.749900 61341 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] storage.IntentResolver: cleanup txn records
13     gossip update (merge)
10     gossip update (split)
1      [async] transport racer
1      [async] intent_resolver_ir_batcher
1      [async] closedts-rangefeed-subscriber
W191014 14:26:52.750060 63439 storage/intentresolver/intent_resolver.go:818  failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
I191014 14:26:52.755778 61528 storage/queue.go:520  [n1,s1] rate limited in gossip update (merge): node unavailable; try another peer
I191014 14:26:52.772680 61388 kv/transport_race.go:108  transport race promotion: ran 18 iterations on up to 1120 requests
I191014 14:26:52.773013 61341 util/stop/stopper.go:542  quiescing; tasks left:
5      gossip update (split)
3      gossip update (merge)
1      [async] transport racer
1      [async] closedts-rangefeed-subscriber
I191014 14:26:52.774161 61341 util/stop/stopper.go:542  quiescing; tasks left:
4      gossip update (split)
3      gossip update (merge)
1      [async] transport racer
1      [async] closedts-rangefeed-subscriber
I191014 14:26:52.774449 61341 util/stop/stopper.go:542  quiescing; tasks left:
4      gossip update (split)
2      gossip update (merge)
1      [async] transport racer
1      [async] closedts-rangefeed-subscriber
I191014 14:26:52.774709 61341 util/stop/stopper.go:542  quiescing; tasks left:
4      gossip update (split)
2      gossip update (merge)
1      [async] closedts-rangefeed-subscriber
I191014 14:26:52.774988 61341 util/stop/stopper.go:542  quiescing; tasks left:
3      gossip update (split)
2      gossip update (merge)
1      [async] closedts-rangefeed-subscriber
I191014 14:26:52.775233 61341 util/stop/stopper.go:542  quiescing; tasks left:
3      gossip update (split)
1      gossip update (merge)
1      [async] closedts-rangefeed-subscriber
I191014 14:26:52.775539 61341 util/stop/stopper.go:542  quiescing; tasks left:
2      gossip update (split)
1      gossip update (merge)
1      [async] closedts-rangefeed-subscriber
I191014 14:26:52.775849 61341 util/stop/stopper.go:542  quiescing; tasks left:
1      gossip update (split)
1      gossip update (merge)
1      [async] closedts-rangefeed-subscriber
I191014 14:26:52.776037 61528 storage/queue.go:520  [n1,s1] rate limited in gossip update (split): node unavailable; try another peer
I191014 14:26:52.776500 61341 util/stop/stopper.go:542  quiescing; tasks left:
1      gossip update (merge)
1      [async] closedts-rangefeed-subscriber
I191014 14:26:52.777106 61341 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-rangefeed-subscriber

Please assign, take a look and update the issue accordingly.

RaduBerinde commented 5 years ago

panic: test timed out after 45m0s CC @danhhz

RaduBerinde commented 5 years ago

Subsequent build passed, so maybe something is flaky or the timeout was too low.

aayushshah15 commented 5 years ago

I stressraced this test for about an hour (with a 45s timeout instead of a 45m timeout) and couldn't reproduce it.

danhhz commented 5 years ago

In the stacktrace, I see it the test stuck on a TRUNCATE https://github.com/cockroachdb/cockroach/blob/6d75f105bc8998cbcdf7dd35146eed53175784c2/pkg/ccl/changefeedccl/changefeed_test.go#L864

goroutine 63567 [IO wait, 41 minutes]:
internal/poll.runtime_pollWait(0x7fd20023dea8, 0x72, 0x6a049e0)
        /usr/local/go/src/runtime/netpoll.go:182 +0x56
internal/poll.(*pollDesc).wait(0xc0076a3d98, 0x72, 0x700, 0x791, 0xffffffffffffffff)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0xe5
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0076a3d80, 0xc009f56000, 0x791, 0x791, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:169 +0x221
net.(*netFD).Read(0xc0076a3d80, 0xc009f56000, 0x791, 0x791, 0x146, 0x146, 0x141)
        /usr/local/go/src/net/fd_unix.go:202 +0x66
net.(*conn).Read(0xc0027c7ed0, 0xc009f56000, 0x791, 0x791, 0xdc729d, 0xc005b97758, 0x7ffff80000000000)
        /usr/local/go/src/net/net.go:177 +0xa2
crypto/tls.(*atLeastReader).Read(0xc00e6139a0, 0xc009f56000, 0x791, 0x791, 0xc00e6139a0, 0xc00c445180, 0xc00df0ad18)
        /usr/local/go/src/crypto/tls/conn.go:761 +0xa7
bytes.(*Buffer).ReadFrom(0xc005b97758, 0x69fc1e0, 0xc00e6139a0, 0x5ae56a0, 0x6a01d40, 0x0)
        /usr/local/go/src/bytes/buffer.go:207 +0x165
crypto/tls.(*Conn).readFromUntil(0xc005b97500, 0x6a01d40, 0xc0027c7ed0, 0x5, 0xc0027c7ed0, 0xc00df0abb8)
        /usr/local/go/src/crypto/tls/conn.go:783 +0x222
crypto/tls.(*Conn).readRecordOrCCS(0xc005b97500, 0x5d09b00, 0xc005b97638, 0x0)
        /usr/local/go/src/crypto/tls/conn.go:590 +0x2e4
crypto/tls.(*Conn).readRecord(...)
        /usr/local/go/src/crypto/tls/conn.go:558
crypto/tls.(*Conn).Read(0xc005b97500, 0xc00d25c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/crypto/tls/conn.go:1236 +0x1f6
bufio.(*Reader).Read(0xc006a23140, 0xc00d1c4860, 0x5, 0x200, 0x5d09b88, 0xc000d407d0, 0xc00dbc2a50)
        /usr/local/go/src/bufio/bufio.go:223 +0x7bc
io.ReadAtLeast(0x69fbfe0, 0xc006a23140, 0xc00d1c4860, 0x5, 0x200, 0x5, 0x86, 0x0, 0x0)
        /usr/local/go/src/io/io.go:310 +0x96
io.ReadFull(...)
        /usr/local/go/src/io/io.go:329
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).recvMessage(0xc00d1c4840, 0xc001340c58, 0x1, 0xc00d1c4860, 0xc00dbc2c00)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:929 +0x236
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).recv1Buf(0xc00d1c4840, 0xc001340c58, 0x1fb)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:978 +0x47
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).recv1(...)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:999
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).simpleExec(0xc00d1c4840, 0x5bd382b, 0x1f, 0xc00dbc2d50, 0x0, 0x0, 0xc00dbc2d00, 0xcc64cc, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:603 +0x28d
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).Exec(0xc00d1c4840, 0x5bd382b, 0x1f, 0xa36a000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:857 +0x2bf
[15:17:29] :                     [TestChangefeedTruncateRenameDrop/enterprise] [Test Output]
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).ExecContext(0xc00d1c4840, 0x6a767c0, 0xc000104010, 0x5bd382b, 0x1f, 0xa36a000, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn_go18.go:42 +0x226
database/sql.ctxDriverExec(0x6a767c0, 0xc000104010, 0x7fd20024e8e0, 0xc00d1c4840, 0x0, 0x0, 0x5bd382b, 0x1f, 0xa36a000, 0x0, ...)
        /usr/local/go/src/database/sql/ctxutil.go:31 +0x2e1
database/sql.(*DB).execDC.func2()
        /usr/local/go/src/database/sql/sql.go:1467 +0x2c3
database/sql.withLock(0x6a2f240, 0xc0076a3e00, 0xc00dbc3150)
        /usr/local/go/src/database/sql/sql.go:3097 +0x75
database/sql.(*DB).execDC(0xc008cde6c0, 0x6a767c0, 0xc000104010, 0xc0076a3e00, 0xc00dbc3280, 0x5bd382b, 0x1f, 0x0, 0x0, 0x0, ...)
        /usr/local/go/src/database/sql/sql.go:1462 +0x4d9
database/sql.(*DB).exec(0xc008cde6c0, 0x6a767c0, 0xc000104010, 0x5bd382b, 0x1f, 0x0, 0x0, 0x0, 0xc00dbc3301, 0xdde0ff, ...)
        /usr/local/go/src/database/sql/sql.go:1447 +0x175
database/sql.(*DB).ExecContext(0xc008cde6c0, 0x6a767c0, 0xc000104010, 0x5bd382b, 0x1f, 0x0, 0x0, 0x0, 0x55ad5c0, 0xc00e6138e0, ...)
        /usr/local/go/src/database/sql/sql.go:1425 +0xef
github.com/cockroachdb/cockroach/pkg/testutils/sqlutils.(*SQLRunner).Exec(0xc00dbc3518, 0x6afd060, 0xc000d07600, 0x5bd382b, 0x1f, 0x0, 0x0, 0x0, 0x0, 0x6a94040)
        /go/src/github.com/cockroachdb/cockroach/pkg/testutils/sqlutils/sql_runner.go:52 +0x100
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.TestChangefeedTruncateRenameDrop.func1(0xc000d07600, 0xc008cde6c0, 0x6a2f740, 0xc00d3a2500)
        /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_test.go:864 +0x48f
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.enterpriseTest.func1(0xc000d07600)
        /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/helpers_test.go:266 +0x847
testing.tRunner(0xc000d07600, 0xc002f99950)
        /usr/local/go/src/testing/testing.go:865 +0x164
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:916 +0x65b
danhhz commented 5 years ago

Yeah, there's thousands of lines of this in the logs

I191014 14:44:10.368889 63982 sql/sqlbase/structured.go:1516  [n1,client=127.0.0.1:45572,user=root] publish: descID=54 (truncate_cascade) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
I191014 14:44:11.305333 63982 sql/sqlbase/structured.go:1516  [n1,client=127.0.0.1:45572,user=root] publish: descID=53 (truncate) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
tbg commented 5 years ago

@aayushshah15 have you tried roachprod-stressrace? It's not frequent on master, but did happen a few times, so I we need to keep pushing (timeouts are really shitty). There's a chance you have to tweak the -p flag to STRESSFLAGS, or try stressing the package instead of the particular test. If that doesn't bear fruit, the next thing to do is to instrument the log messages better so that we can figure out something new the next time it does happen.

https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_UnitTests&testNameId=-8759347941440480963&tab=testDetails is the history of that test, fwiw. @ajwerner I think you stressed that test a while ago when it was flaky, any wisdom to share with @aayushshah15?

ajwerner commented 5 years ago

I would not be stunned if this is fixed by the bug fix that lives in #41842. I'll take this from @aayushshah15.

ajwerner commented 5 years ago

Here's my best guess. When I did https://github.com/cockroachdb/cockroach/pull/40581 I forgot to fix https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/truncate.go#L193. This is problematic because it means that TRUNCATE cannot be pushed. I'm not clear on why we sometimes hit this problem. Maybe it has something to do with bad timing in coordination with reading either the jobs or descriptors tables. I verified that the same behavior occurs if I set a very short closed timestamp duration at the beginning of the test. I'm going to address this by fixing the observation of the commit timnestamp in TRUNCATE TABLE (which should have been done in 19.2 and I might try to advocate for a backport depending on how invasive it gets).

This isn't a changefeed bug per se.