roachtest: cdc/pubsub-sink failed #122972

Closed: cockroach-teamcity closed this issue 1 month ago

cockroach-teamcity commented 5 months ago

roachtest.cdc/pubsub-sink failed with artifacts on release-23.2 @ 8a37c9bfbac9da07c899d4617dbbdb93dd7a5bae:

(latency_verifier.go:199).assertValid: max latency was more than allowed: 1m8.767820312s vs 1m0s
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/cdc/pubsub-sink/run_1
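
For context, the failure above is the latency verifier's steady-state check tripping: the maximum observed end-to-end latency (1m8.767820312s) exceeded the allowed target (1m0s). A minimal sketch of that kind of comparison, using assumed names rather than the actual roachtest code:

```go
package main

import (
	"fmt"
	"time"
)

// assertValid is a hypothetical stand-in for the check in latency_verifier.go:
// fail if the worst steady-state latency seen during the run exceeds the target.
func assertValid(maxSeen, target time.Duration) error {
	if maxSeen > target {
		return fmt.Errorf("max latency was more than allowed: %s vs %s", maxSeen, target)
	}
	return nil
}

func main() {
	maxSeen := 68*time.Second + 767820312*time.Nanosecond // the 1m8.767820312s from the report
	fmt.Println(assertValid(maxSeen, time.Minute))         // non-nil error, so the test fails
}
```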

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/cdc

This test on roachdash | Improve this report!

Jira issue: CRDB-38148

andyyang890 commented 5 months ago

From looking at the logs, it seems like the latency (time behind the changefeed's highwater) briefly crept above 1m, but then it came back down and stabilized around 20s again (see the sketch after the log excerpt below). Doesn't seem concerning.

11:18:20 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 16.283s; max steady latency so far 22.935s; highwater 2024-04-24 11:18:04.705553324 +0000 UTC
11:18:31 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 18.711s; max steady latency so far 22.935s; highwater 2024-04-24 11:18:12.552439962 +0000 UTC
11:18:41 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 28.99s; max steady latency so far 28.99s; highwater 2024-04-24 11:18:12.552439962 +0000 UTC
11:18:51 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 39.267s; max steady latency so far 39.267s; highwater 2024-04-24 11:18:12.552439962 +0000 UTC
11:19:02 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 49.549s; max steady latency so far 49.549s; highwater 2024-04-24 11:18:12.552439962 +0000 UTC
11:19:12 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 59.827s; max steady latency so far 59.827s; highwater 2024-04-24 11:18:12.552439962 +0000 UTC
11:19:22 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 58.491s; max steady latency so far 1m8.05s; highwater 2024-04-24 11:18:24.167487181 +0000 UTC
11:19:32 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 1m8.767s; max steady latency so far 1m8.767s; highwater 2024-04-24 11:18:24.167487181 +0000 UTC
11:19:43 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 14.52s; max steady latency so far 1m8.767s; highwater 2024-04-24 11:19:28.688462825 +0000 UTC
11:19:53 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 13.633s; max steady latency so far 1m8.767s; highwater 2024-04-24 11:19:39.848680442 +0000 UTC
11:20:03 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 12.742s; max steady latency so far 1m8.767s; highwater 2024-04-24 11:19:51.017091921 +0000 UTC
11:20:14 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 16.279s; max steady latency so far 1m8.767s; highwater 2024-04-24 11:19:57.754405899 +0000 UTC
11:20:24 latency_verifier.go:145: test status: watching changefeed[default]: end-to-end steady latency 13.665s; max steady latency so far 1m8.767s; highwater 2024-04-24 11:20:10.644693101 +0000 UTC
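
For reference, the "end-to-end steady latency" in these lines is essentially the wall-clock time minus the changefeed's highwater timestamp. A small sketch of that relationship using the numbers from the 11:19:32 line above (the helper name is an assumption, not the verifier's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// steadyLatency mirrors what the verifier reports: how far the changefeed's
// highwater timestamp lags behind the current wall-clock time.
func steadyLatency(now, highwater time.Time) time.Duration {
	return now.Sub(highwater)
}

func main() {
	// Highwater stuck at 11:18:24.167487181 while wall-clock time keeps advancing
	// (numbers taken from the 11:19:32 log line above).
	highwater := time.Date(2024, time.April, 24, 11, 18, 24, 167487181, time.UTC)
	now := time.Date(2024, time.April, 24, 11, 19, 32, 934487181, time.UTC)
	fmt.Println(steadyLatency(now, highwater)) // 1m8.767s, just over the 1m0s limit
}
```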
andyyang890 commented 5 months ago

Digging a little more, the changefeed frontier, which is running on node 1, emits logs showing that one of the ranges that has fallen behind belongs to table 114 (a rough sketch of this bookkeeping follows the log excerpt below):

I240424 11:18:46.386811 14173 ccl/changefeedccl/changefeed_processors.go:1702 ⋮ [T1,Vsystem,n1,f‹5e9d4042›,job=963007082278420481,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-963007082278420481›] 770  ‹job 963007082278420481› span ‹/Table/114/1/{28/8/-1783/5-33/4/-3001/1}› is behind by 33.834360318s
E240424 11:18:49.115733 564273 (gostd) net/http/server.go:3212 ⋮ [-] 771  ‹http: TLS handshake error from 10.0.0.3:37452: remote error: tls: bad certificate›
I240424 11:18:58.403387 14173 ccl/changefeedccl/changefeed_processors.go:1702 ⋮ [T1,Vsystem,n1,f‹5e9d4042›,job=963007082278420481,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-963007082278420481›] 772  ‹job 963007082278420481› span ‹/Table/114/1/{28/8/-1783/5-33/4/-3001/1}› is behind by 45.850941582s
E240424 11:19:04.115675 576257 (gostd) net/http/server.go:3212 ⋮ [-] 773  ‹http: TLS handshake error from 10.0.0.3:46584: remote error: tls: bad certificate›
I240424 11:19:10.402045 14173 ccl/changefeedccl/changefeed_processors.go:1702 ⋮ [T1,Vsystem,n1,f‹5e9d4042›,job=963007082278420481,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-963007082278420481›] 774  ‹job 963007082278420481› span ‹/Table/114/1/{28/8/-1783/5-33/4/-3001/1}› is behind by 57.849598708s
E240424 11:19:19.113325 587203 (gostd) net/http/server.go:3212 ⋮ [-] 775  ‹http: TLS handshake error from 10.0.0.3:57568: remote error: tls: bad certificate›
I240424 11:19:20.869402 14173 ccl/changefeedccl/changefeed_processors.go:1702 ⋮ [T1,Vsystem,n1,f‹5e9d4042›,job=963007082278420481,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-963007082278420481›] 776  ‹job 963007082278420481› span ‹/Table/114/1/{28/8/-1783/5-33/4/-3001/1}› is behind by 1m8.316955482s
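
Roughly speaking, the frontier keeps a resolved (highwater) timestamp per span and periodically logs spans whose resolved timestamp lags too far behind the current time, which is what produces the "is behind by" lines above. A loose sketch of that idea only; the names, structure, and threshold are assumptions, not the actual changefeed_processors.go code:

```go
package main

import (
	"fmt"
	"time"
)

// frontier is a toy model: it tracks a resolved timestamp per span and logs
// spans that lag the current time by more than a threshold.
type frontier struct {
	resolved map[string]time.Time // span -> resolved (highwater) timestamp
}

func (f *frontier) logLaggingSpans(now time.Time, threshold time.Duration) {
	for span, ts := range f.resolved {
		if lag := now.Sub(ts); lag > threshold {
			fmt.Printf("span %s is behind by %s\n", span, lag)
		}
	}
}

func main() {
	f := &frontier{resolved: map[string]time.Time{
		// Roughly the situation at 11:18:46 in the node 1 log above.
		"/Table/114/1/{28/8/-1783/5-33/4/-3001/1}": time.Date(2024, time.April, 24, 11, 18, 12, 552439962, time.UTC),
	}}
	f.logLaggingSpans(time.Date(2024, time.April, 24, 11, 18, 46, 386811000, time.UTC), 10*time.Second)
}
```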

I also see in the node 2 logs, about 20 seconds later, some range splits being initiated for table 114:

I240424 11:19:43.609450 318397 kv/kvserver/replica_command.go:439 ⋮ [T1,Vsystem,n2,split,s2,r127/2:‹/Table/114/1/{3/10/-…-28/8/-…}›] 367  initiating a split of this range at key /Table/114/1/‹16›/‹4›/‹-1262›/‹1› [r145] (‹512 MiB above threshold size 512 MiB›)
I240424 11:19:45.557978 318840 kv/kvserver/replica_command.go:439 ⋮ [T1,Vsystem,n2,split,s2,r140/2:‹/Table/114/1/{37/4/-…-62/1/-…}›] 368  initiating a split of this range at key /Table/114/1/‹49›/‹8›/‹-2902›/‹6› [r146] (‹512 MiB above threshold size 512 MiB›)

This coincides with when the latency came back down in the test log (11:19:43). Note the split reason in those messages: each range had grown to 512 MiB above the 512 MiB threshold, i.e. roughly 1 GiB, before being split.

Maybe the latency went up (i.e. the changefeed fell behind) because the range got too big, and then it recovered after the range was split?
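
One way to have sanity-checked this while the cluster was still up would have been to look at the range sizes for table 114 directly. A rough sketch of that kind of query (the connection string is assumed, the table name is looked up from crdb_internal.tables since only the ID appears in the logs, and the exact SHOW RANGES options/columns vary by version, hence the generic scan):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Assumed connection string for a local test cluster; substitute the real one.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Map the table ID from the logs (114) back to a table name.
	var name string
	if err := db.QueryRow(
		`SELECT name FROM crdb_internal.tables WHERE table_id = $1`, 114,
	).Scan(&name); err != nil {
		log.Fatal(err)
	}

	// Dump the ranges for that table. Column names differ across versions, so
	// scan everything generically. (Assumes a simple, unquoted table name.)
	rows, err := db.Query(fmt.Sprintf(`SHOW RANGES FROM TABLE %s WITH DETAILS`, name))
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, _ := rows.Columns()
	for rows.Next() {
		vals := make([]interface{}, len(cols))
		ptrs := make([]interface{}, len(cols))
		for i := range vals {
			ptrs[i] = &vals[i]
		}
		if err := rows.Scan(ptrs...); err != nil {
			log.Fatal(err)
		}
		for i, v := range vals {
			if b, ok := v.([]byte); ok {
				vals[i] = string(b) // print raw bytes as text for readability
			}
		}
		fmt.Println(vals)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```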

rharding6373 commented 1 month ago

The artifacts are gone and we haven't seen a recurrence of this specific issue since it originally occurred. I think we'll have to be satisfied with Andy's hypothesis for now.