Closed: cockroach-teamcity closed this issue 1 year ago
ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:
=== RUN TestStreamingRegionalConstraint
test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/74f3eeb4519148f71482aeda8a540383/logTestStreamingRegionalConstraint2541785297
test_log_scope.go:81: use -show-logs to present logs inline
pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
ts, ... := serverutils.StartServer(t, ...)
defer ts.Stopper().Stop(...)
to:
srv, ... := serverutils.StartServer(t, ...)
defer srv.Stopper().Stop(...)
ts := srv.ApplicationLayer()
See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.
pkg/ccl/streamingccl/streamingest/replication_stream_e2e_test.go:1221: (TestStreamingRegionalConstraint)
NOTICE: .DB() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().DB() instead.
TIP: consider replacing the test server initialization from:
ts, ... := serverutils.StartServer(t, ...)
defer ts.Stopper().Stop(...)
to:
srv, ... := serverutils.StartServer(t, ...)
defer srv.Stopper().Stop(...)
ts := srv.ApplicationLayer()
See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
replication_stream_e2e_test.go:1220: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 3
panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/74f3eeb4519148f71482aeda8a540383/logTestStreamingRegionalConstraint2541785297
--- FAIL: TestStreamingRegionalConstraint (64.20s)
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:
=== RUN TestStreamingRegionalConstraint
test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint2950345968
test_log_scope.go:81: use -show-logs to present logs inline
pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
ts, ... := serverutils.StartServer(t, ...)
defer ts.Stopper().Stop(...)
to:
srv, ... := serverutils.StartServer(t, ...)
defer srv.Stopper().Stop(...)
ts := srv.ApplicationLayer()
See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.
pkg/ccl/streamingccl/streamingest/replication_stream_e2e_test.go:1221: (TestStreamingRegionalConstraint)
NOTICE: .DB() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().DB() instead.
TIP: consider replacing the test server initialization from:
ts, ... := serverutils.StartServer(t, ...)
defer ts.Stopper().Stop(...)
to:
srv, ... := serverutils.StartServer(t, ...)
defer srv.Stopper().Stop(...)
ts := srv.ApplicationLayer()
See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
replication_stream_e2e_test.go:1220: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 3
panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint2950345968
--- FAIL: TestStreamingRegionalConstraint (123.16s)
Parameters: TAGS=bazel,gss,deadlock, stress=true
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:
=== RUN TestStreamingRegionalConstraint
test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint48417678
test_log_scope.go:81: use -show-logs to present logs inline
pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
ts, ... := serverutils.StartServer(t, ...)
defer ts.Stopper().Stop(...)
to:
srv, ... := serverutils.StartServer(t, ...)
defer srv.Stopper().Stop(...)
ts := srv.ApplicationLayer()
See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.
pkg/ccl/streamingccl/streamingest/replication_stream_e2e_test.go:1221: (TestStreamingRegionalConstraint)
NOTICE: .DB() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().DB() instead.
TIP: consider replacing the test server initialization from:
ts, ... := serverutils.StartServer(t, ...)
defer ts.Stopper().Stop(...)
to:
srv, ... := serverutils.StartServer(t, ...)
defer srv.Stopper().Stop(...)
ts := srv.ApplicationLayer()
See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
replication_stream_e2e_test.go:1220: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 2
panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint48417678
--- FAIL: TestStreamingRegionalConstraint (52.66s)
Parameters: TAGS=bazel,gss, stress=true
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
@stevendanna a git bisect revealed that https://github.com/cockroachdb/cockroach/pull/111178 causes
./dev test pkg/ccl/streamingccl/streamingest -f TestStreamingRegionalConstraint --stress --cpus=8 --count=20
to consistently fail on my gce worker. If I run the same command on the previous commit, the stress test passes without a problem. Do you have any intuition on why the splits PR could affect our ability to replicate span configs?
The most benign explanation of this regression: the initial splits cause an occasional test run to suffer more severe hardware starvation. When I modify the test to relax the time requirement for observing the replicated span configs by 5x, the test passes under stress for 100 runs on my gce worker:
On 4 parallel cpus:
//pkg/ccl/streamingccl/streamingest:streamingest_test PASSED in 67.0s
Stats over 100 runs: max = 67.0s, min = 7.2s, avg = 8.8s, dev = 5.9s
On 8 parallel cpus:
//pkg/ccl/streamingccl/streamingest:streamingest_test PASSED in 202.9s
Stats over 100 runs: max = 202.9s, min = 9.2s, avg = 28.8s, dev = 39.7s
But with 12 parallel cpus, a test will fail after 3m45s. All of this suggests that hardware exhaustion correlates with this test's timeout. We still ought to figure out how to speed up this test.
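For reference, here is a minimal sketch of what the relaxed check could look like, assuming the condition boils down to "every replica of the replicated table lives on the region=mars node (n1)" and that it is wrapped in the usual testutils retry helper. The query, key filter, and function name below are illustrative, not the test's actual code:

package streamingest_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
	"github.com/cockroachdb/errors"
	"github.com/lib/pq"
)

// checkRegionalConstraintRelaxed is a hypothetical sketch of the relaxed
// check, not the real test's code. It polls for up to 5x the default 45s
// SucceedsSoon window instead of the default.
func checkRegionalConstraintRelaxed(t *testing.T, sqlDB *sqlutils.SQLRunner) {
	testutils.SucceedsWithin(t, func() error {
		// Illustrative query: list the replicas of every range in the
		// destination tenant's keyspace and fail if any replica lives off n1.
		rows := sqlDB.Query(t, `
SELECT range_id, replicas
  FROM crdb_internal.ranges_no_leases
 WHERE start_pretty LIKE '/Tenant/2%'`)
		defer rows.Close()
		for rows.Next() {
			var rangeID int
			var replicas pq.Int64Array
			if err := rows.Scan(&rangeID, &replicas); err != nil {
				return err
			}
			for _, nodeID := range replicas {
				if nodeID != 1 {
					return errors.Newf("r%d: found table data located on another node %d", rangeID, nodeID)
				}
			}
		}
		return rows.Err()
	}, 5*testutils.DefaultSucceedsSoonDuration)
}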
Just noticed something fishy after bumping the vmodule on ingest_span_configs.go. Before I explain that, recall the timeline of this test.
On a green run, we observe the flushing of an incremental span config update, which is presumably the region config update:
❯ grep "ingest_span" out.txt
I231004 18:12:19.029687 21135 ccl/streamingccl/streamingest/ingest_span_configs.go:108 [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 704 initialized span config ingestor
I231004 18:12:19.042198 21283 ccl/streamingccl/streamingest/ingest_span_configs.go:255 [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 708 flushing span config 45 updates and 0 deletes
I231004 18:12:19.042242 21283 ccl/streamingccl/streamingest/ingest_span_configs.go:307 [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 709 flushing full span configuration state (45 records)
I231004 18:12:20.193548 21283 ccl/streamingccl/streamingest/ingest_span_configs.go:255 [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 718 flushing span config 1 updates and 0 deletes
On a failed run, we don't observe the incremental update on the destination side. But we do see the initial flush of the span config state:
❯ grep "ingest_span" streamingesttest.log
I231004 18:07:10.480362 55 util/log/file_sync_buffer.go:238 ⋮ [config] arguments: [‹/home/michaelbutler/.cache/bazel/_bazel_michaelbutler/338fb3aeb5cca80cb3648387a412fd28/sandbox/linux-sandbox/3802/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test› ‹-test.timeout=895s› ‹-vmodule=ingest_span_configs=2›]
I231004 18:07:21.285026 23789 ccl/streamingccl/streamingest/ingest_span_configs.go:108 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905617212680601603›] 698 initialized span config ingestor
I231004 18:07:21.319160 24050 ccl/streamingccl/streamingest/ingest_span_configs.go:255 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905617212680601603›] 702 flushing span config 46 updates and 0 deletes
I231004 18:07:21.319200 24050 ccl/streamingccl/streamingest/ingest_span_configs.go:307 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905617212680601603›] 703 flushing full span configuration state (46 records)
So, for whatever reason, the span config event stream has not sent over either the update or a subsequent checkpoint that would induce a destination-side flush.
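To spell out why the missing checkpoint matters: updates accumulate on the destination and only get written out when a checkpoint event triggers a flush. A schematic sketch of that behavior (types and names hypothetical, not the real ingestor code):

package main

import "fmt"

// spanConfigUpdate is a stand-in for the real replicated span config event.
type spanConfigUpdate string

// ingestBuffer is a hypothetical model of the behavior described above:
// updates accumulate until a checkpoint triggers a flush, so if the source
// never sends the update or a later checkpoint, nothing new ever reaches the
// destination's span config table.
type ingestBuffer struct {
	pending []spanConfigUpdate
}

func (b *ingestBuffer) onUpdate(u spanConfigUpdate) {
	b.pending = append(b.pending, u)
}

func (b *ingestBuffer) onCheckpoint(flush func([]spanConfigUpdate)) {
	flush(b.pending)
	b.pending = b.pending[:0]
}

func main() {
	var flushed int
	b := &ingestBuffer{}
	b.onUpdate("constraint update [+region=mars]")
	// Without this checkpoint, the update above would sit in the buffer forever.
	b.onCheckpoint(func(us []spanConfigUpdate) { flushed += len(us) })
	fmt.Println("flushed", flushed, "updates")
}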
Alright, I've confirmed with a bit more logging that the regional constraint was replicated and flushed to the destination-side span config table (see below). The remaining mystery, which is outside of c2c land: why does it take so long to observe the regional constraints on the destination side?
I231004 19:16:43.593958 66 util/log/file_sync_buffer.go:238 ⋮ [config] arguments: [‹/home/michaelbutler/.cache/bazel/_bazel_michaelbutler/338fb3aeb5cca80cb3648387a412fd28/sandbox/linux-sandbox/5529/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test› ‹-test.timeout=895s› ‹-vmodule=ingest_span_configs=2,span_config_event_stream=2›]
I231004 19:16:54.714223 24378 ccl/streamingccl/streamingest/ingest_span_configs.go:108 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 716 initialized span config ingestor
I231004 19:16:54.741312 24567 ccl/streamingccl/streamproducer/span_config_event_stream.go:186 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 718 observed complete scan
I231004 19:16:54.741400 24565 ccl/streamingccl/streamproducer/span_config_event_stream.go:267 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 719 observed constraint update [‹+region=mars›]
I231004 19:16:54.742470 24565 ccl/streamingccl/streamproducer/span_config_event_stream.go:274 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 720 checkpointing span config stream at 2023-10-04 19:16:54.740267269 +0000 UTC
I231004 19:16:54.742510 24565 ccl/streamingccl/streamproducer/span_config_event_stream.go:276 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 721 sending 49 span config events
I231004 19:16:54.758117 24466 ccl/streamingccl/streamingest/ingest_span_configs.go:255 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 723 flushing span config 46 updates and 0 deletes
I231004 19:16:54.758150 24466 ccl/streamingccl/streamingest/ingest_span_configs.go:307 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 724 flushing full span configuration state (46 records)
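For context, the span config in these logs (num_replicas:1, constraints [+region=mars]) corresponds to a zone configuration along the lines of the following. This is a hedged sketch: the database name and connection details are made up, and the real test applies the config through its own test-cluster harness rather than a standalone program:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical connection string for illustration only.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A zone config of this shape produces the span config seen in the logs
	// above: a single replica constrained to region=mars.
	if _, err := db.Exec(
		`ALTER DATABASE test CONFIGURE ZONE USING num_replicas = 1, constraints = '[+region=mars]'`,
	); err != nil {
		log.Fatal(err)
	}
}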
I have one theory for why the splits PR could be causing this test to slow down: according to the test logs, we issue one initial split at the tenant start key:
❯ cat out.txt | grep "splitting and scattering"
I231004 19:16:54.651275 22316 ccl/streamingccl/streamingest/stream_ingestion_dist.go:362 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 712 splitting and scattering at /Tenant/2
I would expect this split to be a no-op, since I believe a split already exists at the tenant start key. More interestingly, we also call scatter on this tenant's key space, which will induce some allocator work. If some of this allocator work is asynchronous, then the span config replication stream will begin while the allocator is still handling the scatter, and the work to apply the regional constraint would get queued in the allocator behind this scatter request.
Another thought, with no evidence: perhaps admission control is throttling the allocator more after it handled the admin scatter request.
Okie dokie, my theory above has some weight: commenting out the scatter request in stream_ingestion_dist.go erases the regression.
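Rather than commenting the call out by hand, one way to make that experiment repeatable would be a testing knob that skips the initial scatter. A rough sketch, with a made-up knob name and wrapper rather than the actual stream_ingestion_dist.go code:

package main

import (
	"context"
	"fmt"
)

// streamIngestionTestingKnobs is a hypothetical knob struct; the real code's
// knobs (if any) may look different.
type streamIngestionTestingKnobs struct {
	SkipInitialScatter bool
}

// maybeSplitAndScatter wraps the split-and-scatter step so the scatter can be
// skipped in tests, isolating whether scatter-induced allocator work is what
// delays applying the regional constraint.
func maybeSplitAndScatter(
	ctx context.Context, knobs *streamIngestionTestingKnobs, splitAndScatter func(context.Context) error,
) error {
	if knobs != nil && knobs.SkipInitialScatter {
		return nil
	}
	return splitAndScatter(ctx)
}

func main() {
	knobs := &streamIngestionTestingKnobs{SkipInitialScatter: true}
	err := maybeSplitAndScatter(context.Background(), knobs, func(context.Context) error {
		fmt.Println("splitting and scattering at /Tenant/2")
		return nil
	})
	fmt.Println("err:", err)
}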
Per @kvoli's advice, I took a look at the kv distribution logs of a failed run. I noticed that after the scatter request and the span config update, the allocator attempts to transfer the tenant's range to s1 (perhaps to obey the regional constraint), but the attempt gets rejected; after that, there's nothing in the logs for 20 seconds.
I231005 15:42:37.155598 24698 13@kv/kvserver/allocator/plan/replicate.go:1046 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 1180 transferring lease to s1
I231005 15:42:37.155635 24698 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 1181 transferring lease to s1
I231005 15:42:37.158345 25416 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n1,replicate,s1,r86/1:‹/Tenant/{2/Table/…-3}›] 1182 error processing replica: no removable replicas from range that needs a removal: ‹[1*:19, 3:0]›
I231005 15:42:56.761708 49682 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n1,replicate,s1,r34/1:‹/{NamespaceTab…-Table/32}›] 1183 transferring lease to s3
Here's a zip of the latest logs:
Two more things to note about the kv distribution log:
- cockroach.log indicates that it occurred successfully.
- I added more verbose logging (replicate=6), which revealed that the allocator tried, for about 10 seconds, to rebalance the range to obey the regional constraint, but I don't see any logs which indicate success. I'm not sure how to dig further into these logs.
For example:
❯ cat streamingesttest-kv-distribution.log| grep "mars" | { head -n 5; tail -n 5; }
I231005 18:11:50.502240 23678 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2975 computing range action desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.502598 24272 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2979 planning range change desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.502745 24402 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2981 computing range action desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.503042 24272 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2984 computing range action desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.503276 24219 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2986 planning range change desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:51.332612 25644 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3088 computing range action desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=4, gen=15] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:51.332873 25874 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3090 planning range change desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=4, gen=15] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:51.415052 25874 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3101 planning range change desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=4, gen=15] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:12:01.823578 37698 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3127 computing range action desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=5, gen=17] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:12:01.824062 37609 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3129 planning range change desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=5, gen=17] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
To repro, run the following command on this branch:
./dev test pkg/ccl/streamingccl/streamingest -f TestStreamingRegionalConstraint --vmodule ingest_span_configs=2,span_config_event_stream=2,replicate=6 --stress --count=100 --cpus=8
Here are the test logs with verbose logging: testlogs.zip
Handing this over to @stevendanna while I'm out.
I'm removing the release blocker on this, as the failure is due to the test environment constraints outlined in https://github.com/cockroachdb/cockroach/issues/112541
ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ fad649d89721ddb3e9f3dcab1ad5d14f74c91bc9:
Parameters: TAGS=bazel,gss,deadlock, stress=true
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
/cc @cockroachdb/disaster-recovery
Jira issue: CRDB-31957