cockroachdb / cockroach


ccl/streamingccl/streamingest: TestStreamingRegionalConstraint failed #111541

Closed: cockroach-teamcity closed this issue 1 year ago

cockroach-teamcity commented 1 year ago

ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ fad649d89721ddb3e9f3dcab1ad5d14f74c91bc9:

=== RUN   TestStreamingRegionalConstraint
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint954244461
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
    NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization

pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
    NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.
    replication_stream_e2e_test.go:1217: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 2
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint954244461
--- FAIL: TestStreamingRegionalConstraint (108.11s)

Parameters: TAGS=bazel,gss,deadlock, stress=true

Help

See also: [How To Investigate a Go Test Failure (internal)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

/cc @cockroachdb/disaster-recovery


Jira issue: CRDB-31957

michae2 commented 1 year ago

Another instance: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/12008226?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

cockroach-teamcity commented 1 year ago

ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:

=== RUN   TestStreamingRegionalConstraint
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/74f3eeb4519148f71482aeda8a540383/logTestStreamingRegionalConstraint2541785297
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
    NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization

pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
    NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.

pkg/ccl/streamingccl/streamingest/replication_stream_e2e_test.go:1221: (TestStreamingRegionalConstraint)
    NOTICE: .DB() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().DB() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
    replication_stream_e2e_test.go:1220: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 3
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/74f3eeb4519148f71482aeda8a540383/logTestStreamingRegionalConstraint2541785297
--- FAIL: TestStreamingRegionalConstraint (64.20s)
Help

See also: [How To Investigate a Go Test Failure (internal)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)


cockroach-teamcity commented 1 year ago

ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:

=== RUN   TestStreamingRegionalConstraint
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint2950345968
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
    NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization

pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
    NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.

pkg/ccl/streamingccl/streamingest/replication_stream_e2e_test.go:1221: (TestStreamingRegionalConstraint)
    NOTICE: .DB() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().DB() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
    replication_stream_e2e_test.go:1220: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 3
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint2950345968
--- FAIL: TestStreamingRegionalConstraint (123.16s)

Parameters: TAGS=bazel,gss,deadlock, stress=true

Help

See also: [How To Investigate a Go Test Failure (internal)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)


cockroach-teamcity commented 1 year ago

ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:

=== RUN   TestStreamingRegionalConstraint
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint48417678
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
    NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization

pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
    NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.

pkg/ccl/streamingccl/streamingest/replication_stream_e2e_test.go:1221: (TestStreamingRegionalConstraint)
    NOTICE: .DB() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().DB() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
    replication_stream_e2e_test.go:1220: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 2
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint48417678
--- FAIL: TestStreamingRegionalConstraint (52.66s)

Parameters: TAGS=bazel,gss, stress=true

Help

See also: [How To Investigate a Go Test Failure (internal)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)


msbutler commented 1 year ago

@stevendanna a git bisect revealed that https://github.com/cockroachdb/cockroach/pull/111178 causes

./dev test pkg/ccl/streamingccl/streamingest -f TestStreamingRegionalConstraint --stress --cpus=8 --count=20

to consistently fail on my GCE worker. If I run the same command on the previous commit, the stress test passes with no problem. Do you have any intuition on why the splits PR could affect our ability to replicate span configs?

msbutler commented 1 year ago

The most benign explanation of this regression: the initial splits occasionally leave a test suffering from more severe hardware starvation. When I modify the test to relax the time requirement for observing the replicated span configs by 5x, the test passes 100 stress runs on my GCE worker:

On 4 parallel CPUs:

//pkg/ccl/streamingccl/streamingest:streamingest_test                    PASSED in 67.0s
  Stats over 100 runs: max = 67.0s, min = 7.2s, avg = 8.8s, dev = 5.9s

On 8 parallel CPUs:

//pkg/ccl/streamingccl/streamingest:streamingest_test                    PASSED in 202.9s
  Stats over 100 runs: max = 202.9s, min = 9.2s, avg = 28.8s, dev = 39.7s

But with 12 parallel CPUs, a test will fail after 3m45s. So all of this suggests that hardware exhaustion correlates with this test timing out. We still ought to figure out how to speed up this test.
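
For reference, a minimal sketch (not the actual test diff) of what relaxing the wait by 5x could look like, assuming the destination-side check is wrapped in a SucceedsSoon-style retry with the default 45s window; `checkRegionalConstraint` is a hypothetical stand-in for the test's real iterate callback:

```go
package streamingest

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// waitForRegionalConstraint polls the placement check for 5x the default
// SucceedsSoon window (45s). Hypothetical helper, not the change actually
// made to the test.
func waitForRegionalConstraint(t *testing.T, checkRegionalConstraint func() error) {
	testutils.SucceedsWithin(t, func() error {
		return checkRegionalConstraint()
	}, 5*testutils.DefaultSucceedsSoonDuration)
}
```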

msbutler commented 1 year ago

Just noticed something fishy after bumping the vmodule on ingest_span_configs.go. Before I explain that, recall the timeline of this test:

  1. Init a single MR host cluster with a source app tenant and destination app tenant
  2. Start the replication stream
  3. Create a database with a regional constraint on the source app tenant
  4. Wait for the initial scan to complete
  5. Check that the source side has observed the regional constraint
  6. Check that the destination side has replicated and observed the regional constraint -- we fail here (see the sketch after this list)
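
As a concrete illustration of step 6, here is a hypothetical sketch of the kind of placement check that step performs; the real test uses its own iterate helper in replication_stream_e2e_test.go, and the database name `test` plus the SHOW RANGES-based query are assumptions for illustration only:

```go
package streamingest

import (
	"fmt"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
	"github.com/cockroachdb/errors"
)

// checkConstraintObeyed errors out if any range of the constrained database
// has a replica on a node other than the one satisfying the region=mars
// constraint. Meant to be polled under a SucceedsSoon-style retry, mirroring
// the "found table data located on another node" failure above.
func checkConstraintObeyed(t *testing.T, destSQL *sqlutils.SQLRunner, marsNodeID int) error {
	rows := destSQL.QueryStr(t, `SELECT replicas FROM [SHOW RANGES FROM DATABASE test]`)
	for _, row := range rows {
		// With num_replicas=1, each range should report exactly {marsNodeID}.
		if row[0] != fmt.Sprintf("{%d}", marsNodeID) {
			return errors.Newf("found table data located on another node: replicas=%s", row[0])
		}
	}
	return nil
}
```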

On a green run, we observe the flush of an incremental span config update, which is presumably the region config update:

❯ grep "ingest_span" out.txt
I231004 18:12:19.029687 21135 ccl/streamingccl/streamingest/ingest_span_configs.go:108  [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 704  initialized span config ingestor
I231004 18:12:19.042198 21283 ccl/streamingccl/streamingest/ingest_span_configs.go:255  [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 708  flushing span config 45 updates and 0 deletes
I231004 18:12:19.042242 21283 ccl/streamingccl/streamingest/ingest_span_configs.go:307  [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 709  flushing full span configuration state (45 records)
I231004 18:12:20.193548 21283 ccl/streamingccl/streamingest/ingest_span_configs.go:255  [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 718  flushing span config 1 updates and 0 deletes

On a failed run, we don't observe the incremental update on the destination side. But we do see the initial flush of the span config state:

❯ grep "ingest_span" streamingesttest.log
I231004 18:07:10.480362 55 util/log/file_sync_buffer.go:238 ⋮ [config]   arguments: [‹/home/michaelbutler/.cache/bazel/_bazel_michaelbutler/338fb3aeb5cca80cb3648387a412fd28/sandbox/linux-sandbox/3802/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test› ‹-test.timeout=895s› ‹-vmodule=ingest_span_configs=2›]
I231004 18:07:21.285026 23789 ccl/streamingccl/streamingest/ingest_span_configs.go:108 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905617212680601603›] 698  initialized span config ingestor
I231004 18:07:21.319160 24050 ccl/streamingccl/streamingest/ingest_span_configs.go:255 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905617212680601603›] 702  flushing span config 46 updates and 0 deletes
I231004 18:07:21.319200 24050 ccl/streamingccl/streamingest/ingest_span_configs.go:307 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905617212680601603›] 703  flushing full span configuration state (46 records)

So for whatever reason, the span config event stream has not sent over either the update or a subsequent checkpoint that would induce a destination-side flush.

msbutler commented 1 year ago

Alright, I've confirmed with a bit more logging that the regional constraint was replicated and flushed to the destination-side span config table (see below). The remaining mystery, which is outside of c2c land, is why it takes so long to observe the regional constraints on the destination side.

I231004 19:16:43.593958 66 util/log/file_sync_buffer.go:238 ⋮ [config]   arguments: [‹/home/michaelbutler/.cache/bazel/_bazel_michaelbutler/338fb3aeb5cca80cb3648387a412fd28/sandbox/linux-sandbox/5529/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test› ‹-test.timeout=895s› ‹-vmodule=ingest_span_configs=2,span_config_event_stream=2›]
I231004 19:16:54.714223 24378 ccl/streamingccl/streamingest/ingest_span_configs.go:108 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 716  initialized span config ingestor
I231004 19:16:54.741312 24567 ccl/streamingccl/streamproducer/span_config_event_stream.go:186 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 718  observed complete scan
I231004 19:16:54.741400 24565 ccl/streamingccl/streamproducer/span_config_event_stream.go:267 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 719  observed constraint update [‹+region=mars›]
I231004 19:16:54.742470 24565 ccl/streamingccl/streamproducer/span_config_event_stream.go:274 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 720  checkpointing span config stream at 2023-10-04 19:16:54.740267269 +0000 UTC
I231004 19:16:54.742510 24565 ccl/streamingccl/streamproducer/span_config_event_stream.go:276 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 721  sending 49 span config events
I231004 19:16:54.758117 24466 ccl/streamingccl/streamingest/ingest_span_configs.go:255 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 723  flushing span config 46 updates and 0 deletes
I231004 19:16:54.758150 24466 ccl/streamingccl/streamingest/ingest_span_configs.go:307 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 724  flushing full span configuration state (46 records)
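
(For what it's worth, the same confirmation can come from SQL rather than extra vmodule logging. A hedged sketch, assuming the query runs as the destination system tenant and that system.span_configurations and crdb_internal.pb_to_json are usable there:)

```go
package streamingest

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
)

// dumpDestSpanConfigs logs the constraints recorded in the destination's span
// config table, making a replicated region=mars constraint directly visible.
// Illustrative only; not the logging that was actually added.
func dumpDestSpanConfigs(t *testing.T, destSysSQL *sqlutils.SQLRunner) {
	rows := destSysSQL.QueryStr(t, `
SELECT crdb_internal.pretty_key(start_key, 0),
       crdb_internal.pb_to_json('cockroach.roachpb.SpanConfig', config)->'constraints'
FROM system.span_configurations`)
	for _, row := range rows {
		t.Logf("span config: start=%s constraints=%s", row[0], row[1])
	}
}
```
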
msbutler commented 1 year ago

I have one theory for why the splits PR could be causing this test to slow down: according to the test logs, we issue one initial split at the tenant start key:

❯ cat out.txt | grep "splitting and scattering"
I231004 19:16:54.651275 22316 ccl/streamingccl/streamingest/stream_ingestion_dist.go:362 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 712  splitting and scattering at /Tenant/2

I would expect this split to be a no-op, since I believe there already exists a split at the tenant start key. More interestingly, we also call scatter on this tenant's key space, which will induce some allocator work. If some of this allocator work is asynchronous, then the span config replication stream will begin while the allocator is still handling the scatter. The work to apply the regional constraint would then get queued in the allocator behind this scatter request.

Another thought, with no evidence: perhaps admission control is throttling the allocator more after it handled the admin scatter request.

msbutler commented 1 year ago

Okie dokie, my theory above has some weight: commenting out the scatter request in stream_ingestion_dist.go erases the regression.
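
For posterity, a hypothetical sketch of how that experiment could be made repeatable: gate the post-split scatter behind a flag so it can be switched off when bisecting this kind of regression. `maybeScatter` and `skipScatter` are illustrative names rather than the real structure of stream_ingestion_dist.go, and this assumes kv.DB's AdminScatter(ctx, key, maxSize):

```go
package streamingest

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// maybeScatter issues an AdminScatter for the start of the given span unless
// skipScatter is set. Illustrative only; the real splitting/scattering code
// in stream_ingestion_dist.go is not structured this way.
func maybeScatter(ctx context.Context, db *kv.DB, span roachpb.Span, skipScatter bool) error {
	if skipScatter {
		return nil
	}
	// maxSize=0: no size-based cutoff for the scatter.
	_, err := db.AdminScatter(ctx, span.Key, 0 /* maxSize */)
	return err
}
```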

msbutler commented 1 year ago

Per @kvoli's advice, I took a look at the kv distribution logs of a failed run. I noticed that after the scatter request and the span config update, the allocator attempts to transfer the tenant's range to s1 (perhaps to obey the regional constraint), but the attempt gets rejected, and then there's nothing in the logs for 20 seconds.

I231005 15:42:37.155598 24698 13@kv/kvserver/allocator/plan/replicate.go:1046 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 1180  transferring lease to s1
I231005 15:42:37.155635 24698 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 1181  transferring lease to s1
I231005 15:42:37.158345 25416 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n1,replicate,s1,r86/1:‹/Tenant/{2/Table/…-3}›] 1182  error processing replica: no removable replicas from range that needs a removal: ‹[1*:19, 3:0]›
I231005 15:42:56.761708 49682 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n1,replicate,s1,r34/1:‹/{NamespaceTab…-Table/32}›] 1183  transferring lease to s3

Here's a zip of the latest logs:

testlogs.zip

msbutler commented 1 year ago

Two more things to note about the kv distribution log:

msbutler commented 1 year ago

I added more verbose logging (replicate=6), which revealed that the allocator tried for about 10 seconds to rebalance the range to obey the regional constraint, but I don't see any logs that indicate success. I'm not sure how to dig further into these logs.

For example:

❯ cat streamingesttest-kv-distribution.log| grep "mars" | { head -n 5; tail -n 5; }
I231005 18:11:50.502240 23678 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2975  computing range action desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.502598 24272 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2979  planning range change desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.502745 24402 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2981  computing range action desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.503042 24272 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2984  computing range action desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.503276 24219 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2986  planning range change desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:51.332612 25644 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3088  computing range action desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=4, gen=15] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:51.332873 25874 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3090  planning range change desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=4, gen=15] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:51.415052 25874 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3101  planning range change desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=4, gen=15] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:12:01.823578 37698 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3127  computing range action desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=5, gen=17] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:12:01.824062 37609 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3129  planning range change desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=5, gen=17] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›

To repro, run the following command on this branch:

 ./dev test pkg/ccl/streamingccl/streamingest -f TestStreamingRegionalConstraint --vmodule ingest_span_configs=2,span_config_event_stream=2,replicate=6 --stress --count=100 --cpus=8

Here are the test logs with verbose logging: testlogs.zip

msbutler commented 1 year ago

Handing this over to @stevendanna while I'm out.

msbutler commented 1 year ago

I'm removing the release blocker on this, as the failure is due to the test environment constraints outlined in https://github.com/cockroachdb/cockroach/issues/112541