cockroachdb / cockroach


roachtest: disagg-rebalance/aws/n4cpu4 failed #113732

Status: Closed (cockroach-teamcity closed this issue 1 year ago)

cockroach-teamcity commented 1 year ago

roachtest.disagg-rebalance/aws/n4cpu4 failed with artifacts on master @ 694861a16c8d72a52ac059ef82cf2763ca4406b0:

(monitor.go:153).Wait: monitor failure: full command output in run_063854.624855557_n1_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/disagg-rebalance/aws/n4cpu4/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=aws, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_metamorphicBuild=true, ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

_Grafana is not yet available for aws clusters_

/cc @cockroachdb/storage


Jira issue: CRDB-33126

cockroach-teamcity commented 1 year ago

roachtest.disagg-rebalance/aws/n4cpu4 failed with artifacts on master @ 4d045594e8c65b56c82fcf2a1f14ee30cecfef3d:

(monitor.go:153).Wait: monitor failure: full command output in run_053509.244780444_n1_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/disagg-rebalance/aws/n4cpu4/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=aws, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_metamorphicBuild=true, ROACHTEST_ssd=0

sumeerbhola commented 1 year ago

The test is failing because of an error during the import:

E231104 05:37:30.213024 22028 kv/kvserver/replica_consistency.go:778 ⋮ [T1,Vsystem,n3,s3,r85/3:‹/Table/113/1/1{3/333…-4/653…}›] 211  checksum computation failed: pebble: shared foreign sstable has a lower table format than expected

A lot of snapshot ingestion is happening. Unrelated to the failure, we also see many compactions being cancelled, presumably because of IngestAndExcise. Do we quantify how many bytes were read and written by compactions that were later cancelled, and include them in Metrics? (I didn't see anything on a cursory look at metrics.go.)

I231104 05:36:16.729082 15945 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r48/3:‹/Table/4{5-6}›] 175  applied snapshot a01c6fd8 from (n1,s1):1 at applied index 20 (total=448ms data=693 B excise=true ingestion=6@257ms)
I231104 05:36:17.934059 16166 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r11/3:‹/Table/{7-8}›] 176  applied snapshot ed3dc0ae from (n1,s1):1 at applied index 35 (total=600ms data=715 B shared=1 sharedSize=12 KiB excise=true ingestion=6@471ms)
E231104 05:36:17.987292 16203 3@pebble/event.go:696 ⋮ [n3,s3,pebble] 177  background error: pebble: compaction cancelled by a concurrent operation, will retry compaction
E231104 05:36:19.685225 16482 3@pebble/event.go:696 ⋮ [n3,s3,pebble] 178  background error: pebble: compaction cancelled by a concurrent operation, will retry compaction
I231104 05:36:19.740020 16337 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r3/3:‹/System/{NodeLive…-tsd}›] 179  applied snapshot 475b7d93 from (n1,s1):1 at applied index 76 (total=628ms data=365 KiB excise=true ingestion=6@531ms)
I231104 05:36:21.013004 16620 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r23/3:‹/Table/2{0-1}›] 180  applied snapshot 5722aac3 from (n1,s1):1 at applied index 577 (total=657ms data=37 KiB shared=2 sharedSize=33 KiB excise=true ingestion=6@539ms)
E231104 05:36:21.047639 16684 3@pebble/event.go:696 ⋮ [n3,s3,pebble] 181  background error: pebble: compaction cancelled by a concurrent operation, will retry compaction
I231104 05:36:22.255304 16778 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r29/3:‹/Table/2{6-7}›] 182  applied snapshot 0c4fe160 from (n1,s1):1 at applied index 36 (total=650ms data=1.3 KiB excise=true ingestion=6@538ms)
I231104 05:36:22.886299 16946 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r61/3:‹/Table/{59-60}›] 183  applied snapshot 19e6637e from (n1,s1):1 at applied index 20 (total=586ms data=693 B excise=true ingestion=6@308ms)
I231104 05:36:23.878976 17146 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r55/3:‹/Table/5{3-4}›] 184  applied snapshot ca4918dd from (n1,s1):1 at applied index 1565 (total=536ms data=222 KiB shared=2 sharedSize=80 KiB excise=true ingestion=6@422ms)
I231104 05:36:25.539869 17408 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r51/3:‹/Table/{48-50}›] 185  applied snapshot 15b07722 from (n1,s1):1 at applied index 20 (total=411ms data=705 B shared=1 sharedSize=14 KiB excise=true ingestion=6@279ms)
I231104 05:36:27.097738 17635 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r6/3:‹/Table/{0-3}›] 186  applied snapshot d747d4f2 from (n1,s1):1 at applied index 20 (total=442ms data=693 B excise=true ingestion=6@364ms)
E231104 05:36:27.299577 17701 3@pebble/event.go:696 ⋮ [n3,s3,pebble] 187  background error: pebble: compaction cancelled by a concurrent operation, will retry compaction
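
To put rough numbers on "a lot", the excerpt above can be tallied with a small script. This is a minimal sketch, not an existing tool: the function name `summarize` and the regular expressions are assumptions based only on the line shapes visible in this excerpt. It counts events only; it cannot answer the question about bytes read or written by cancelled compactions, because the log lines do not carry that information.

```python
import re

# Patterns inferred from the log lines quoted in this thread (an assumption,
# not a documented log format).
SNAPSHOT_RE = re.compile(r"applied snapshot \w+ from .*\((?P<stats>[^()]*)\)\s*$")
SHARED_RE = re.compile(r"\bshared=(\d+)")  # does not match "sharedSize="
CANCELLED = "compaction cancelled by a concurrent operation"

# Three lines copied verbatim from the n3 excerpt above.
SAMPLE = """\
I231104 05:36:17.934059 16166 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r11/3:‹/Table/{7-8}›] 176  applied snapshot ed3dc0ae from (n1,s1):1 at applied index 35 (total=600ms data=715 B shared=1 sharedSize=12 KiB excise=true ingestion=6@471ms)
E231104 05:36:17.987292 16203 3@pebble/event.go:696 ⋮ [n3,s3,pebble] 177  background error: pebble: compaction cancelled by a concurrent operation, will retry compaction
I231104 05:36:19.740020 16337 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r3/3:‹/System/{NodeLive…-tsd}›] 179  applied snapshot 475b7d93 from (n1,s1):1 at applied index 76 (total=628ms data=365 KiB excise=true ingestion=6@531ms)
"""


def summarize(lines):
    """Count applied snapshots, shared files, and cancelled compactions."""
    counts = {
        "snapshots": 0,
        "snapshots_with_shared": 0,
        "shared_files": 0,
        "cancelled_compactions": 0,
    }
    for line in lines:
        if CANCELLED in line:
            counts["cancelled_compactions"] += 1
            continue
        match = SNAPSHOT_RE.search(line)
        if match is None:
            continue  # neither a snapshot line nor a cancellation
        counts["snapshots"] += 1
        shared = SHARED_RE.search(match.group("stats"))
        if shared:
            counts["snapshots_with_shared"] += 1
            counts["shared_files"] += int(shared.group(1))
    return counts


if __name__ == "__main__":
    print(summarize(SAMPLE.splitlines()))
```

Run over the full excerpt instead of the three sample lines, the same counting gives nine applied snapshots (four of them carrying shared files) and four cancelled compactions within roughly eleven seconds.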

sumeerbhola commented 1 year ago

The cause is the code in https://github.com/cockroachdb/cockroach/blob/4a53d0b7b014d58cb4498ffd8e50c031997a4020/pkg/storage/sst_writer.go#L75-L77 coupled with this line from the run:

initialized metamorphic constant "storage.value_blocks.enabled" with value false
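
One possible reading of this diagnosis, which the thread does not spell out, is that the ingestion writer picks a lower sstable format when value blocks are disabled, and that a node reading the file as a shared foreign sstable requires a higher minimum format. The toy model below sketches that interaction under exactly those assumptions. Every name and number in it is hypothetical; it is not Pebble's or CockroachDB's API, and the format levels are placeholders rather than real table format constants.

```python
# Toy model only: all names and numbers are hypothetical placeholders.
FORMAT_BASE = 1                # assumed format when value blocks are disabled
FORMAT_VALUE_BLOCKS = 2        # assumed format when value blocks are enabled
MIN_SHARED_FOREIGN_FORMAT = 2  # assumed floor for shared foreign sstables


class TableFormatError(Exception):
    """Raised when a shared foreign sstable is below the assumed floor."""


def writer_format(value_blocks_enabled):
    """Pick the format a writer would use, given the metamorphic setting."""
    return FORMAT_VALUE_BLOCKS if value_blocks_enabled else FORMAT_BASE


def check_shared_foreign(fmt):
    """Reject files whose format is below the assumed minimum."""
    if fmt < MIN_SHARED_FOREIGN_FORMAT:
        raise TableFormatError(
            "shared foreign sstable has a lower table format than expected"
        )


def readable_as_foreign(value_blocks_enabled):
    """Return True if a file written under this setting passes the check."""
    try:
        check_shared_foreign(writer_format(value_blocks_enabled))
    except TableFormatError:
        return False
    return True


if __name__ == "__main__":
    for enabled in (True, False):
        print(f"value_blocks.enabled={enabled}: ok={readable_as_foreign(enabled)}")
```

In this model the failure appears only when the setting is false, which would be consistent with the test failing on a metamorphic build that happened to pick that value.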