cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.88k stars 3.77k forks source link

roachtest: backup-restore/round-trip failed #128653

Open cockroach-teamcity opened 1 month ago

cockroach-teamcity commented 1 month ago

roachtest.backup-restore/round-trip failed with artifacts on release-24.1.3-rc @ 9e10212477fde97f55ea4fff01797288c836575c:

(monitor.go:154).Wait: monitor failure: full command output in run_061358.760017175_n4_COCKROACHRANDOMSEED6.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/backup-restore/round-trip/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-41126

stevendanna commented 1 month ago

This looks a bit scarry:

We were creating a schema:

  "BEGIN",
  {
   "sql": "CREATE SCHEMA IF NOT EXISTS schema_w1_210 AUTHORIZATION root"
  },
  "COMMIT"
 ],

And got:

Error: ***UNEXPECTED COMMIT ERROR; Received an unexpected commit error: ERROR: error executing PreCommitPhase stage 2 of 2 with 8 MutationType ops: persisting catalog mutations: unexpected value: raw_bytes:"\302\026\260q\003\022\365\014\n\014schemachange\020{\0325\n\013\n\005admin\020\002\030\002\n\r\n\006public\020\200\020\030\000\n\n\n\004root\020\002\030\002\022\troachprod\030\003\"\000(!:\014\n\006public\022\002\010|:\024\n\rschema_w0_120\022\003\010\262\001:\024\n\rschema_w0_131\022\003\010\263\001:\024\n\rschema_w0_134\022\003\010\270\001:\024\n\rschema_w0_146\022\003\010\300\001:\024\n\rschema_w0_157\022\003\010\307\001:\024\n\rschema_w0_165\022\003\010\327\001:\024\n\rschema_w0_180\022\003\010\353\001:\024\n\rschema_w0_232\022\003\010\261\002:\023\n\014schema_w0_46\022\003\010\223\001:\023\n\014schema_w0_64\022\003\010\235\001:\023\n\014schema_w0_74\022\003\010\241\001:\023\n\014schema_w0_78\022\003\010\245\001:\023\n\014schema_w0_83\022\003\010\251\001:\024\n\rschema_w1_116\022\003\010\266\001:\024\n\rschema_w1_122\022\003\010\276\001:\024\n\rschema_w1_128\022\003\010\303\001:\024\n\rschema_w1_136\022\003\010\312\001:\024\n\rschema_w1_156\022\003\010\344\001:\024\n\rschema_w1_160\022\003\010\351\001:\024\n\rschema_w1_188\022\003\010\237\002:\023\n\014schema_w1_41\022\003\010\222\001:\023\n\014schema_w1_43\022\003\010\224\001:\023\n\014schema_w1_50\022\003\010\234\001:\023\n\014schema_w1_66\022\003\010\242\001:\023\n\014schema_w1_72\022\003\010\250\001:\023\n\014schema_w1_76\022\003\010\254\001:\023\n\014schema_w1_87\022\003\010\261\001:\023\n\014schema_w1_94\022\003\010\264\001:\024\n\rschema_w2_100\022\003\010\257\001:\023\n\014schema_w2_11\022\003\010\216\001:\024\n\rschema_w2_110\022\003\010\265\001:\024\n\rschema_w2_113\022\003\010\267\001:\024\n\rschema_w2_119\022\003\010\301\001:\024\n\rschema_w2_121\022\003\010\304\001:\024\n\rschema_w2_122\022\003\010\305\001:\024\n\rschema_w2_123\022\003\010\306\001:\024\n\rschema_w2_143\022\003\010\316\001:\024\n\rschema_w2_146\022\003\010\321\001:\024\n\rschema_w2_153\022\003\010\343\001:\024\n\rschema_w2_166\022\003\010\370\001:\024\n\rschema_w2_179\022\003\010\211\002:\023\n\014schema_w2_20\022\003\010\220\001:\024\n\rschema_w2_211\022\003\010\254\002:\024\n\rschema_w2_216\022\003\010\263\002:\022\n\013schema_w2_4\022\003\010\215\001:\023\n\014schema_w2_98\022\003\010\256\001:\024\n\rschema_w3_136\022\003\010\272\001:\024\n\rschema_w3_138\022\003\010\275\001:\024\n\rschema_w3_170\022\003\010\310\001:\024\n\rschema_w3_197\022\003\010\362\001:\024\n\rschema_w3_202\022\003\010\215\002:\024\n\rschema_w3_207\022\003\010\232\002:\024\n\rschema_w3_211\022\003\010\235\002:\024\n\rschema_w3_217\022\003\010\265\002:\023\n\014schema_w3_55\022\003\010\236\001:\023\n\014schema_w3_69\022\003\010\253\001:\023\n\014schema_w4_10\022\003\010\217\001:\024\n\rschema_w4_131\022\003\010\271\001:\024\n\rschema_w4_141\022\003\010\302\001:\024\n\rschema_w4_166\022\003\010\311\001:\024\n\rschema_w4_180\022\003\010\365\001:\024\n\rschema_w4_183\022\003\010\374\001:\024\n\rschema_w4_185\022\003\010\205\002:\024\n\rschema_w4_190\022\003\010\217\002:\024\n\rschema_w4_194\022\003\010\225\002:\024\n\rschema_w4_204\022\003\010\251\002:\022\n\013schema_w4_4\022\003\010\214\001:\023\n\014schema_w4_61\022\003\010\237\001:\023\n\014schema_w4_69\022\003\010\246\001:\023\n\014schema_w4_70\022\003\010\252\001:\023\n\014schema_w4_82\022\003\010\255\001:\023\n\014schema_w4_89\022\003\010\260\001@\000J\000Z\002\020\000" timestamp:<wall_time:1723184068115477289 >  (SQLSTATE XXUUU)
stevendanna commented 1 month ago

Wondering if foundations can help take a look here.

rafiss commented 1 month ago

@Dedej-Bergin Can you try to reproduce this error during your on-call?

To start, we can use:

./dev test pkg/ccl/testccl/workload/schemachange/ --stress --count=200 -f=TestWorkload

to run the workload locally. Since this seems related to schema creation, try doing this with the weights for CREATE and DROP SCHEMA increased here: https://github.com/cockroachdb/cockroach/blob/bb2613e5348ea864c6a9cc348af73117d701f453/pkg/workload/schemachange/optype.go#L291

We should test on the 24.1 branch.

Dedej-Bergin commented 1 month ago

Hi @rafiss

I was able to reproduce this error by running schemchange with CREATE and DROP SCHEMA weighted at 1000. But I'm not sure if this is the same error that we were looking for?

{
 "workerId": 0,
 "clientTimestamp": "19:24:33.341219",
 "ops": [
 "BEGIN",
 {
  "sql": "DROP SCHEMA \"schema_w1_44\" CASCADE"
 },
 "COMMIT"
 ],
 "expectedExecErrors": "",
 "expectedCommitErrors": "",
 "message": ""
}
{
 "workerId": 1,
 "clientTimestamp": "19:24:33.480478",
 "ops": [
 "BEGIN",
 {
  "sql": "DROP SCHEMA \"schema_w0_73\" CASCADE"
 }
 ],
 "expectedExecErrors": "",
 "expectedCommitErrors": "",
 "message": "ERROR: restart transaction: TransactionRetryWithProtoRefreshError: WriteTooOldError: write for key /Table/3/1/104/2/1 at timestamp 1723577073.480271000,0 too old; must write at or above 1723577073.523323000,1: \"sql txn\" meta={id=9ac5901c key=/Table/24/1/4/196/0/0 iso=Serializable pri=0.00434769 epo=0 ts=1723577073.523323000,1 min=1723577073.480271000,0 seq=5} lock=true stat=PENDING rts=1723577073.480271000,0 wto=false gul=1723577073.980271000,0 (SQLSTATE 40001)"
}
{
 "workerId": 1,
 "clientTimestamp": "19:24:33.532766",
 "ops": [
 "BEGIN",
 {
  "sql": "DROP SCHEMA \"schema_w1_64\" CASCADE",
  "expectedExecErr": "3F000"
 }
 ],
 "expectedExecErrors": "3F000",
 "expectedCommitErrors": "",
 "message": "ROLLBACK; Successfully got expected execution error.: ERROR: unknown schema \"schema_w1_64\" (SQLSTATE 3F000)"
}
{
 "workerId": 0,
 "clientTimestamp": "19:24:33.497924",
 "ops": [
 "BEGIN",
 {
  "sql": "CREATE SCHEMA IF NOT EXISTS schema_w0_80 AUTHORIZATION root"
 },
 "COMMIT"
 ],
 "expectedExecErrors": "",
 "expectedCommitErrors": "",
 "message": ""
}
{
 "workerId": 1,
 "clientTimestamp": "19:24:33.541701",
 "ops": [
 "BEGIN",
 {
  "sql": "CREATE SCHEMA IF NOT EXISTS schema_w1_65 AUTHORIZATION root"
 },
 "COMMIT"
 ],
 "expectedExecErrors": "",
 "expectedCommitErrors": "",
 "message": "TXN RETRY ERROR; ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to conflicting locks on /NamespaceTable/30/1/104/0/\"schema_w0_81\"/4/1 [reason=wait_policy] - conflicting txn: meta={id=e84718f0 key=/Table/3/1/218/2/1 iso=Serializable pri=0.05714642 epo=0 ts=1723577073.542676000,0 min=1723577073.542676000,0 seq=3}): \"sql txn\" meta={id=2a78d868 key=/Table/3/1/217/2/1 iso=Serializable pri=0.00221832 epo=0 ts=1723577073.542676000,1 min=1723577073.541469000,0 seq=3} lock=true stat=PENDING rts=1723577073.541469000,0 wto=false gul=1723577074.041469000,0 (SQLSTATE 40001)"
}

Schema Workload Stats
Total Schema Statements Executed = 212
Total Schema Statements Succeeded = 212
Total Schema Statement Expected Failures = 0
Total Transactions Committed = 95
Total Transactions Rolled Back = 105
Total Transactions Executed = 200
PASS
================================================================================
Target //pkg/ccl/testccl/workload/schemachange:schemachange_test up-to-date:
  _bazel/bin/pkg/ccl/testccl/workload/schemachange/schemachange_test_/schemachange_test
ERROR: Test failed, aborting
INFO: Elapsed time: 1325.569s, Critical Path: 49.44s
INFO: 141 processes: 12 internal, 129 darwin-sandbox.
INFO: Build completed successfully, 141 total actions
//pkg/ccl/testccl/workload/schemachange:schemachange_test             NO STATUS
  /private/var/tmp/_bazel_bdedej/3d5e8839f3c3e22b0d07becaa01ecc28/execroot/com_github_cockroachdb_cockroach/bazel-out/darwin_arm64-fastbuild/testlogs/pkg/ccl/testccl/workload/schemachange/schemachange_test/run_131_of_200/test.log

Executed 1 out of 1 test: 1 was skipped.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.
INFO: Build Event Protocol files produced successfully.
ERROR: exit status 3
stevendanna commented 1 month ago

@Dedej-Bergin I'm not sure that is really a reproduction. While your log output shows some TransactionRetryWithProtoRefreshError errors. It does show the unexpected value error that we got in the original test.

rafiss commented 1 month ago

Yeah, our workload is built to be able to tolerate the TransactionRetryWithProtoRefreshError error.

Also, in your comment, I see at the bottom that it says PASS

Schema Workload Stats
Total Schema Statements Executed = 212
Total Schema Statements Succeeded = 212
Total Schema Statement Expected Failures = 0
Total Transactions Committed = 95
Total Transactions Rolled Back = 105
Total Transactions Executed = 200
PASS

but further down it says that the test failed. I'm not sure what the failure was.

Anyway, to reproduce this, we should keep trying until we see the unexpected value: raw_bytes error like Steven says.

Dedej-Bergin commented 1 month ago

Current update: I have not been able to repro this issue.

I have tried running ./dev test pkg/ccl/testccl/workload/schemachange/ --stress --count=200 -f=TestWorkload several times on 24.1 with seed COCKROACH_RANDOM_SEED=6359055215961277801 .

I got the seed from the artifacts link above in this log file: https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_RoachtestNightlyGceBazel/16387832:id/backup-restore/round-trip/run_1/artifacts.zip!/run_061358.760017175_n4_COCKROACHRANDOMSEED6.log

rafiss commented 1 month ago

Thanks @Dedej-Bergin! This seems rare enough that we can remove the release-blocker tag, but let's sync up offline about different ways we can try to reproduce this.

stevendanna commented 1 month ago

I wonder if we can build a TestServer-based test that includes some of the backup/restore work that the roachtest is doing to see if that is somehow in play here.