cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

online restore: nodes fail to restart after online restore with file size errors #120253

Open dt opened 6 months ago

dt commented 6 months ago

On a cluster running a recent build with copy compactions, an online restore had completed and the download job had partially completed. Restarting the nodes afterwards failed with errors of the form:

E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437750: object size mismatch (‹/mnt/data1/cockroach/1437750.sst›): 146311389 (disk) != 295967218 (MANIFEST)
E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437810: object size mismatch (‹/mnt/data1/cockroach/1437810.sst›): 147222939 (disk) != 290265248 (MANIFEST)
E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437811: object size mismatch (‹/mnt/data1/cockroach/1437811.sst›): 148048145 (disk) != 86695858 (MANIFEST)
E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437812: object size mismatch (‹/mnt/data1/cockroach/1437812.sst›): 146060907 (disk) != 106969408 (MANIFEST)
E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437906: object size mismatch (‹/mnt/data1/cockroach/1437906.sst›): 146408479 (disk) != 208488547 (MANIFEST)
E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437907: object size mismatch (‹/mnt/data1/cockroach/1437907.sst›): 148048145 (disk) != 251659198 (MANIFEST)
E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437908: object size mismatch (‹/mnt/data1/cockroach/1437908.sst›): 145191537 (disk) != 202325081 (MANIFEST)
E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437911: object size mismatch (‹/mnt/data1/cockroach/1437911.sst›): 148048145 (disk) != 28244388 (MANIFEST)
E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437912: object size mismatch (‹/mnt/data1/cockroach/1437912.sst›): 147080858 (disk) != 173082996 (MANIFEST)
E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437913: object size mismatch (‹/mnt/data1/cockroach/1437913.sst›): 144742335 (disk) != 319329200 (MANIFEST)
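To compare the reported sizes across files, the lines above can be parsed mechanically. The following is a minimal sketch, assuming the log format is exactly as quoted; `sizeMismatch`, `mismatchRE`, and `parseMismatch` are hypothetical names for illustration and are not part of CockroachDB or Pebble.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sizeMismatch holds the fields extracted from one "object size mismatch" line.
// Hypothetical helper type, not part of CockroachDB or Pebble.
type sizeMismatch struct {
	FileNum      uint64
	DiskSize     int64
	ManifestSize int64
}

// mismatchRE matches the tail of the quoted error lines, e.g.
// "1437750: object size mismatch (<path>): 146311389 (disk) != 295967218 (MANIFEST)".
var mismatchRE = regexp.MustCompile(
	`(\d+): object size mismatch \(.*\): (\d+) \(disk\) != (\d+) \(MANIFEST\)`)

// parseMismatch extracts the file number and both sizes from a log line.
// It returns false if the line does not contain a size mismatch report.
func parseMismatch(line string) (sizeMismatch, bool) {
	m := mismatchRE.FindStringSubmatch(line)
	if m == nil {
		return sizeMismatch{}, false
	}
	num, err1 := strconv.ParseUint(m[1], 10, 64)
	disk, err2 := strconv.ParseInt(m[2], 10, 64)
	manifest, err3 := strconv.ParseInt(m[3], 10, 64)
	if err1 != nil || err2 != nil || err3 != nil {
		return sizeMismatch{}, false
	}
	return sizeMismatch{FileNum: num, DiskSize: disk, ManifestSize: manifest}, true
}

func main() {
	// Two lines copied from the report above.
	lines := []string{
		"E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437750: object size mismatch (‹/mnt/data1/cockroach/1437750.sst›): 146311389 (disk) != 295967218 (MANIFEST)",
		"E240311 18:01:25.506855 1 1@cli/clierror/check.go:35 ⋮ [-] 12 +L6: 1437811: object size mismatch (‹/mnt/data1/cockroach/1437811.sst›): 148048145 (disk) != 86695858 (MANIFEST)",
	}
	for _, line := range lines {
		if m, ok := parseMismatch(line); ok {
			// A positive delta means the file on disk is larger than the MANIFEST says.
			fmt.Printf("file %d: disk=%d manifest=%d delta=%d\n",
				m.FileNum, m.DiskSize, m.ManifestSize, m.DiskSize-m.ManifestSize)
		}
	}
}
```

Printing the signed delta makes it easy to see that the on-disk file is smaller than the MANIFEST size in some lines and larger in others.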

Unfortunately, the cluster is no longer available, so we cannot collect the actual LSM state or directory contents. A few questions to follow up on: a) Can we reproduce this? b) Are we checking the actual sizes, with an RPC to the storage provider, for every single file on start, or only for downloaded on-disk files? c) What is the size in the MANIFEST, and where did we get it?
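For question b), the kind of check being asked about can be sketched as follows. This is an illustration only, not Pebble's actual startup code: `sizeFunc` and `checkSizes` are hypothetical names, and the expected sizes stand in for whatever the MANIFEST records. The size lookup is passed in as a function so the same check could be backed by a local `os.Stat` or, in principle, by a remote call to a storage provider.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// sizeFunc returns the actual size of a named object. Hypothetical abstraction:
// it could wrap a local stat or a remote storage lookup.
type sizeFunc func(name string) (int64, error)

// checkSizes compares the actual size of each object against the expected size
// and returns one error per object that is missing or has a different size.
func checkSizes(expected map[string]int64, sizeOf sizeFunc) []error {
	// Sort names so the output order is deterministic.
	names := make([]string, 0, len(expected))
	for name := range expected {
		names = append(names, name)
	}
	sort.Strings(names)

	var errs []error
	for _, name := range names {
		got, err := sizeOf(name)
		if err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", name, err))
			continue
		}
		if want := expected[name]; got != want {
			errs = append(errs, fmt.Errorf(
				"%s: object size mismatch: %d (disk) != %d (expected)", name, got, want))
		}
	}
	return errs
}

func main() {
	dir, err := os.MkdirTemp("", "sizecheck")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Write a 10-byte file, then claim it should be 20 bytes.
	if err := os.WriteFile(filepath.Join(dir, "000001.sst"), make([]byte, 10), 0o644); err != nil {
		panic(err)
	}

	// localSize checks files on the local filesystem only.
	localSize := func(name string) (int64, error) {
		info, err := os.Stat(filepath.Join(dir, name))
		if err != nil {
			return 0, err
		}
		return info.Size(), nil
	}

	for _, e := range checkSizes(map[string]int64{"000001.sst": 20}, localSize) {
		fmt.Println(e)
	}
}
```

Whether the real check consults local files only, or also objects that have not yet been downloaded, is exactly what the question above leaves open.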

Jira issue: CRDB-36565

blathers-crl[bot] commented 6 months ago

cc @cockroachdb/disaster-recovery

blathers-crl[bot] commented 6 months ago

Hi @dt, please add branch-* labels to identify which branch(es) this GA-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.