cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Make Restores more performant/resilient with very large operations across ISPs #97818

Open benbardin opened 1 year ago

benbardin commented 1 year ago

While attempting to restore an 8TB database from AWS Virginia to Azure Iowa, we encountered repeated "TLS: Bad MAC record" errors that broke the restore. We suspect glitchy intermediate hardware and possibly a bug in the Go crypto library, but were unable to confirm either.

While attempting to restore an 8TB database from AWS Virginia to GCP South Carolina, we encountered repeated "exhausted retries: importing 14457 ranges: inbox communication error: grpc: context cancelled" messages. These paused the restore job rather than cancelling it outright, which is still unexpected. The error isn't deterministic: resuming the job repeatedly let it make further progress.

The fixture in question is the 8TB TPC-E workload fixture.

The common thread could be AWS, but we don't seem to have issues like these on backups. It seems likely there's room to make Restore more resilient to network instability.

Jira issue: CRDB-24900

Epic CRDB-20915

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/disaster-recovery