backupccl: make it easy and reliable to find BACKUP/RESTORE timestamps

cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.

https://www.cockroachlabs.com

Other

30.13k stars 3.81k forks source link

backupccl: make it easy and reliable to find BACKUP/RESTORE timestamps #25803

Open nstewart opened 6 years ago

nstewart commented 6 years ago

We currently don't provide enough visibility into WHEN a fat finger mistake occurred. This makes it difficult to determine the exact moment to backup / restore.

For customers using SQL EXPERIEMTNAL_AUDIT they could quickly search through the audit file, find the UTC timestamp of the offending query, and rollback to the instant before it was run, minimizing data loss on restore and minimizing downtime (since they don't have to keep rewinding/fast-forwarding to find a recovery point). Because EXERIMENTAL_AUDIT is expensive and not on by default we need another option.

@knz noted the SQL event log could be extended to include the last logical cluster time known to be valid before the schema change operation started. It's on by default and dbas could quickly go through that to find the restore point.

This isn't a blocker to GA point in time restore but this is a usability miss that we should fix this release.

Jira issue: CRDB-5691

knz commented 6 years ago

For customers using SQL EXPERIEMTNAL_AUDIT they could quickly search through the audit file, find the UTC timestamp of the offending query, and rollback to the instant before it was run,

Not that's not true.

1) The UTC timestamp in the audit log events has no relationship to the timestamp used in BACKUP/RESTORE. The former is wall clock, the latter is cluster logical timestamp. They can be off by seconds.

2) It can be off by more than seconds or minutes depending on how long the ALTER statement took: the aut log message timestamp is reported when the statement completes not when it starts.

So no please tell customers/users to stay away from the audit log to decide what to do otherwise they're up for disappointment and we'll be up for support pain.

Although that said I agree we should be doing a better job to help users here.

nstewart commented 6 years ago

Noted, @knz. @jseldess @rmloveland - as far as I can tell we don't talk about the difference between wall clock and logical clock anywhere in our BACKUP/RESTORE docs. Is system time supposed to mean logical clock time? That's not clear to me.

jseldess commented 6 years ago

@Amruta-Ranade, your the best one to speak to Nate's question above, I think.

Amruta-Ranade commented 6 years ago

Discussed it with @dt ..He can explain it far better than me.

knz commented 6 years ago

@nstewart you've nailed the question to the fundamentals.

1) backup/restore talk about cluster logical timestamps

2) we don't document cluster logical timestamps anywhere. From a user's perpective (public docs, advertised info) they don't exist

3) the various timestamps that are visible (transaction_timestamp(), system clock in logs etc) are unsuitable.

We can't reconcile point 1 and 3 while 2 is unsolved. That's where the work should start. @bdarnell and @andreimatei had a fundamental discussion about this two weeks ago and you should probably follow up with them.

dt commented 6 years ago

https://github.com/cockroachdb/cockroach/issues/19749 came up in this context too: we want to see when something changed so we know when we want to rollback to.

We only show wall-times in backup/restore AS OF SYSTEM TIME examples which I think is fine -- operators currently are likely to know wall times, not logical ones, particularly when data corruption or whatever came from external sources. For example, I'd expect I flipped my load balancer over to the buggy version of my app at 11:00 this morning, so I need to rollback to 10:55 or so to be safe to be far more common than at logical tick 527041 i wrote the first wrong value.

rolandcrosby commented 5 years ago

Perhaps making cluster logical timestamps for job start and end visible in the web UI would help with the underlying issue of wanting to roll back to before a schema change happened. @piyush-singh, is this a scenario you've considered or something you're approaching from other angles?

piyush-singh commented 5 years ago

@rolandcrosby this is the first time I've heard of this problem, so it is not something we had considered. I think we would be able to provide some huge usability wins via the UI. For example, if we indeed had some way to grab the logical timestamp from the moments before the schema change was triggered, we could list the history of schema changes on the table pages in the UI and give users a button to copy/paste the restore commands to restore the cluster immediately before the change they'd like to revert. Happy to discuss this further, and agreed that the UI could help.

rolandcrosby commented 5 years ago

@piyush-singh Great, glad you think this would be helpful, I'll do some more research around problems to solve here and discuss with you. In looking at this some more, I think we have an opportunity to improve confidence and understanding of schema changes through both the SQL CLI and the web UI, especially with the changes you're evaluating to surface schemas in context in places like the statements page.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!