Open nstewart opened 6 years ago
For customers using SQL EXPERIEMTNAL_AUDIT they could quickly search through the audit file, find the UTC timestamp of the offending query, and rollback to the instant before it was run,
Not that's not true.
1) The UTC timestamp in the audit log events has no relationship to the timestamp used in BACKUP/RESTORE. The former is wall clock, the latter is cluster logical timestamp. They can be off by seconds.
2) It can be off by more than seconds or minutes depending on how long the ALTER statement took: the aut log message timestamp is reported when the statement completes not when it starts.
So no please tell customers/users to stay away from the audit log to decide what to do otherwise they're up for disappointment and we'll be up for support pain.
Although that said I agree we should be doing a better job to help users here.
Noted, @knz. @jseldess @rmloveland - as far as I can tell we don't talk about the difference between wall clock and logical clock anywhere in our BACKUP/RESTORE docs. Is system time
supposed to mean logical clock time? That's not clear to me.
@Amruta-Ranade, your the best one to speak to Nate's question above, I think.
Discussed it with @dt ..He can explain it far better than me.
@nstewart you've nailed the question to the fundamentals.
1) backup/restore talk about cluster logical timestamps
2) we don't document cluster logical timestamps anywhere. From a user's perpective (public docs, advertised info) they don't exist
3) the various timestamps that are visible (transaction_timestamp()
, system clock in logs etc) are unsuitable.
We can't reconcile point 1 and 3 while 2 is unsolved. That's where the work should start. @bdarnell and @andreimatei had a fundamental discussion about this two weeks ago and you should probably follow up with them.
https://github.com/cockroachdb/cockroach/issues/19749 came up in this context too: we want to see when something changed so we know when we want to rollback to.
We only show wall-times in backup/restore AS OF SYSTEM TIME
examples which I think is fine -- operators currently are likely to know wall times, not logical ones, particularly when data corruption or whatever came from external sources. For example, I'd expect I flipped my load balancer over to the buggy version of my app at 11:00 this morning, so I need to rollback to 10:55 or so to be safe
to be far more common than at logical tick 527041 i wrote the first wrong value
.
Perhaps making cluster logical timestamps for job start and end visible in the web UI would help with the underlying issue of wanting to roll back to before a schema change happened. @piyush-singh, is this a scenario you've considered or something you're approaching from other angles?
@rolandcrosby this is the first time I've heard of this problem, so it is not something we had considered. I think we would be able to provide some huge usability wins via the UI. For example, if we indeed had some way to grab the logical timestamp from the moments before the schema change was triggered, we could list the history of schema changes on the table pages in the UI and give users a button to copy/paste the restore commands to restore the cluster immediately before the change they'd like to revert. Happy to discuss this further, and agreed that the UI could help.
@piyush-singh Great, glad you think this would be helpful, I'll do some more research around problems to solve here and discuss with you. In looking at this some more, I think we have an opportunity to improve confidence and understanding of schema changes through both the SQL CLI and the web UI, especially with the changes you're evaluating to surface schemas in context in places like the statements page.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
We currently don't provide enough visibility into WHEN a fat finger mistake occurred. This makes it difficult to determine the exact moment to backup / restore.
For customers using SQL EXPERIEMTNAL_AUDIT they could quickly search through the audit file, find the UTC timestamp of the offending query, and rollback to the instant before it was run, minimizing data loss on restore and minimizing downtime (since they don't have to keep rewinding/fast-forwarding to find a recovery point). Because EXERIMENTAL_AUDIT is expensive and not on by default we need another option.
@knz noted the SQL event log could be extended to include the last logical cluster time known to be valid before the schema change operation started. It's on by default and dbas could quickly go through that to find the restore point.
This isn't a blocker to GA point in time restore but this is a usability miss that we should fix this release.
Jira issue: CRDB-5691