apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

Create fdbcli command to force terminate all backup/DR mutation streams on a cluster #1111

Open alexmiller-apple opened 5 years ago

alexmiller-apple commented 5 years ago

From "Aborting a DR/backup when destination is unreachable": if you fdbdr --abort without --cleanup, and then permanently destroy the secondary cluster, the only way out of that situation is to manually issue clearrange commands.

We should build an equivalent command into fdbcli, as we probably don't wish for any operational changes to require manual set/clearrange commands.

xumengpanda commented 5 years ago

I assume the manual clearrange commands are still needed in the current version (6.0).

Maybe it can be marked as "good first issue" for anyone who wants to contribute to FDB?

alexmiller-apple commented 5 years ago

I don't know the backup code, and it was unclear to me whether refactoring the code in DatabaseBackupAgent that stops mutation streams on the source cluster into its own function, which fdbcli could then invoke, would be a better option than just adding an fdbcli command that issues two clear ranges.

satherton commented 5 years ago

While we could add something that just does the range clears, with a warning like "only do this if you know what you are doing" and an explanation of the requirements and side effects, with a little more work we can avoid some potentially confusing side effects.

What we are effectively doing with these clears is ending and clearing all mutation streams the database is currently accumulating. There are in practice only two users of these streams currently, Backup and DR. It would be best to either abort all active Backup and DR tags or update Backup and DR to detect and react to their source mutation streams having been terminated in this manner.

Canceling all active Backups is easy, but unfortunately canceling all active DRs is not, since the DR execution state lives in the secondary clusters, each tag can use a different one, and secondary cluster connection strings (cluster file contents) are not stored anywhere in the database. So DR will have to detect and react to the mutation stream cancellation.

A reasonably user-friendly form that this tool could take is to have fdbdr reset_dr_and_backup -C <primary_cluster_file> which, in a single transaction, does the following:

And then the DR mutation stream read tasks must detect when the source mutation stream has disappeared (its configuration in the source database is gone) and react by aborting the DR to a new state, perhaps either aborted_by_reset or aborted_source_stream_ended. I believe the current behavior is to just wait and poll again, waiting for mutations to show up.
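
To make the proposed detection concrete, here is a minimal Python sketch of one polling step. It does not use the FoundationDB API: the source database is modeled as a plain dict, and `stream_config_key`, `poll_mutation_stream`, and the key layout are hypothetical names chosen for illustration. Only the `aborted_source_stream_ended` state name comes from the suggestion above.

```python
from enum import Enum


class DrState(Enum):
    RUNNING = "running"
    # One of the two state names suggested above.
    ABORTED_SOURCE_STREAM_ENDED = "aborted_source_stream_ended"


def stream_config_key(tag: str) -> str:
    # Hypothetical key layout; the real system keyspace looks different.
    return f"stream_config/{tag}"


def poll_mutation_stream(source_db: dict, tag: str, state: DrState) -> DrState:
    """One polling step of a DR mutation stream read task (toy model)."""
    if state is not DrState.RUNNING:
        return state  # already in a terminal state, nothing to do
    if stream_config_key(tag) not in source_db:
        # The stream's configuration is gone from the source database,
        # so abort rather than wait for mutations that will never arrive.
        return DrState.ABORTED_SOURCE_STREAM_ENDED
    return DrState.RUNNING  # stream still configured, keep polling


# Usage: the stream exists on the first poll, then the source is reset.
source_db = {stream_config_key("default"): "config"}
state = poll_mutation_stream(source_db, "default", DrState.RUNNING)
source_db.clear()  # stands in for the reset on the source cluster
state_after = poll_mutation_stream(source_db, "default", state)
```

The point of the sketch is only the branch on a missing configuration: instead of treating "no data" as "wait and poll again", the task treats "no configuration" as a terminal condition.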

Alternatively, instead of aborting all the Backup tags, the same end-of-stream detection could be added to Backup, but it is just easier to abort them as part of the 'reset' transaction because their execution state exists in the same database.
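
A rough Python sketch of the 'reset' idea, covering only the two actions discussed in this thread: aborting Backup tags and clearing mutation streams together. This is a toy model, not FoundationDB code. The database is a dict, the key prefixes (`backup_state/`, `stream_config/`, `mutation_log/`) and the function `reset_dr_and_backup` are hypothetical, and atomicity is imitated by building a new state and leaving the input untouched.

```python
def reset_dr_and_backup(db: dict) -> dict:
    """Return a new state with backups aborted and mutation streams cleared.

    Toy model of a single transaction: the input dict is never modified,
    so a caller either adopts the whole new state or none of it.
    """
    new_db = {}
    for key, value in db.items():
        if key.startswith("backup_state/"):
            # Backup execution state lives in this database, so it can be
            # aborted directly as part of the reset.
            new_db[key] = "aborted"
        elif key.startswith(("stream_config/", "mutation_log/")):
            # Drop stream configuration and accumulated mutations.
            continue
        else:
            new_db[key] = value  # unrelated data is left alone
    return new_db


# Usage with made-up tags and keys.
db = {
    "backup_state/nightly": "running",
    "stream_config/nightly": "cfg",
    "stream_config/dr1": "cfg",
    "mutation_log/dr1/0001": "mutation",
    "user/key": "value",
}
after = reset_dr_and_backup(db)
```

DR tags have no entry to abort here, which mirrors the point above: their execution state lives in the secondary clusters, so they have to notice the missing stream on their own.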