apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

If all replicas are lost, a StorageServer could source a shard replication from a backup. #3699

Open satherton opened 4 years ago

satherton commented 4 years ago

If all replicas of a shard or set of shards are lost, it is actually possible, but slow, to restore them from an active backup.

Note that this plan assumes that although the shard is not readable it is still possible to commit blind writes to it. If we remove this requirement then the complexity is greatly reduced.

The sequence is roughly

  1. Use the backup metadata to find the Key-Value Range Snapshot files relevant to the target set of shards and load the relevant ranges from those files.
  2. Use the backup log stream to update each of the loaded ranges to a version which still exists in the FDB log system for the target shards.
  3. Switch to using the FDB log system as the source of mutations (1-2 minutes behind, see below)
  4. Keep applying until caught up.

There are of course a lot of details being glossed over here. Here are the ones I can think of:

If instead blind writes to the lost shards are not allowed, then there is no need to switch to the log system as a mutation source during the restore process. Once the backup mutation log has been used to update each shard to a data version at or above the point where that shard was lost, the shard can be brought back online.
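A minimal sketch of this simpler variant, under the same illustrative in-memory model (all names hypothetical): the backup mutation log is replayed only up to the version at which the shard was lost, and the live log system is never consulted.

```python
def restore_readonly_shard(shard_range, snapshot, snapshot_version,
                           backup_log, lost_version):
    """No-blind-writes variant: replay the backup mutation log only up to
    the version at which the shard was lost, then bring it online.
    Illustrative model only; not an actual FDB API."""
    begin, end = shard_range
    # Start from the snapshot, restricted to this shard's key range.
    data = {k: v for k, v in snapshot.items() if begin <= k < end}
    # Apply only mutations newer than the snapshot and no later than the
    # version at which the shard was lost.
    for version, key, value in backup_log:
        if version <= snapshot_version or version > lost_version:
            continue
        if begin <= key < end:
            data[key] = value
    return data
```

Dropping the catch-up phase is what makes this variant so much simpler: there is no moving target, just a fixed version to reach.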

Also, without the writability requirement, it could be argued that a separate selective restore using the existing process is the route to take. That's up for debate, but I rather like the elegance of having DataDistribution start this process automatically after shards have been missing for some time, using the active backup on the default tag, and then cancel the process if any of the shard replicas come back online.

dongxinEric commented 4 years ago

This is sort of related to #1002; basically this is bulk loading a shard into one or more storage servers.

satherton commented 4 years ago

After thinking about this some more, the right process is probably just for DD to kick off a FastRestore of the lost shards into the cluster.

@dongxinEric Certainly related, though even without that improvement FastRestore (with some changes) could be used to restore missing shards in a live cluster; it would just be slower going through the log system.

There's still the complexity (or not) of supporting blind writes on the missing shards during the restore. If the shards will remain writable then FastRestore must continue pulling mutations from the backup until it catches up to the log system as described above.

xumengpanda commented 4 years ago

IMO, the key to restoring when several SSes are lost is to identify the shards whose replicas were all on the lost SSes.

This requires backing up the shard-to-SS mapping in the normal backup process.

When multiple SSes are lost, the fast restore can first restore the shard-to-SS mapping metadata, figure out which shards to restore, and restore them to another cluster or to the original cluster as usual.
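The shard-selection step could look roughly like this, assuming the shard-to-SS mapping was captured in the backup as described (names are illustrative, not actual FDB metadata):

```python
def shards_needing_restore(shard_to_servers, lost_servers):
    """Return the shards whose replicas were all on lost storage servers.
    Shards with at least one surviving replica can heal through normal
    data distribution and do not need a restore from backup.
    Illustrative model only; not an actual FDB API."""
    lost = set(lost_servers)
    return [shard for shard, replicas in shard_to_servers.items()
            if set(replicas) <= lost]
```

Only the shards this returns need the full restore path; everything else recovers from its surviving replicas.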